CN114219026A

CN114219026A - Data processing method and device and computer readable storage medium

Info

Publication number: CN114219026A
Application number: CN202111532415.XA
Authority: CN
Inventors: 弄庆鹏; 李忠良; 屠要峰; 周祥生
Original assignee: ZTE Corp
Current assignee: ZTE Corp
Priority date: 2021-12-15
Filing date: 2021-12-15
Publication date: 2022-03-22
Also published as: WO2023109588A1

Abstract

The invention discloses a data processing method, a device thereof and a computer readable storage medium, wherein the data processing method comprises the steps of receiving a data sample sent by a client, wherein the data sample comprises a sample label; obtaining consistency parameters between the sample labels of the adjacent data samples, determining quantity information of reasonable samples in the data samples according to the consistency parameters, and obtaining sample label rationality judgment values according to the quantity information; obtaining a target judgment value according to the sample label rationality judgment value; and obtaining a target judgment result according to the target judgment value. According to the scheme of the embodiment of the invention, the effectiveness of judging the quality of the data sample can be improved.

Description

Data processing method and device and computer readable storage medium

Technical Field

The present invention relates to the field of data processing, and in particular, to a data processing method and apparatus, and a computer-readable storage medium.

Background

In the field of artificial intelligence, particularly in a machine learning scene, the quality of a data sample directly determines the effectiveness of an algorithm model and the landing performance of the algorithm model in an actual scene, so that the effectiveness of design, collection and marking of the data sample needs to be verified in the process of model development, effective and constructive guidance is provided for optimization of data collection, the quality of a data sample set is improved, and the landing performance of the model in the actual scene is strengthened. However, there is no method provided in the related art that can effectively determine the quality of the data sample.

Disclosure of Invention

The following is a summary of the subject matter described in detail herein. This summary is not intended to limit the scope of the claims.

The embodiment of the invention provides a data processing method, a data processing device and a computer readable storage medium, which can improve the effectiveness of judging the quality of a data sample.

In a first aspect, an embodiment of the present invention provides a data processing method, including:

receiving a data sample sent by a client, the data sample comprising a sample label;

obtaining consistency parameters between the sample labels of the adjacent data samples, determining quantity information of reasonable samples in the data samples according to the consistency parameters, and obtaining sample label rationality judgment values according to the quantity information;

obtaining a target judgment value according to the sample label rationality judgment value;

and obtaining a target judgment result according to the target judgment value.

In a second aspect, an embodiment of the present invention further provides a data processing apparatus, including: a memory, a processor and a computer program stored on the memory and executable on the processor, which when executed by the processor implements the data processing method as described above.

In a third aspect, the embodiment of the present invention further provides a computer-readable storage medium, which stores computer-executable instructions for executing the data processing method described above.

The embodiment of the invention comprises the following steps: receiving a data sample sent by a client, wherein the data sample comprises a sample label; obtaining consistency parameters of sample labels between adjacent data samples, determining quantity information of reasonable samples in the data samples according to the consistency parameters, and obtaining sample label rationality judgment values according to the quantity information; obtaining a target judgment value according to the sample label rationality judgment value; and obtaining a target judgment result according to the target judgment value. According to the scheme of the embodiment of the invention, the data samples including the sample labels sent by the client are received, the quantity information of reasonable samples is obtained by utilizing the consistency parameters between the sample labels of the adjacent data samples, the sample label rationality judgment value is obtained according to the quantity information, and the obtained sample label rationality judgment value can reflect whether the setting of the sample labels of the data samples is reasonable or not, so that the quality of the data samples can be effectively judged, the target judgment value is obtained according to the sample label rationality judgment value, and the target judgment result is obtained according to the target judgment value, namely, the scheme of the embodiment of the invention can improve the effectiveness of the judgment on the quality of the data samples.

Additional features and advantages of the invention will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention. The objectives and other advantages of the invention will be realized and attained by the structure particularly pointed out in the written description and claims hereof as well as the appended drawings.

Drawings

The accompanying drawings are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification. They illustrate embodiments of the invention and, together with the embodiments, serve to explain the principles of the invention and not to limit the invention.

FIG. 1 is a flow chart of a data processing method provided by an embodiment of the invention;

FIG. 2 is a flowchart of a specific method of step S120 in FIG. 1;

FIG. 3 is a flowchart of a specific method of step S123 in FIG. 2;

FIG. 4 is a flow chart of another embodiment of a specific method of step S123 of FIG. 2;

FIG. 5 is a flow chart of yet another embodiment of the detailed method of step S123 in FIG. 2;

FIG. 6 is a flow chart of yet another embodiment of the detailed method of step S123 in FIG. 2;

FIG. 7 is a flow chart of a data processing method provided by another embodiment of the present invention;

FIG. 8 is a flowchart of a specific method of step S720 in FIG. 7;

FIG. 9 is a flow diagram of another embodiment of a specific method of step S720 in FIG. 7;

FIG. 10 is a flowchart of a specific method of step S840 in FIG. 8 or step S940 in FIG. 9;

FIG. 11 is a flow diagram of another embodiment of a specific method of step S840 of FIG. 8 or step S940 of FIG. 9;

FIG. 12 is a flowchart of a detailed method of step S1040 in FIG. 10;

FIG. 13 is a flowchart of a specific method of step S950 in FIG. 9;

FIG. 14 is a flow chart of a data processing method provided by yet another embodiment of the present invention;

FIG. 15 is a flowchart of a specific method of step S1410 in FIG. 14;

FIG. 16 is a flowchart of a specific method of step S140 in FIG. 1;

FIG. 17 is a flow chart of a data processing method according to a further embodiment of the invention;

FIG. 18 is an exemplary diagram of data samples for different scene types provided by another embodiment of the present invention;

FIG. 19 is an exemplary diagram of data samples for different scene types provided by yet another embodiment of the present invention;

FIG. 20 is a diagram showing an example of a change in model judgment index value with model complexity according to another embodiment of the present invention;

FIG. 21 is an exemplary illustration of a portion of wine data under a classification scenario provided by another embodiment of the present invention;

FIG. 22 is an exemplary diagram of part of the house price prediction data in a regression scenario provided by another embodiment of the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.

It should be noted that, although a logical order is illustrated in the flowcharts, in some cases, the steps illustrated or described may be performed in an order different from that in the flowcharts. The terms first, second and the like in the description and in the claims, and the drawings described above, are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order.

The invention provides a data processing method and a device thereof, and a computer readable storage medium, wherein the data processing method comprises the following steps: receiving a data sample sent by a client, wherein the data sample comprises a sample label; obtaining consistency parameters between sample labels of adjacent data samples, determining quantity information of reasonable samples in the data samples according to the consistency parameters, and obtaining sample label rationality judgment values according to the quantity information; obtaining a target judgment value according to the sample label rationality judgment value; and obtaining a target judgment result according to the target judgment value. According to the scheme of the embodiment of the invention, the data samples including the sample labels sent by the client are received, the quantity information of reasonable samples is obtained by utilizing the consistency parameters between the sample labels of the adjacent data samples, the sample label rationality judgment value is obtained according to the quantity information, and the obtained sample label rationality judgment value can reflect whether the setting of the sample labels of the data samples is reasonable or not, so that the quality of the data samples can be effectively judged, the target judgment value is obtained according to the sample label rationality judgment value, and the target judgment result is obtained according to the target judgment value, namely, the scheme of the embodiment of the invention can improve the effectiveness of the judgment on the quality of the data samples.

The embodiments of the present invention will be further explained with reference to the drawings.

As shown in fig. 1, fig. 1 is a flowchart of a data processing method according to an embodiment of the present invention, which may include, but is not limited to, step S110, step S120, step S130, and step S140.

Step S110: a data sample sent by a client is received, the data sample including a sample label.

In this step, there may be one or more data samples; when there are multiple data samples, what is received is a set of data samples. Each data sample includes a sample label. In an alternative embodiment, the data samples may be any data sample set in the related art, such as a house price prediction dataset or a wine dataset. The data sample may also include a description field with the sample label, and each data sample has one sample label.

It should be noted that the system architecture related in this embodiment may include a user computer and a server physical machine, where the user computer is a client, the client may be provided with a data uploading interface, the server physical machine is a server, and the server may provide a data sample determination service. The server may be a server in a local area network or a server in a network of the internet, and the data sample may be a data sample transmitted to the server through the client.

It should be noted that the sample label refers to a field name of the data sample, and for a data sample, the sample label column has a corresponding specific sample label value.

Step S120: consistency parameters between the sample labels of adjacent data samples are obtained, the quantity information of reasonable samples in the data samples is determined according to the consistency parameters, and a sample label rationality judgment value is obtained according to the quantity information.

In this step, each of the adjacent data samples has a corresponding sample label, and the consistency of the sample labels indicates whether the sample labels of these data samples are the same or similar. The consistency parameter is a parameter obtained by comparing the sample labels of adjacent data samples, and may be a quantity value or a coefficient of variation of the sample labels.

It should be noted that, in an alternative embodiment, when the data corresponding to the sample labels is discrete, that is, the possible values of the sample labels can be enumerated, the consistency of the sample labels indicates whether the sample labels are the same. When the data corresponding to the sample labels is continuous, that is, the values of the sample labels are continuous and cannot be enumerated, the consistency of the sample labels may be obtained by any method in the related art for determining the similarity between label values, for example, setting a range value or calculating a corresponding coefficient of variation, as long as it can be determined whether the label values are similar, which is not limited herein.

It should be further noted that reasonable samples are data samples whose labels have high rationality. Obtaining the sample label rationality judgment value from the quantity information of the reasonable samples makes the judgment of the quality of the data samples more effective.

Step S130: a target judgment value is obtained according to the sample label rationality judgment value.

In this step, the sample label rationality judgment value is the judgment value obtained from the consistency parameters of the sample labels of adjacent data samples. The target judgment value is obtained from the sample label rationality judgment value, for example by weighting it or by calculating with a certain coefficient. The target judgment value is obtained so that a target judgment result can be obtained from it in the subsequent step.

Step S140: a target judgment result is obtained according to the target judgment value.

In this step, the target judgment result is the final output of the scheme of the embodiment of the present invention. Because the target judgment result is obtained from the target judgment value, and the target judgment value is obtained from the sample label rationality judgment value, the target judgment result reflects a judgment of the rationality of the sample labels. The quality of the data samples is thereby judged, and the effectiveness of that judgment can be improved.

It should be noted that the target judgment result may be output in any form in the related art. In an alternative embodiment, the target judgment result may be a specific value, or a specific rating of the quality of the data samples, for example excellent or good, as long as it reflects the judgment of the quality of the data samples; the possibilities are not listed one by one here.

In this embodiment, with the data processing method including steps S110 to S140, a data sample including a sample label is received; a sample label rationality judgment value is obtained according to the consistency parameters of the sample labels between adjacent data samples; a target judgment value is obtained according to the sample label rationality judgment value; and a target judgment result is obtained according to the target judgment value. Because the sample label rationality judgment value reflects whether the sample labels of the data samples are set reasonably, the quality of the data samples can be judged effectively; that is, the scheme of the embodiment of the invention can improve the effectiveness of the judgment of the quality of the data samples.

It should be noted that the sample label rationality judgment value of the data samples is used to assess abnormal values of the sample labels, where an abnormal sample label value may be a noise value or a mislabeled value. If the proportion of abnormal sample label values in the data samples is too large, the model may overfit, which seriously affects the optimization of the machine learning model and the evaluation of that optimization. The rationality of the target values of the samples in the data set therefore needs to be judged, so as to improve the effectiveness of the judgment of the quality of the data samples.

In an embodiment, as shown in fig. 2, to further describe step S120, step S120 may include, but is not limited to, step S121, step S122, step S123, and step S124.

Step S121: the data samples are traversed to obtain the currently traversed target data sample.

In this step, the currently traversed target data sample is the data sample being visited at the current point of the traversal. In an optional embodiment, the data samples are read into an array and traversed by array index, and the currently traversed target data sample is the data sample at the current index of the array. The target data sample is obtained so that its rationality can be determined in the subsequent steps.

In an alternative embodiment, referring to fig. 21, if the target data sample is the wine data with serial number 2 in the wine data set, the sample label of the target data sample is 1, and sample features 1 to 13 are: 14.23, 1.71, 2.43, 15.6, 127, 2.8, 3.06, 0.28, 2.29, 5.64, 1.04, 3.92, 1065. Alternatively, referring to fig. 22, if the target data sample is the house price prediction data with serial number 2 in the house price prediction data set, the sample label of the target data sample is 4.526, and sample features 1 to 8 are: 8.3252, 41, 6.984126984, 1.023809124, 322, 2.555555556, 37.88, -122.23.

Step S122: neighboring data samples are acquired, the distance between a neighboring data sample and the target data sample being smaller than the distance between a non-neighboring data sample and the target data sample.

In this step, the number of neighboring data samples may be any number and may be preset. Because the distance between a neighboring data sample and the target data sample is smaller than the distance between a non-neighboring data sample and the target data sample, the neighboring data samples are the preset number of data samples closest to the target data sample. The neighboring data samples are obtained so that the rationality of the target data sample can be determined in the subsequent steps.

In an alternative embodiment, referring to fig. 21, the distance between the wine data with serial number 2 and the wine data with serial number 3 is calculated as: (14.23-13.2)² + (1.71-1.78)² + (2.43-2.14)² + (15.6-11.2)² + (127-100)² + (2.8-2.65)² + (3.06-2.76)² + (0.28-0.26)² + (2.29-1.28)² + (5.64-4.38)² + (1.04-1.05)² + (3.92-3.4)² + (1065-1050)² ≈ 977.50. The data samples nearest to the target data sample can be obtained by applying the same distance calculation to the other data samples.
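The distance calculation and neighbor selection of steps S121 and S122 can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of the embodiment: the function names `squared_distance` and `nearest_neighbors` are hypothetical, the squared Euclidean distance is used because that is the form of the sum written out above, and the two feature vectors are the thirteen values for serial numbers 2 and 3 that appear in that sum.

```python
from typing import List, Sequence


def squared_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Sum of squared per-feature differences between two samples."""
    if len(a) != len(b):
        raise ValueError("samples must have the same number of features")
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest_neighbors(samples: Sequence[Sequence[float]],
                      target_index: int, k: int) -> List[int]:
    """Indices of the k samples closest to samples[target_index].

    The target itself is excluded; k plays the role of the preset
    number of neighboring data samples.
    """
    target = samples[target_index]
    others = [i for i in range(len(samples)) if i != target_index]
    others.sort(key=lambda i: squared_distance(samples[i], target))
    return others[:k]


# Feature values taken from the terms of the sum in the text.
wine_2 = [14.23, 1.71, 2.43, 15.6, 127, 2.8, 3.06,
          0.28, 2.29, 5.64, 1.04, 3.92, 1065]
wine_3 = [13.2, 1.78, 2.14, 11.2, 100, 2.65, 2.76,
          0.26, 1.28, 4.38, 1.05, 3.4, 1050]

d_2_3 = squared_distance(wine_2, wine_3)
print(round(d_2_3, 2))  # about 977.50
```

Because the features are used unscaled in this sketch, large-valued features dominate the distance; whether the features are normalized beforehand is not stated in this section.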

Step S123: consistency parameters between the sample labels of the neighboring data samples and the sample label of the target data sample are acquired, and whether the target data sample is a reasonable sample is judged according to the consistency parameters.

In this step, a reasonable sample refers to a reasonable target data sample, that is, a target data sample whose sample label is the same as or similar to the sample label of an adjacent data sample.

It should be noted that the consistency parameter between the sample label of the neighboring data sample and the sample label of the target data sample refers to a parameter obtained by judging whether the sample label of the neighboring data sample is the same as or similar to the sample label of the target data sample, and whether the target data sample is a reasonable sample is judged according to the consistency parameter of the sample labels between the neighboring data samples, so as to obtain the quantity information of the reasonable samples in the subsequent steps.

Step S124: the quantity information of reasonable samples in the data samples is determined according to the judgment results of the target data samples.

In this step, the quantity information of the reasonable samples refers to the proportion of reasonable samples among all the data samples, and the sample label rationality judgment value is obtained according to this proportion. In an alternative embodiment, assuming that there are 178 data samples and the number of reasonable samples is 174, the sample label rationality judgment value may be: 174/178 ≈ 0.978.

It should be noted that, in an optional embodiment, all the data samples are traversed, and the determination result refers to a result obtained by determining whether the data sample is a reasonable sample, so that the number information of the reasonable samples in the data sample can be determined according to the determination result.

It should be further noted that, since the total number of the data samples is obtained by adding the number of the reasonable samples and the number of the data samples that are not the reasonable samples, in an optional embodiment, the number of the data samples that are not the reasonable samples is recorded in the traversal process, and a sample label rationality determination value can also be obtained according to the number of the data samples, which is not described herein again.

In this embodiment, by using the data processing method including the above-mentioned steps S121 to S124, the data sample is traversed, the currently traversed target data sample is obtained, the neighboring data sample is obtained, a distance between the neighboring data sample and the target data sample is smaller than a distance between the non-neighboring data sample and the target data sample, whether the target data sample is a reasonable sample is determined according to a consistency parameter between a sample label of the neighboring data sample and a sample label of the target data sample, and a sample label rationality determination value is obtained according to quantity information of the reasonable sample. According to the scheme of the embodiment of the invention, the sample label rationality judgment value is obtained through the consistency parameter of the sample labels between the adjacent data samples, so that the quality of the data samples is judged, and the aim of improving the effectiveness of the judgment on the quality of the data samples is fulfilled.
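Steps S121 to S124 can be summarized in a short sketch. It is only an illustration under stated assumptions: `label_rationality`, `neighbors_of`, and `is_reasonable` are hypothetical names, and the neighbor lookup and the per-sample judgment are passed in as functions because the text describes several alternatives for each.

```python
from typing import Callable, Hashable, List, Sequence


def label_rationality(
    labels: Sequence[Hashable],
    neighbors_of: Callable[[int], List[int]],
    is_reasonable: Callable[[Hashable, List[Hashable]], bool],
) -> float:
    """Proportion of samples judged reasonable (steps S121 to S124)."""
    if not labels:
        raise ValueError("at least one data sample is required")
    reasonable = 0
    for i, label in enumerate(labels):                      # S121: traverse
        neighbor_labels = [labels[j] for j in neighbors_of(i)]  # S122
        if is_reasonable(label, neighbor_labels):           # S123: judge
            reasonable += 1
    return reasonable / len(labels)                         # S124: proportion


# Toy example: four samples, every other sample counts as a neighbor,
# and a sample is reasonable if at least two neighbors share its label.
toy_labels = ["a", "a", "a", "b"]
value = label_rationality(
    toy_labels,
    neighbors_of=lambda i: [j for j in range(len(toy_labels)) if j != i],
    is_reasonable=lambda label, nbrs: nbrs.count(label) >= 2,
)
print(value)  # 0.75: the single "b" sample has no same-label neighbor
```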

In an embodiment, as shown in fig. 3, to further describe step S123, step S123 may include, but is not limited to, step S310, step S320, and step S330.

Step S310: the scene type of the target operation scene is obtained.

In this step, the scene type of the target operation scene may be a classification scene or a regression scene. The two scene types differ in that the sample labels in a classification scene are usually set or marked manually, whereas the sample labels in a regression scene are usually collected by a system rather than designed or calibrated manually, so the evaluation flows for the sample characteristic capability of the data set differ to some extent between the two scenes.

It should be noted that obtaining the scene type of the target operation scene can facilitate determining whether the target data sample is a reasonable sample in the subsequent steps.

Step S320: when the scene type is a classification scene, a quantity value is determined, the quantity value being the number of neighboring data samples whose sample label is the same as the sample label of the target data sample.

In this step, the quantity value refers to the number of neighboring data samples that have the same sample label as the target data sample. In an alternative embodiment, there are 10 neighboring data samples, of which 6 have the same sample label as the target data sample; the quantity value is then determined to be 6, or alternatively 6/10 = 0.6. The quantity value is determined so that whether the data sample is a reasonable sample can be judged in a subsequent step.

It should be noted that when the scene type is a classification scene, the sample labels of the data samples are discrete and their possible values can be enumerated, so the quantity value can be determined by checking whether the sample labels of the data samples are the same.

It should be further noted that when the sample label of a neighboring data sample is the same as the sample label of the target data sample, the sample label of the target data sample is consistent with data samples whose sample features are close to its own. Whether the target data sample is a reasonable sample is therefore judged according to the number of neighboring data samples sharing its sample label, so as to judge the quality of the data samples.

Step S330: the quantity value is determined as the consistency parameter between the sample labels of the neighboring data samples and the sample label of the target data sample.

In this step, the consistency parameter is used to determine quantity information of reasonable samples in the data samples, so as to obtain a sample label rationality judgment value according to the quantity information. When the scene type is a classification scene, the quantity value is determined as a consistency parameter, and the purpose of improving the effectiveness of judging the quality of the data sample can be achieved.

In this embodiment, by using the data processing method including steps S310 to S330, the scene type of the target operation scene is obtained; when the scene type is a classification scene, the quantity value, that is, the number of neighboring data samples whose sample label is the same as that of the target data sample, is determined; and the quantity value is determined as the consistency parameter between the sample labels of the neighboring data samples and the sample label of the target data sample. According to the scheme of the embodiment of the invention, in a classification scene, the number of neighboring data samples sharing the sample label of the target data sample is determined and used as the consistency parameter, from which the sample label rationality judgment value is obtained, thereby improving the effectiveness of the judgment of the quality of the data samples.
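For the classification scene, the consistency parameter of steps S320 and S330 amounts to a simple count. The sketch below reproduces the 10-neighbor example from the text; the function name `same_label_count` and the concrete label values 1 and 2 are assumptions made for illustration.

```python
from typing import Hashable, Sequence


def same_label_count(target_label: Hashable,
                     neighbor_labels: Sequence[Hashable]) -> int:
    """Number of neighbors whose label equals the target's label."""
    return sum(1 for label in neighbor_labels if label == target_label)


# 10 neighbors, 6 of which share the target's label (hypothetical values).
neighbor_labels = [1] * 6 + [2] * 4
count = same_label_count(1, neighbor_labels)
fraction = count / len(neighbor_labels)
print(count, fraction)  # 6 0.6
```

Either the raw count or the fraction can serve as the quantity value, as the text allows; the preset threshold it is compared against would need to be expressed in the same form.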

In an embodiment, as shown in fig. 4, step S123 is further described, and step S123 may further include, but is not limited to, step S410 and step S420.

Step S410: if the quantity value is greater than or equal to the preset quantity threshold, the target data sample is determined to be a reasonable sample.

In this step, the preset quantity threshold may be set by a user or by a system administrator. When the quantity value is greater than or equal to the preset quantity threshold, that is, when enough neighboring data samples have the same sample label as the target data sample, the target data sample is determined to be a reasonable sample. In an alternative embodiment, referring to fig. 21, assume that the data sample with serial number 2 is the target data sample, the preset quantity threshold is 3, and the neighboring data samples are the wine data with serial numbers 3 to 8. If the sample labels of the wine data with serial numbers 3, 4, 5, 6, and 8 are the same as the sample label of the wine data with serial number 2, and the sample label of the wine data with serial number 7 is different, then the quantity value is 5, which is not less than the threshold, and the wine data with serial number 2 is a reasonable sample.

Step S420: if the quantity value is smaller than the preset quantity threshold, the target data sample is determined not to be a reasonable sample.

In this step, when the quantity value is smaller than the preset quantity threshold, that is, when too few neighboring data samples have the same sample label as the target data sample, the target data sample is determined not to be a reasonable sample. In an alternative embodiment, when there is more than one neighboring data sample, the quantity value of neighboring data samples whose sample label is the same as that of the target data sample is determined; the target data sample is not a reasonable sample when the quantity value is smaller than the preset quantity threshold, and is a reasonable sample when the quantity value is greater than or equal to the preset quantity threshold. Referring to fig. 19, in the data samples of the classification scene, identical shapes represent data samples with the same sample label. The sample labels of the circled data sample A and data sample B differ from those of their neighboring data samples, so their quantity values are smaller than the preset quantity threshold, and neither data sample A nor data sample B is a reasonable sample.

In this embodiment, by using the data processing method including the steps S410 to S420, if the number value is greater than or equal to the preset number threshold, it is determined that the target data sample is a reasonable sample; or, if the quantity value is smaller than the preset quantity threshold, determining that the target data sample is not a reasonable sample. According to the scheme of the embodiment of the invention, whether the target data sample is a reasonable sample is judged by comparing the quantity value with the preset quantity threshold value, so that the quality of the data sample is judged, and the aim of improving the effectiveness of judging the quality of the data sample is fulfilled.
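The threshold comparison of steps S410 and S420 can be sketched as below, using the wine example from the text (target serial number 2, neighbors with serial numbers 3 to 8, threshold 3). The function name `is_reasonable_sample` is hypothetical, and the label value 2 assigned to serial number 7 is an assumption; the text only says that its label differs from that of the target.

```python
from typing import Hashable, Sequence


def is_reasonable_sample(target_label: Hashable,
                         neighbor_labels: Sequence[Hashable],
                         threshold: int) -> bool:
    """True when at least `threshold` neighbors share the target's label."""
    quantity = sum(1 for label in neighbor_labels if label == target_label)
    return quantity >= threshold  # S410 if True, S420 if False


target_label = 1  # label of the wine data with serial number 2
# Labels of serial numbers 3..8; serial number 7 differs (value assumed).
labels_3_to_8 = [1, 1, 1, 1, 2, 1]

print(is_reasonable_sample(target_label, labels_3_to_8, threshold=3))  # True
print(is_reasonable_sample(target_label, [2, 2, 2, 1, 2, 2], threshold=3))  # False
```

The comparison uses "greater than or equal to", matching step S410, so a quantity value exactly equal to the threshold still counts as reasonable.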

In an embodiment, as shown in fig. 5, the step S123 is further described, and the step S123 may further include, but is not limited to, the step S510, the step S520, and the step S530.

Step S510, a scene type of the target operation scene is obtained.

Step S520: when the scene type is a regression scene, the variation coefficient of the sample label of the adjacent data sample is obtained according to the sample label of the adjacent data sample and the sample label of the target data sample.

In this step, the coefficient of variation may be obtained by dividing the standard deviation of the sample labels of the neighboring data samples by their mean value; the larger the coefficient of variation of the sample labels of the neighboring data samples, the greater the probability that the sample label of the target data sample is an abnormal value. When the scene type is a regression scene, the sample labels of the data samples are continuous and their possible values cannot be enumerated exhaustively. In an optional embodiment, the coefficient of variation of the sample labels of the neighboring data samples is obtained according to the sample labels of the neighboring data samples and the sample label of the target data sample, and the obtained coefficient of variation is used to determine the consistency parameter in the subsequent step.

It should be noted that the variation coefficient can eliminate the influence of the measurement scale and dimension, and when there is more than one adjacent data sample, the variation coefficient can be calculated according to all sample labels of the adjacent data samples, so that the rationality of the target data sample can be determined.
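Taking the coefficient of variation as the standard deviation of the sample labels divided by their mean, as in the worked example of this description, the independence from the measurement scale can be written out as below. The symbols $y_i$, $n$, $\mu$, $\sigma$ and $a$ are notation introduced here for illustration, and the population form of the standard deviation is an assumption.

```latex
c_v = \frac{\sigma}{\mu}, \qquad
\mu = \frac{1}{n}\sum_{i=1}^{n} y_i, \qquad
\sigma = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \mu\right)^2}

% Rescaling every label by a constant a > 0 (for example, a change of unit)
% scales both \sigma and \mu by a, so the ratio is unchanged:
c_v(a\,y) = \frac{a\,\sigma}{a\,\mu} = c_v(y)
```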

Step S530: the coefficient of variation is determined as a consistency parameter between the sample labels of the neighboring data samples and the sample label of the target data sample.

In this step, the consistency parameter is used to determine quantity information of reasonable samples in the data samples, so as to obtain a sample label rationality judgment value according to the quantity information. When the scene type is a regression scene, the variation coefficient is determined as the consistency parameter, and the purpose of improving the effectiveness of judging the quality of the data sample can be achieved.

In this embodiment, by using the data processing method including the steps S510 to S530, the scene type of the target operation scene is obtained; when the scene type is a regression scene, the coefficient of variation of the sample labels of the neighboring data samples is obtained according to the sample labels of the neighboring data samples and the sample label of the target data sample; and the coefficient of variation is determined as the consistency parameter between the sample labels of the neighboring data samples and the sample label of the target data sample. According to the scheme of the embodiment of the invention, in a regression scene the possible values of the sample label cannot be enumerated, so the coefficient of variation is determined as the consistency parameter, from which the sample label rationality judgment value is obtained, and the purpose of improving the effectiveness of judging the quality of the data sample is achieved.

In an embodiment, as shown in fig. 6, to further describe step S123, step S123 may include, but is not limited to, step S610 and step S620.

Step S610: if the coefficient of variation is smaller than or equal to a preset coefficient threshold, determine that the target data sample is a reasonable sample.

In this step, the preset coefficient threshold refers to a preset threshold for the coefficient of variation. Since a larger coefficient of variation indicates a higher probability that the sample label of the target data sample is an abnormal value, the target data sample is determined to be a reasonable sample if the coefficient of variation is smaller than or equal to the preset coefficient threshold. In an alternative embodiment, referring to fig. 22, the house price prediction data with serial number 2 is the target data sample, the preset coefficient threshold is equal to 0.5, and the house price prediction data with serial numbers 3, 4, 5, 6, and 7 are the neighboring data samples, whose sample labels are 3.585, 3.521, 3.413, 3.422, and 2.697, respectively. The coefficient of variation is calculated either by dividing the standard deviation of the sample labels by the mean value of the sample labels, or as (sample label of the target data sample - sample label mean value) / sample label mean value. Using the latter formula and denoting the coefficient of variation as var:

$$\mathrm{var} = \frac{y_{t} - \bar{y}}{\bar{y}}$$

where $y_{t}$ is the sample label of the target data sample and $\bar{y}$ is the sample label mean value.

Since the value of the coefficient of variation var is smaller than the preset coefficient threshold, the house price prediction data with serial number 2 is a reasonable sample.

Step S620: if the coefficient of variation is larger than the preset coefficient threshold, determine that the target data sample is not a reasonable sample.

In this step, the preset coefficient threshold refers to a preset threshold for the coefficient of variation. Since a larger coefficient of variation indicates a higher probability that the sample label of the target data sample is an abnormal value, the target data sample is determined not to be a reasonable sample if the coefficient of variation is larger than the preset coefficient threshold. In an alternative embodiment, referring to fig. 19, in the data samples of the regression scene, data sample C and data sample D lie far from the regression line, that is, the sample labels of these target data samples differ greatly from those of their neighboring data samples. The obtained coefficient of variation is therefore larger than the preset coefficient threshold, and neither data sample C nor data sample D is a reasonable sample.

In this embodiment, by using the data processing method including the steps S610 to S620, if the coefficient of variation is smaller than or equal to the preset coefficient threshold, it is determined that the target data sample is a reasonable sample, and if the coefficient of variation is greater than the preset coefficient threshold, it is determined that the target data sample is not a reasonable sample. According to the scheme of the embodiment of the invention, in a regression scene, whether the target data sample is a reasonable sample is judged by comparing the variation coefficient with the preset coefficient threshold value, so that the quality of the data sample is judged, and the aim of improving the effectiveness of the judgment on the quality of the data sample is fulfilled.
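The regression-scene decision of steps S610 to S620 can be sketched as follows, using the standard-deviation-over-mean form of the coefficient of variation. This is an illustrative sketch, not the definitive implementation: the function names are hypothetical, the population standard deviation is an assumption, and whether the target sample label is included in the label list is left to the caller. The labels and the threshold of 0.5 are those of the neighboring data samples in the example above.

```python
from statistics import mean, pstdev
from typing import Sequence


def coefficient_of_variation(labels: Sequence[float]) -> float:
    """Standard deviation divided by mean (population form is an assumption)."""
    mu = mean(labels)
    if mu == 0:
        raise ValueError("coefficient of variation is undefined for a zero mean")
    return pstdev(labels) / abs(mu)


def is_reasonable_regression(labels: Sequence[float],
                             coefficient_threshold: float) -> bool:
    """S610/S620: reasonable when the coefficient does not exceed the threshold."""
    return coefficient_of_variation(labels) <= coefficient_threshold


# Sample labels of the neighboring data samples with serial numbers 3 to 7.
neighbor_labels = [3.585, 3.521, 3.413, 3.422, 2.697]
cv = coefficient_of_variation(neighbor_labels)
reasonable = is_reasonable_regression(neighbor_labels, 0.5)
print(round(cv, 4), reasonable)
```

With these labels the coefficient stays well below 0.5, which agrees with the conclusion of the example that the target data sample is a reasonable sample.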

In an embodiment, as shown in fig. 7, the data processing method is further described, and the data processing method may further include, but is not limited to, step S710, step S720, and step S730.

Step S710: obtain the scene type of the target operation scene.

It should be noted that obtaining the scene type of the target operation scene can facilitate the determination of the quality of the data sample in the subsequent steps, for example: whether the sample labels of the data samples are continuous or not can be determined through the scene types, so that the sample label rationality judgment value can be obtained through a specific calculation mode.

Step S720: obtain a target sample characteristic characterization capability judgment value according to the scene type and the data sample comprising the sample feature.

In this step, the sample feature refers to a sample feature field included in the data sample, and one data sample may include one or more sample features. In an optional embodiment, the target sample characteristic characterization capability judgment value is obtained according to a scene type and the data sample, and means that the data sample is correspondingly processed according to different scene types, so that the target sample characteristic characterization capability judgment value of the data sample is obtained.

Step S730: obtaining a target judgment value according to the sample label rationality judgment value comprises: obtaining the target judgment value according to the sample label rationality judgment value and the target sample characteristic characterization capability judgment value.

In this step, the target judgment value is obtained according to the sample label rationality judgment value and the target sample characteristic characterization capability judgment value. The two values may be used directly as components of the target judgment value, or they may first be weighted and then used as components of the target judgment value.

In this embodiment, by using the data processing method including the steps S710 to S730, the scene type of the target operation scene is obtained, the target sample characteristic characterization capability judgment value is obtained according to the scene type and the data sample comprising the sample feature, and the target judgment value is obtained according to the sample label rationality judgment value and the target sample characteristic characterization capability judgment value. According to the scheme of the embodiment of the invention, the data sample comprising the sample feature is processed according to the scene type to obtain the target sample characteristic characterization capability judgment value, and the target judgment value is then obtained according to the sample label rationality judgment value and the target sample characteristic characterization capability judgment value, so that the aim of improving the effectiveness of judging the quality of the data sample is fulfilled.
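One way to read the weighting option of step S730 is as a weighted sum of the two component values. The sketch below assumes exactly that; the function name, the default weights of 0.5 and the example inputs are hypothetical and are not taken from the embodiment.

```python
def target_judgment_value(label_rationality: float,
                          feature_capability: float,
                          weight_label: float = 0.5,
                          weight_feature: float = 0.5) -> float:
    """Weighted combination of the two component judgment values (assumed form)."""
    if weight_label < 0 or weight_feature < 0:
        raise ValueError("weights must be non-negative")
    return weight_label * label_rationality + weight_feature * feature_capability


# Hypothetical component values for one data sample set.
combined = target_judgment_value(0.8, 0.6)
print(combined)
```

Setting both weights to 1 reduces this to using the two values directly as additive components.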

In an embodiment, as shown in fig. 8, the step S720 is further described, and the step S720 may include, but is not limited to, the steps S810, S820, S830, S840, and S850.

Step S810: when the scene type is a classification scene, obtain a neural network model.

In this step, when the scene type is a classification scene, the sample labels of the data samples are discrete and, as shown by the data samples of the classification scene in fig. 18, the possible sample label values can be enumerated exhaustively. The neural network model may be any neural network model in the related art. In an alternative embodiment, the neural network model is a neural network model for machine learning; any model that can be trained on the data samples and can perform prediction on them may be used, and the options are not listed one by one herein.

It should be noted that the obtaining of the neural network model is to facilitate training and prediction of the neural network model in subsequent steps, and to achieve the purpose of judging the characterization capability of the sample features of the data sample.

Step S820: train the neural network model with the data sample to obtain a training result.

In this step, the neural network model may be trained with the data sample by any training process in the related art. In an optional implementation, the data sample is input into the neural network model together with preset related parameters, such as the batch information and the learning rate, and training is then performed to obtain a training result. The training result may be the prediction model output after the training is finished, which is used for prediction on the data sample in the subsequent step.

Step S830: perform prediction with the neural network model according to the data sample and the training result to obtain a prediction result, and obtain a model judgment index value according to the prediction result.

In this step, prediction is performed with the neural network model according to the data sample and the training result to obtain a prediction result. In an optional embodiment, the data sample may be input into the prediction model given by the training result to obtain the prediction result. The prediction result may be the predicted sample label value. The model judgment index value is obtained according to the prediction result; in an alternative embodiment, the predicted sample label value may be compared with the actual sample label of the data sample, and the difference or the ratio between the two may be used as the model judgment index value.

The model determination index value may be an index value such as accuracy and recall in the related art, as long as the accuracy of prediction of the data sample by the neural network model can be expressed.

Step S840: determine a first sample feature characterization capability judgment value according to the model judgment index value.

In this step, the first sample feature characterization capability judgment value is determined according to the model judgment index value, and may in particular be determined according to the change information of the model judgment index value. In an optional embodiment, the number of times that the model judgment index value is greater than a preset index threshold is accumulated. A model judgment index value greater than the preset index threshold indicates that the sample features of the data sample have good characterization capability for the sample label of the data sample, and the accumulated number of times may be used as the first sample feature characterization capability judgment value.

It should be noted that the neural network model may be trained with the data sample and used for prediction multiple times, so as to obtain multiple model judgment index values; the number of times is not specifically limited herein.

It should be further noted that, when training and prediction are performed multiple times, the neural network model may be changed between rounds, for example by increasing the complexity of the neural network model or by replacing it with a different neural network model, so as to reflect the characterization capability of the sample features of the data sample for the sample label of the data sample.

Step S850: determine the first sample characteristic characterization capability judgment value as the target sample characteristic characterization capability judgment value.

In this step, in a classification scene, the first sample characteristic characterization capability judgment value is obtained after the neural network model has been trained with the data sample and used for prediction, and it is used directly as the target sample characteristic characterization capability judgment value. It thereby reflects the characterization capability of the sample features of the data sample for the sample label of the data sample, and the purpose of judging the quality of the data sample is achieved.

In this embodiment, by using the data processing method including the steps S810 to S850, when the scene type is a classification scene, the neural network model is obtained; the neural network model is trained with the data sample to obtain a training result; prediction is performed with the neural network model according to the data sample and the training result to obtain a prediction result, and a model judgment index value is obtained according to the prediction result; a first sample feature characterization capability judgment value is determined according to the model judgment index value; and the first sample feature characterization capability judgment value is determined as the target sample feature characterization capability judgment value. According to the scheme of the embodiment of the invention, when the scene type is a classification scene, the model judgment index value is obtained by training the neural network model with the data sample and predicting with it, the first sample feature characterization capability judgment value is obtained according to the magnitude or the change information of the model judgment index value, and it is then used as the target sample feature characterization capability judgment value, thereby achieving the purpose of improving the effectiveness of judging the quality of the data sample.
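The counting variant described for step S840 can be sketched without committing to any particular neural network library. In the sketch below the predictions of each training-and-prediction round are supplied as plain lists, accuracy is used as the model judgment index value, and the function names, example data and threshold of 0.7 are all hypothetical.

```python
from typing import Hashable, List, Sequence


def accuracy(predicted: Sequence[Hashable], actual: Sequence[Hashable]) -> float:
    """Fraction of predicted sample labels that equal the actual sample labels."""
    if not actual or len(predicted) != len(actual):
        raise ValueError("predicted and actual must be non-empty and of equal length")
    return sum(1 for p, a in zip(predicted, actual) if p == a) / len(actual)


def count_above_threshold(index_values: Sequence[float],
                          index_threshold: float) -> int:
    """Number of rounds whose model judgment index value exceeds the threshold."""
    return sum(1 for value in index_values if value > index_threshold)


# Hypothetical actual labels and the predictions of three rounds.
actual_labels = [0, 1, 1, 0]
round_predictions: List[List[int]] = [[0, 1, 0, 1], [0, 1, 1, 1], [0, 1, 1, 0]]

index_values = [accuracy(pred, actual_labels) for pred in round_predictions]
first_capability = count_above_threshold(index_values, 0.7)
print(index_values, first_capability)
```

In practice each list of predictions would come from the prediction model produced in step S820; recall or another index value could replace accuracy without changing the counting logic.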

In an embodiment, as shown in fig. 9, the step S720 is further described, and the step S720 may include, but is not limited to, the steps S910, S920, S930, S940, S950, and S960.

Step S910: when the scene type is a regression scene, obtain a neural network model.

In this step, when the scene type is a regression scene, the sample labels of the data samples are continuous and, as shown by the data samples of the regression scene in fig. 18, the possible sample label values cannot be enumerated exhaustively. The neural network model may be any neural network model in the related art. In an alternative embodiment, the neural network model is a neural network model for machine learning; any model that can be trained on the data samples and can perform prediction on them may be used, and the options are not listed one by one herein.

Step S920: train the neural network model with the data sample to obtain a training result.

Step S930: perform prediction with the neural network model according to the data sample and the training result to obtain a prediction result, and obtain a model judgment index value according to the prediction result.

Step S940: determine a first sample feature characterization capability judgment value according to the model judgment index value.

Step S950: obtain a second sample characteristic characterization capability judgment value according to the correlation information between the sample feature and the sample label.

In this step, the correlation information between the sample feature and the sample label may be a correlation coefficient between the sample feature and the sample label, for example the Pearson coefficient or the Spearman coefficient. The higher the correlation between the sample feature and the sample label, the more strongly the sample feature supports prediction of the sample label.

It should be noted that, in an optional embodiment, when the second sample feature characterization capability judgment value is obtained according to the correlation information between the sample feature and the sample label, the numerical value of the correlation coefficient may be used as the second sample feature characterization capability judgment value. Alternatively, a correlation threshold may be preset, the number of times that the correlation coefficient between a sample feature and the sample label is greater than the correlation threshold may be counted, and that number of times may be used as the second sample feature characterization capability judgment value.

Step S960: obtain a target sample characteristic characterization capability judgment value according to the first sample characteristic characterization capability judgment value and the second sample characteristic characterization capability judgment value.

In this step, the target sample feature characterization capability judgment value is obtained according to the first sample feature characterization capability judgment value and the second sample feature characterization capability judgment value, for example by adding the two values or by weighting them. The quality of the data sample is then judged according to the characterization capability of the sample feature for the sample label.

In this embodiment, by using the data processing method including the steps S910 to S960, when the scene type is a regression scene, a neural network model is obtained; the neural network model is trained with the data sample to obtain a training result; prediction is performed with the neural network model according to the data sample and the training result to obtain a prediction result, and a model judgment index value is obtained according to the prediction result; a first sample feature characterization capability judgment value is determined according to the model judgment index value; a second sample feature characterization capability judgment value is obtained according to the correlation information between the sample feature and the sample label; and a target sample feature characterization capability judgment value is obtained according to the first sample feature characterization capability judgment value and the second sample feature characterization capability judgment value. According to the scheme of the embodiment of the invention, when the scene type is a regression scene, the model judgment index value is obtained by training the neural network model with the data sample and predicting with it, so that the first sample feature characterization capability judgment value is obtained according to the magnitude or the change information of the model judgment index value, and the second sample feature characterization capability judgment value is obtained according to the correlation information between the sample feature and the sample label. The two values reflect the characterization capability of the sample feature for the sample label from two aspects, namely the training and prediction of the neural network model and the correlation between the sample feature and the sample label, and thereby reflect the judgment on the quality of the data sample. The target sample feature characterization capability judgment value is then obtained according to the first sample feature characterization capability judgment value and the second sample feature characterization capability judgment value, which achieves the purpose of improving the effectiveness of judging the quality of the data sample.

In an embodiment, as shown in fig. 10, step S840 or step S940 is further described, and step S840 or step S940 may include, but is not limited to, step S1010, step S1020, step S1030, and step S1040.

Step S1010: if the current iteration number is less than the preset iteration number, increase the complexity of the neural network model.

In this step, a current iteration number smaller than the preset iteration number indicates that the complexity of the current neural network model still needs to be increased, so the complexity of the neural network model is increased. The complexity may be increased by any means in the related art, for example by increasing the number of convolutional layers in the neural network model, by replacing the neural network model with one of higher complexity, or by increasing the number of neurons in each layer of the neural network; any means that increases the complexity of the neural network model may be used, and the options are not listed one by one herein.

Step S1020: train the neural network model with the increased complexity by using the data sample, and update the training result.

In this step, the neural network model with the increased complexity is trained with the data sample so as to update the training result. The training result may be the prediction model obtained through training; updating it allows the model with the increased complexity to perform prediction on the data sample in the subsequent step.

Step S1030: perform prediction with the neural network model with the increased complexity according to the training result to obtain a new model judgment index value.

In this step, prediction is performed with the neural network model with the increased complexity according to the training result, so as to obtain a new model judgment index value. The model judgment index value may be any index value in the related art; for example, a classification scene may use the accuracy and the recall as model judgment index values, and a regression scene may use the R² value. The new model judgment index value is used to update the first statistical value in the subsequent step.

Step S1040: update the first statistical value according to the change information of the model judgment index value, wherein the first statistical value is set according to the model judgment index value.

In this step, the first statistical value is used to count the cases in which the sample features of the data sample show strong characterization capability, so that the first sample feature characterization capability judgment value can be obtained in the subsequent steps.

It should be noted that, the value of the first statistical value may be initialized to 0, and the first statistical value is set according to the model judgment index value, and in an optional embodiment, when the increment of the model judgment index value is greater than 0 or the model judgment index value is greater than a preset model evaluation index threshold, the first statistical value is accumulated.

It should be noted that, since the model judgment index value is updated, the change information of the model judgment index value may be an increment of the model judgment index value, or may be a judgment of whether the model judgment index value is greater than a preset model judgment index threshold value, so as to update the first statistical value.

In this embodiment, by using the data processing method including the steps S1010 to S1040, if the current iteration number is less than the preset iteration number, the complexity of the neural network model is increased; the neural network model with the increased complexity is trained with the data sample and the training result is updated; prediction is performed with the neural network model with the increased complexity according to the training result to obtain a new model judgment index value; and the first statistical value is updated according to the change information of the model judgment index value, where the first statistical value is set according to the model judgment index value. According to the scheme of the embodiment of the invention, the first statistical value is obtained from the model judgment index values produced by sample label prediction as the complexity of the neural network model increases, so that the aim of improving the effectiveness of judging the quality of the data sample is fulfilled.

It is worth noting that, referring to fig. 20, the first sample feature characterization capability judgment value is obtained by using the correspondence between the increase in complexity of the neural network model and the increase in sample label prediction capability. If the sample features of a data sample have sufficiently strong characterization capability for the sample label, then as the neural network model evolves from low complexity to high complexity, the fitting of the sample features to the sample label progresses from under-fitting through ideal fitting to over-fitting. This means that, as the complexity of the neural network model increases, the model evaluation index value increases significantly and can reach a certain threshold requirement; the relationship curve between the model complexity and the sample label prediction is shown in graph A of fig. 20. If the sample features of the data sample have weak characterization capability for the sample label, then as the model evolves from low complexity to high complexity, the fitting of the sample features to the sample label does not change significantly and remains under-fitted. This means that, as the complexity of the neural network model increases, the evaluation index of the sample label prediction remains unsatisfactory: it oscillates while its overall range stays unchanged, as shown in graph B of fig. 20.

In an embodiment, as shown in fig. 11, step S840 or step S940 is further described, and step S840 or step S940 may further include, but is not limited to, step S1110.

Step S1110: if the current iteration number is equal to the preset iteration number, obtain a first sample characteristic characterization capability judgment value according to a first statistical value and the current iteration number, wherein the first statistical value is set according to the model judgment index value.

It should be further noted that, if the current iteration number is equal to the preset iteration number, which indicates that the current iteration number meets the preset iteration requirement, the iteration is stopped. Because the first statistical value is set by accumulating after each iteration, the first sample characteristic characterization capability judgment value can be obtained according to the first statistical value and the current iteration number, that is, the first sample characteristic characterization capability judgment value is obtained according to the complexity of the neural network model and the sample label prediction capability, so that the purpose of improving the effectiveness of the judgment on the quality of the data sample is achieved. In an alternative embodiment, the first sample characterization capability judgment value is equal to the first statistical value divided by twice the preset number of iterations.

In this embodiment, by using the data processing method including the step S1110, if the current iteration number is equal to the preset iteration number, a first sample feature characterization capability judgment value is obtained according to a first statistical value and the current iteration number, where the first statistical value is set according to the model judgment index value. According to the scheme of the embodiment of the invention, if the current iteration number is equal to the preset iteration number, the complexity of the neural network model is no longer increased, and the first sample feature characterization capability judgment value is obtained according to the first statistical value and the current iteration number, so that the purpose of obtaining the first sample feature characterization capability judgment value of the data sample is achieved.

In an embodiment, as shown in fig. 12, the step S1040 is further described, and the step S1040 may include, but is not limited to, the steps S1041 and S1042.

Step S1041: if the model judgment index value becomes larger, accumulate the first statistical value to obtain an updated first statistical value.

In this step, in the current iteration, the model judgment index value of the previous iteration may be additionally stored in a temporary variable, and the increment of the model judgment index value is a value obtained by subtracting the model judgment index value of the previous iteration from the new model judgment index value. If the increment of the model judgment index value is larger than 0, namely the model judgment index value is increased, the first statistical value is accumulated to obtain the updated first statistical value.

Step S1042: if the model judgment index value is larger than the preset judgment index value, accumulate the first statistical value to obtain an updated first statistical value.

In this step, if the model judgment index value is greater than the preset judgment index value, it means that the fitting of the current sample feature to the sample label reaches an ideal fitting state or an over-fitting state in the current iteration, which indicates that the current model judgment index value is reasonable, so that the first statistical value is accumulated to obtain the updated first statistical value, that is, the characterization ability of the current sample feature to the sample label is strong.

In this embodiment, with the data processing method including the above steps S1041 to S1042, if the model judgment index value increases, the first statistical value is accumulated to obtain the updated first statistical value, and if the model judgment index value is greater than the preset judgment index value, the first statistical value is accumulated to obtain the updated first statistical value. According to the scheme of the embodiment of the invention, the first statistical value is set according to the model judgment index value, that is, the cases in which the sample features characterize the sample label strongly are counted, so that the first sample feature characterization capability judgment value can be obtained and the effectiveness of judging the quality of the data sample can be improved.
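The counting rule of steps S1041 and S1042, together with the final value of step S1110, can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of the embodiment: the function name `first_characterization_value`, the list `metric_history` (one model judgment index value per iteration, for example an accuracy), the increment of one per counted iteration, counting an iteration once even when both conditions hold, and the final division by the number of iterations are all hypothetical. The text only states that the value is obtained "according to" the first statistical value and the current iteration number.

```python
from typing import Optional, Sequence


def first_characterization_value(
    metric_history: Sequence[float],
    preset_metric: float,
) -> float:
    """Hypothetical sketch of steps S1041, S1042 and S1110."""
    if not metric_history:
        raise ValueError("metric_history must not be empty")

    first_stat = 0                      # the "first statistical value"
    previous: Optional[float] = None    # temporary variable for the previous metric

    for metric in metric_history:
        # S1041: the metric increased compared with the previous iteration.
        increased = previous is not None and metric - previous > 0
        # S1042: the metric exceeds the preset judgment index value.
        above_preset = metric > preset_metric
        if increased or above_preset:
            first_stat += 1
        previous = metric

    # S1110 (assumed form): share of counted iterations among all iterations.
    return first_stat / len(metric_history)
```

Under these assumptions, a value close to 1 means that the model judgment index kept improving or stayed above the preset value in most iterations.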

In an embodiment, as shown in fig. 13, step S950 is further described, and step S950 may include, but is not limited to, step S951, step S952, step S953, and step S954.

Step S951: traversing the sample characteristics of the data sample to obtain the currently traversed target sample characteristics.

In this step, since the data sample includes the sample feature, the sample feature of the data sample can be traversed to obtain the target sample feature in the currently traversed data sample, which is convenient for calculating the correlation coefficient in the subsequent step.

Step S952: obtaining a correlation coefficient of the target sample characteristic and the sample label.

In this step, the correlation coefficient of the target sample feature and the sample label may be a pearson coefficient or a spearman coefficient, and the correlation coefficient may be obtained by calculation, which is convenient for setting the second statistical value in the subsequent step.

Step S953: setting a second statistical value according to the correlation coefficient.

In this step, the second statistical value is set according to the correlation coefficient between the target sample feature and the sample label. It counts the number of times, during the traversal, that the correlation coefficient is greater than the correlation threshold: each time the coefficient exceeds the threshold, the second statistical value is accumulated. The second statistical value thus represents the correlation between the sample features and the sample label.

Step S954: obtaining a second sample characteristic characterization capability judgment value according to the second statistical value and the total number of the sample characteristics.

In this step, after the sample features have been traversed, a second sample feature characterization capability judgment value is obtained according to the second statistical value and the total number of the sample features. In an optional implementation, the second sample feature characterization capability judgment value may be equal to the second statistical value divided by the total number of the sample features. Since the second statistical value reflects the correlation between the sample features and the sample label, the resulting value reflects the characterization capability of the sample features for the sample label, thereby improving the effectiveness of judging the quality of the data samples.

In this embodiment, with the data processing method including the above steps S951 to S954, the sample features of the data sample are traversed to obtain the currently traversed target sample feature, the correlation coefficient between the target sample feature and the sample label is obtained, the second statistical value is set according to the correlation coefficient, and the second sample feature characterization capability judgment value is obtained according to the second statistical value and the total number of the sample features. According to the scheme of the embodiment of the invention, the second sample feature characterization capability judgment value is obtained according to the correlation between the target sample feature and the sample label, so that the effectiveness of judging the quality of the data sample can be improved.
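Steps S951 to S954 can be sketched as follows. This is a minimal illustration rather than the implementation of the embodiment: the function names `pearson` and `second_characterization_value`, the default threshold of 0.5, and the comparison of the absolute value of the coefficient with the threshold (so that strong negative correlations are also counted) are assumptions. The Pearson coefficient is computed by hand here; a Spearman coefficient could be substituted.

```python
import math
from typing import Mapping, Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient; returns 0.0 if either input is constant."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("inputs must have equal length of at least 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def second_characterization_value(
    features: Mapping[str, Sequence[float]],
    labels: Sequence[float],
    correlation_threshold: float = 0.5,  # hypothetical default
) -> float:
    """Hypothetical sketch of steps S951 to S954."""
    if not features:
        raise ValueError("at least one sample feature is required")
    second_stat = 0                                   # the "second statistical value"
    for column in features.values():                  # S951: traverse the features
        r = pearson(column, labels)                   # S952: correlation coefficient
        if abs(r) > correlation_threshold:            # S953 (abs() is an assumption)
            second_stat += 1
    return second_stat / len(features)                # S954: divide by feature count
```

For example, with one feature that is perfectly correlated with the label and one that is only weakly correlated, the sketch returns 0.5.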

In an embodiment, as shown in fig. 14, the data processing method is further described, and the data processing method may further include, but is not limited to, step S1410 and step S1420.

Step S1410: the data sample further includes sample features; determining a target data distribution rationality judgment value according to the sample features.

In this step, the sample feature refers to a sample feature field in the data sample, and one data sample may have one or more sample features. The target data distribution rationality judgment value refers to a distribution rationality judgment value of data of the sample feature.

It should be noted that, for a sample feature of a data sample, if a large number of abnormal values exist in the data distribution of the sample feature or a serious long tail effect exists, the effectiveness of model optimization learning and a model evaluation result are seriously affected, so that the period of model development and landing is prolonged. If the data distribution of each sample characteristic of the data set sample can be evaluated, on one hand, the quality of the data sample can be optimized by perfecting the collection of the data sample, so that the model development and the landing are accelerated, on the other hand, a technical scheme and a decision reference are provided for the preprocessing of the data sample, and the quality of the data preprocessing scheme is improved, so that the model development and the landing are accelerated.

Step S1420: the obtaining of a target judgment value according to the sample label rationality judgment value includes: obtaining the target judgment value according to the sample label rationality judgment value and the target data distribution rationality judgment value.

In this step, the target judgment value is obtained according to the sample label rationality judgment value and the target data distribution rationality judgment value. The sample label rationality judgment value and the target data distribution rationality judgment value may each be used directly as a component of the target judgment value, or the two values may first be weighted and then used as components of the target judgment value.

In this embodiment, with the data processing method including the above steps S1410 to S1420, the data sample further includes sample features, a target data distribution rationality judgment value is determined according to the sample features, and the target judgment value is obtained according to the sample label rationality judgment value and the target data distribution rationality judgment value. According to the scheme of the embodiment of the invention, the distribution rationality of the data of the sample features is judged from the sample features, so that the quality of the data sample is judged and the effectiveness of judging the quality of the data sample is improved.

In an embodiment, as shown in fig. 15, step S1410 is further described, and step S1410 may include, but is not limited to, step S1411, step S1412, step S1413, step S1414, and step S1415.

Step S1411: traversing the sample characteristics of the data sample to obtain the currently traversed target sample characteristics.

In this step, since the data sample includes the sample feature, the sample feature of the data sample can be traversed to obtain the target sample feature in the currently traversed data sample, which is convenient for obtaining the probability density distribution judgment value of the target sample feature and the abnormal value judgment value of the target sample feature in the subsequent steps.

Step S1412: acquiring a probability density distribution judgment value of the target sample characteristic and an abnormal value judgment value of the target sample characteristic.

In this step, the probability density distribution judgment value is used to judge whether the target sample feature has a serious long tail effect, and the abnormal value judgment value is used to judge whether the distribution of the data of the target sample feature has an abnormal value. In an optional embodiment, the probability density distribution judgment value of the target sample feature is calculated by combining the 3 σ principle and the long tail effect in the related art, and the abnormal value judgment value of the target sample feature is calculated by combining the boxplot technology.

It should be noted that the obtaining of the probability density distribution judgment value of the target sample characteristic and the abnormal value judgment value of the target sample characteristic is to facilitate obtaining of a data distribution rationality judgment value in the following.

Step S1413: obtaining a data distribution rationality judgment value of the target sample characteristic according to the probability density distribution judgment value and the abnormal value judgment value.

In this step, the data distribution rationality judgment value of the target sample feature is obtained according to the probability density distribution judgment value and the abnormal value judgment value; for example, the two values may each be weighted and then summed to obtain the data distribution rationality judgment value of the target sample feature.

Step S1414: counting the cases in which the data distribution rationality judgment value is greater than a preset judgment threshold value to obtain a third statistical value.

In this step, the larger the data distribution rationality judgment value is, the worse the data distribution of the sample feature is. When the data distribution rationality judgment value is greater than the preset judgment threshold, the data distribution of the target sample feature is poor; such cases are counted to obtain the third statistical value, which is used subsequently to obtain the target sample feature data distribution rationality judgment value.

Step S1415: obtaining a target sample characteristic data distribution rationality judgment value according to the third statistical value.

In this step, the target sample characteristic data distribution rationality judgment value is obtained according to the third statistical value, and the third statistical value may be divided by the total number of the sample characteristics of the data sample, so as to obtain the target sample characteristic data distribution rationality judgment value.

In this embodiment, with the data processing method including the above steps S1411 to S1415, the sample features of the data sample are traversed to obtain the currently traversed target sample feature; the probability density distribution judgment value and the abnormal value judgment value of the target sample feature are obtained; the data distribution rationality judgment value of the target sample feature is obtained according to the probability density distribution judgment value and the abnormal value judgment value; the cases in which the data distribution rationality judgment value is greater than the preset judgment threshold are counted to obtain a third statistical value; and the target sample feature data distribution rationality judgment value is obtained according to the third statistical value. According to the scheme of the embodiment of the invention, the rationality judgment value of the data distribution of the sample features is obtained from the probability density distribution judgment value and the abnormal value judgment value of the sample features, so that the effectiveness of judging the quality of the data sample can be improved.
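Steps S1411 to S1415 can be sketched as follows. The embodiment names the 3σ principle and the boxplot technique but gives no formulas, so everything concrete here is an assumption made for illustration: the probability density distribution judgment value is taken as the share of values farther than three standard deviations from the mean, the abnormal value judgment value is taken as the share of values outside the 1.5 × IQR boxplot whiskers, and the function names, equal weights, and threshold parameter are hypothetical.

```python
import statistics
from typing import Mapping, Sequence


def three_sigma_fraction(values: Sequence[float]) -> float:
    """Share of values farther than 3 standard deviations from the mean."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    if std == 0:
        return 0.0
    return sum(1 for v in values if abs(v - mean) > 3 * std) / len(values)


def boxplot_outlier_fraction(values: Sequence[float]) -> float:
    """Share of values outside the boxplot whiskers (1.5 * IQR rule)."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return sum(1 for v in values if v < low or v > high) / len(values)


def distribution_rationality_value(
    features: Mapping[str, Sequence[float]],
    judgment_threshold: float,
    w_density: float = 0.5,   # hypothetical weight
    w_outlier: float = 0.5,   # hypothetical weight
) -> float:
    """Hypothetical sketch of steps S1411 to S1415 (larger means worse)."""
    if not features:
        raise ValueError("at least one sample feature is required")
    third_stat = 0                                          # "third statistical value"
    for values in features.values():                        # S1411
        density = three_sigma_fraction(values)              # S1412
        outlier = boxplot_outlier_fraction(values)          # S1412
        score = w_density * density + w_outlier * outlier   # S1413: weighted sum
        if score > judgment_threshold:                      # S1414
            third_stat += 1
    return third_stat / len(features)                       # S1415
```

Because a larger per-feature score indicates a worse distribution, the returned ratio is the share of features whose distribution is judged poor under these assumptions.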

It is noted that, for a sample feature of a data sample, if a large number of abnormal values exist in the data distribution of the sample feature or a serious long tail effect exists, the effectiveness of model optimization learning and the result of model evaluation are seriously affected, so that the period of model development and landing is prolonged. If the data distribution of each sample characteristic of the data sample can be judged, on one hand, the quality of the data sample can be optimized by perfecting the acquisition of the data sample, so that the model development and the landing are accelerated, on the other hand, a technical scheme decision reference can be provided for the preprocessing of the data sample, and the quality of a data preprocessing scheme is improved, so that the model development and the landing are accelerated.

In an embodiment, as shown in fig. 16, the step S140 is further described, and the step S140 may include, but is not limited to, the steps S141, S142, and S143.

Step S141: converting the target judgment value into grade information according to a preset grade threshold value.

In this step, the preset level threshold is used to map target judgment values that fall within the same range to the same level, and the level information may be text information. In an optional embodiment, when the target judgment value includes a sample label rationality judgment value, the sample label rationality may be output as one of three levels: high, medium, or low. When the target judgment value includes a target sample feature data distribution rationality judgment value, the target sample feature data distribution rationality may be output as one of three levels: high, medium, or low. When the target judgment value includes a target sample feature characterization capability judgment value, the target sample feature characterization capability may be output as one of three levels: strong, medium, or weak.

Step S142: obtaining a comprehensive judgment value according to the target judgment value and the preset weight of the target judgment value.

In this step, when the target judgment value includes a plurality of judgment values, the judgment values are each multiplied by their preset weights and added to obtain a comprehensive judgment value. In an optional embodiment, when the target judgment value includes a sample label rationality judgment value, a target sample feature data distribution rationality judgment value, and a target sample feature characterization capability judgment value, these three values are each multiplied by their preset weights and added to obtain the comprehensive judgment value. The comprehensive judgment value is a specific numerical value expressing the overall judgment.

Step S143: obtaining a target judgment result according to the grade information and the comprehensive judgment value.

In this step, a target judgment result is obtained according to the level information and the comprehensive judgment value. The level information and the comprehensive judgment value may be output directly as the target judgment result, or the comprehensive judgment value may first be converted into a corresponding rating level, and the rating level and the level information may then be output as the target judgment result.

In this embodiment, by using the data processing method including the steps S141 to S143, the target determination value is converted into the level information according to the preset level threshold, the comprehensive determination value is obtained according to the target determination value and the preset weight of the target determination value, and the target determination result is obtained according to the level information and the comprehensive determination value. According to the scheme of the embodiment of the invention, the grade information and the comprehensive judgment value of the target judgment value are obtained through processing, and the target judgment result is obtained according to the grade information and the comprehensive judgment value, so that the aims of standardizing the target judgment result and enhancing the readability of the target judgment result can be achieved.
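Steps S141 to S143 can be sketched as follows. This is only an illustration under assumptions: the function names `to_level` and `target_result`, the cut-off values 0.4 and 0.7, the level names, and the dictionary layout of the result are hypothetical, and the sketch treats every judgment value as "larger is better". A judgment value with the opposite orientation would need its level names reversed.

```python
from typing import Dict, Mapping, Tuple


def to_level(
    value: float,
    thresholds: Tuple[float, float] = (0.4, 0.7),          # hypothetical cut-offs
    names: Tuple[str, str, str] = ("low", "medium", "high"),
) -> str:
    """S141: map a judgment value to one of three text levels."""
    low_cut, high_cut = thresholds
    if value < low_cut:
        return names[0]
    if value < high_cut:
        return names[1]
    return names[2]


def target_result(
    values: Mapping[str, float],
    weights: Mapping[str, float],
) -> Dict[str, object]:
    """S141 to S143: level information plus weighted comprehensive value."""
    levels = {name: to_level(v) for name, v in values.items()}         # S141
    composite = sum(weights[name] * v for name, v in values.items())   # S142
    return {"levels": levels, "composite": composite}                  # S143
```

A caller would pass one entry per judgment value that is present, for example a sample label rationality value and a feature characterization value, together with the preset weight for each.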

In an embodiment, as shown in fig. 17, the data processing method is further described, and the data processing method may further include, but is not limited to, step S150 and step S160.

Step S150: sending the target judgment result to the client so that the client displays the target judgment result.

In this step, the client refers to the sender of the data sample. In an optional embodiment, the client is a user computer, and the server that generates the target judgment result is a physical server machine. The target judgment result is output to the client through the Internet or a local area network, which makes it easier for the client to judge the accuracy of the data sample according to the target judgment result.

It should be noted that any storage manner in the related art may be used to store the target determination result, for example, a file or a database in any related art, and the client may be provided with a display interface of the target determination result, so that the client displays the target determination result.

Step S160: generating a result file according to the target judgment result and the client information of the client, and sending the result file to the client.

In this step, the result file includes the target judgment result, and the client refers to the sender of the data sample. In an optional embodiment, the client is a user computer, and the server that generates the target judgment result is a physical server machine. The target judgment result is output to the client through the Internet or a local area network, which makes it easier for the client to judge the accuracy of the data sample according to the target judgment result.

It should be noted that the client information of the client may be port information of a specific computer of the client. In an optional embodiment, the user of the client may be a registered user of the system, and the client information may include the user name of the user, the user ID, and the like, which is not described herein again. A result file is generated according to the client information and the target judgment result and sent to the client, so that the result file can be more personalized.

In this embodiment, the data processing method including the steps S150 to S160 is adopted to send the target determination result to the client, so that the client displays the target determination result; or generating a result file according to the target judgment result and the client information of the client, and sending the result file to the client. According to the scheme of the embodiment of the invention, the data sample is sent by the client, and the target judgment result is sent back to the client after the target judgment result is obtained, so that the aim of feeding back the target judgment result to the client can be achieved.
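The result file of step S160 could be produced along the following lines. The embodiment does not specify a file format, so the JSON layout, the function name `write_result_file`, the `user_id` key, and the file naming scheme are all assumptions made for illustration; transmission to the client is left out.

```python
import json
from pathlib import Path
from typing import Any, Mapping


def write_result_file(
    result: Mapping[str, Any],
    client_info: Mapping[str, Any],
    out_dir: str,
) -> Path:
    """Hypothetical sketch of S160: combine result and client info in one file."""
    # Name the file after the (assumed) user ID so each client gets its own file.
    user_id = str(client_info["user_id"])
    path = Path(out_dir) / f"judgment_{user_id}.json"
    payload = {"client": dict(client_info), "result": dict(result)}
    path.write_text(
        json.dumps(payload, ensure_ascii=False, indent=2),
        encoding="utf-8",
    )
    return path
```

The returned path can then be handed to whatever transport the server uses to send the file back to the client.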

In addition, an embodiment of the present invention also provides a data processing apparatus including: a memory, a processor, and a computer program stored on the memory and executable on the processor.

The processor and memory may be connected by a bus or other means.

The memory, which is a non-transitory computer readable storage medium, may be used to store non-transitory software programs as well as non-transitory computer executable programs. Further, the memory may include high speed random access memory, and may also include non-transitory memory, such as at least one disk storage device, flash memory device, or other non-transitory solid state storage device. In some embodiments, the memory optionally includes memory located remotely from the processor, and these remote memories may be connected to the processor through a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.

The non-transitory software programs and instructions required to implement the data processing method of the above-described embodiment are stored in the memory, and when executed by the processor, the data processing method of the above-described embodiment is executed, for example, the method steps S110 to S140 in fig. 1, the method steps S121 to S124 in fig. 2, the method steps S310 to S330 in fig. 3, the method steps S410 to S420 in fig. 4, the method steps S510 to S530 in fig. 5, the method steps S610 to S620 in fig. 6, the method steps S710 to S730 in fig. 7, the method steps S810 to S850 in fig. 8, the method steps S910 to S960 in fig. 9, the method steps S1010 to S1040 in fig. 10, the method step S1110 in fig. 11, the method steps S1041 to S1042 in fig. 12, the method steps S951 to S954 in fig. 13, the method steps S1410 to S1420 in fig. 14, the method steps S1411 to S1415 in fig. 15, the method steps S141 to S143 in fig. 16, and the method steps S150 to S160 in fig. 17.

Furthermore, an embodiment of the present invention further provides a computer-readable storage medium, which stores computer-executable instructions, which are executed by a processor or a controller, for example, by a processor in the above-mentioned apparatus embodiment, and can enable the above-mentioned processor to execute the data processing method in the above-mentioned embodiment, for example, execute the above-mentioned method steps S110 to S140 in fig. 1, method steps S121 to S124 in fig. 2, method steps S310 to S330 in fig. 3, method steps S410 to S420 in fig. 4, method steps S510 to S530 in fig. 5, method steps S610 to S620 in fig. 6, method steps S710 to S730 in fig. 7, method steps S810 to S850 in fig. 8, method steps S910 to S960 in fig. 9, method steps S1010 to S1040 in fig. 10, method step S1110 in fig. 11, method steps S1041 to S1042 in fig. 12, method steps S951 to S954 in fig. 13, method steps S1410 to S1420 in fig. 14, method steps S1411 to S1415 in fig. 15, method steps S141 to S143 in fig. 16, and method steps S150 to S160 in fig. 17.

One of ordinary skill in the art will appreciate that all or some of the steps, systems, and methods disclosed above may be implemented as software, firmware, hardware, and suitable combinations thereof. Some or all of the physical components may be implemented as software executed by a processor, such as a central processing unit, digital signal processor, or microprocessor, or as hardware, or as an integrated circuit, such as an application specific integrated circuit. Such software may be distributed on computer readable media, which may include computer storage media (or non-transitory media) and communication media (or transitory media). The term computer storage media includes volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer readable instructions, data structures, program modules or other data, as is well known to those of ordinary skill in the art. Computer storage media includes, but is not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, Digital Versatile Disks (DVD) or other optical disk storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store the desired information and which can be accessed by a computer. In addition, communication media typically embodies computer readable instructions, data structures, program modules or other data in a modulated data signal such as a carrier wave or other transport mechanism and includes any information delivery media as known to those skilled in the art.

While the preferred embodiments of the present invention have been described in detail, it will be understood by those skilled in the art that the foregoing and various other changes, omissions and deviations in the form and detail thereof may be made without departing from the scope of this invention.

Claims

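1. A method of data processing, comprising:

receiving a data sample sent by a client, wherein the data sample comprises a sample label;

obtaining a consistency parameter between the sample labels of adjacent data samples, determining quantity information of reasonable samples in the data samples according to the consistency parameter, and obtaining a sample label rationality judgment value according to the quantity information;

obtaining a target judgment value according to the sample label rationality judgment value;

and obtaining a target judgment result according to the target judgment value.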

2. The data processing method of claim 1, wherein the obtaining of the consistency parameter between the sample labels of the adjacent data samples and the determining of the number information of reasonable samples in the data samples according to the consistency parameter comprises:

traversing the data sample to obtain a currently traversed target data sample;

obtaining a proximity data sample, the proximity data sample being a smaller distance from the target data sample than a non-proximity data sample;

obtaining a consistency parameter between the sample label of the adjacent data sample and the sample label of the target data sample, and judging whether the target data sample is a reasonable sample according to the consistency parameter;

and determining the quantity information of reasonable samples in the data samples according to the judgment result of the target data samples.

3. The data processing method of claim 2, wherein the determining a consistency parameter based on the sample label of the neighboring data sample and the sample label of the target data sample comprises:

acquiring a scene type of a target operation scene;

when the scene type is a classification scene, determining a quantity value of the adjacent data samples whose sample label is the same as the sample label of the target data sample;

determining the quantitative value as a consistency parameter between the sample label of the adjacent data sample and the sample label of the target data sample.

4. The data processing method of claim 3, wherein said determining whether the target data sample is a reasonable sample based on the consistency parameter comprises:

if the quantity value is larger than or equal to a preset quantity threshold value, determining that the target data sample is a reasonable sample;

or,

and if the quantity value is smaller than a preset quantity threshold value, determining that the target data sample is not a reasonable sample.

5. The data processing method of claim 2, wherein said obtaining a consistency parameter between the sample label of the adjacent data sample and the sample label of the target data sample comprises:

acquiring a scene type of a target operation scene;

when the scene type is a regression scene, obtaining a variation coefficient of a sample label of the adjacent data sample according to the sample label of the adjacent data sample and the sample label of the target data sample;

determining the coefficient of variation as a consistency parameter between the sample labels of the neighboring data samples and the sample label of the target data sample.

6. The data processing method of claim 5, wherein said determining whether the target data sample is a reasonable sample from the coefficient of variation comprises:

if the coefficient of variation is smaller than or equal to a preset coefficient threshold, determining that the target data sample is a reasonable sample;

or,

and if the coefficient of variation is larger than a preset coefficient threshold value, determining that the target data sample is not a reasonable sample.

7. The data processing method of claim 1, wherein the data samples include sample features, the data processing method further comprising:

acquiring a scene type of a target operation scene;

obtaining a target sample characteristic characterization capability judgment value according to the scene type and the data sample with the sample characteristics;

the obtaining of the target judgment value according to the sample label rationality judgment value comprises the following steps:

and obtaining the target judgment value according to the sample label rationality judgment value and the target sample characteristic characterization capability judgment value.

8. The data processing method of claim 7, wherein the scene type is a classification scene, and obtaining a target sample feature characterization capability judgment value according to the scene type and the data sample with sample features comprises:

acquiring a neural network model;

training the neural network model by using the data sample to obtain a training result;

predicting the neural network model according to the data sample and the training result to obtain a prediction result, and obtaining a model judgment index value according to the prediction result;

determining a first sample feature characterization capability judgment value according to the model judgment index value;

and determining the first sample characteristic characterization capability judgment value as the target sample characteristic characterization capability judgment value.

9. The data processing method of claim 7, wherein the scene type is a regression scene, and obtaining a target sample feature characterization capability judgment value according to the scene type and the data sample with the sample feature comprises:

acquiring a neural network model;

predicting the neural network model according to the data sample and the training result to obtain a prediction result, and obtaining a model judgment index value according to the prediction result;

obtaining a second sample characteristic characterization capability judgment value according to the correlation coefficient of the sample characteristic and the sample label;

and obtaining the target sample characteristic characterization capability judgment value according to the first sample characteristic characterization capability judgment value and the second sample characteristic characterization capability judgment value.

10. The data processing method according to any one of claims 8 or 9, wherein said determining a first sample feature characterization capability judgment value based on said model judgment index value comprises:

if the current iteration times are smaller than the preset iteration times, increasing the complexity of the neural network model;

training the neural network model with the improved complexity by using the data sample, and updating the training result;

predicting the neural network model with the complexity increased according to the training result to obtain a new model judgment index value;

updating a first statistical value according to the change information of the model judgment index value, wherein the first statistical value is set according to the model judgment index value.

11. The data processing method according to claim 8 or 9, wherein the determining a first sample feature characterization capability judgment value based on the model judgment index value comprises:

if the current iteration number is equal to a preset iteration number, obtaining the first sample feature characterization capability judgment value according to a first statistical value and the current iteration number, wherein the first statistical value is set according to the model judgment index value.

12. The data processing method of claim 10, wherein the updating the first statistical value according to the change information of the model judgment index value comprises:

if the model judgment index value increases, accumulating the first statistical value to obtain an updated first statistical value;

or,

if the model judgment index value is greater than a preset judgment index value, accumulating the first statistical value to obtain the updated first statistical value.
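
As a non-limiting illustration, the iterative procedure of claims 10 to 12 might be sketched as follows. The callback `evaluate`, the unit complexity step, and the final normalisation (first statistical value divided by the number of iterations) are assumptions made for this sketch only; the claims do not fix how complexity is increased or how the two quantities are combined.

```python
from typing import Callable


def first_judgment_value(
    evaluate: Callable[[int], float],
    preset_iterations: int,
) -> float:
    """Count how often the index value rises as model complexity grows.

    `evaluate(complexity)` is a hypothetical callback: it is assumed to train
    a model of the given complexity on the data samples and to return the
    resulting model judgment index value (for example, an accuracy score).
    """
    if preset_iterations <= 0:
        raise ValueError("preset_iterations must be positive")

    complexity = 1
    previous = evaluate(complexity)  # index value of the initial model
    first_statistic = 0              # the "first statistical value"
    iteration = 0

    while iteration < preset_iterations:
        complexity += 1                 # increase the model complexity
        current = evaluate(complexity)  # retrain and obtain a new index value
        if current > previous:          # index value increased (first branch of claim 12)
            first_statistic += 1
        previous = current
        iteration += 1

    # Assumed normalisation: share of iterations in which the index value rose.
    return first_statistic / iteration
```

For instance, if the index values for complexities 1 to 5 are 0.60, 0.70, 0.75, 0.74 and 0.80, then four iterations yield three increases and the sketch returns 0.75.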

13. The data processing method of claim 9, wherein the obtaining a second sample feature characterization capability judgment value according to the correlation coefficient between the sample feature and the sample label comprises:

traversing the sample features of the data sample to obtain the currently traversed target sample feature;

obtaining the correlation coefficient according to the target sample feature and the sample label;

setting a second statistical value according to the correlation coefficient;

and obtaining the second sample feature characterization capability judgment value according to the second statistical value and the total number of the sample features.
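
As a non-limiting illustration, the traversal of claim 13 might be sketched as follows. The use of the Pearson coefficient, the hypothetical parameter `min_abs_corr`, and the rule of counting features whose absolute correlation reaches that value are assumptions of this sketch; the claim only requires that the second statistical value be set according to the correlation coefficient.

```python
import math
from typing import Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient; returns 0.0 if either input is constant."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def second_judgment_value(
    features: Sequence[Sequence[float]],
    labels: Sequence[float],
    min_abs_corr: float = 0.3,  # assumed cut-off, not taken from the claims
) -> float:
    """Share of features whose |correlation| with the label reaches the cut-off.

    `features` holds one sequence of values per sample feature (column-wise).
    """
    if not features:
        raise ValueError("at least one sample feature is required")

    second_statistic = 0                 # the "second statistical value"
    for column in features:              # traverse the sample features
        r = pearson(column, labels)      # correlation with the sample label
        if abs(r) >= min_abs_corr:
            second_statistic += 1
    # Assumed combination: statistic divided by the total number of features.
    return second_statistic / len(features)
```

With three features of which one is constant and the other two are perfectly correlated or anti-correlated with the label, the sketch returns two thirds.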

14. The data processing method of claim 1, wherein the data sample further comprises sample features, and the data processing method further comprises:

determining a target data distribution rationality judgment value according to the sample features;

and obtaining the target judgment value according to the sample label rationality judgment value and the target data distribution rationality judgment value.

15. The data processing method according to claim 14, wherein the determining a target data distribution rationality judgment value based on the sample feature includes:

acquiring a probability density distribution judgment value of the target sample feature and an abnormal value judgment value of the target sample feature;

obtaining a data distribution rationality judgment value of the target sample feature according to the probability density distribution judgment value and the abnormal value judgment value;

counting the cases in which the data distribution rationality judgment value is greater than a preset judgment threshold to obtain a third statistical value;

and obtaining a target sample feature data distribution rationality judgment value according to the third statistical value.
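
As a non-limiting illustration, the counting step of claim 15 might be sketched as follows. The sketch assumes that the probability density distribution judgment value and the abnormal value judgment value of each feature have already been computed and lie in the range 0 to 1. Combining them by a plain average, and dividing the third statistical value by the number of features, are assumptions of this sketch; the claim does not specify either operation.

```python
from typing import Sequence, Tuple


def distribution_judgment_value(
    per_feature: Sequence[Tuple[float, float]],
    preset_threshold: float = 0.6,  # assumed value of the preset judgment threshold
) -> float:
    """Share of features whose data distribution rationality exceeds the threshold.

    Each entry of `per_feature` is a pair
    (probability density distribution judgment value, abnormal value judgment value).
    """
    if not per_feature:
        raise ValueError("at least one sample feature is required")

    third_statistic = 0  # the "third statistical value"
    for density_value, outlier_value in per_feature:
        # Assumed combination: arithmetic mean of the two judgment values.
        rationality = (density_value + outlier_value) / 2
        if rationality > preset_threshold:
            third_statistic += 1
    # Assumed normalisation by the number of features.
    return third_statistic / len(per_feature)
```

For the four pairs (0.9, 0.8), (0.5, 0.4), (0.7, 0.9) and (0.2, 0.9), two features exceed the assumed threshold of 0.6 and the sketch returns 0.5.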

16. The data processing method of claim 1, wherein obtaining a target judgment result according to the target judgment value comprises:

converting the target judgment value into grade information according to a preset grade division threshold value;

obtaining a comprehensive judgment value according to the target judgment value and a preset weight of the target judgment value;

and obtaining the target judgment result according to the grade information and the comprehensive judgment value.
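
As a non-limiting illustration, the grading and weighting of claim 16 might be sketched as follows. The grade labels, the threshold values, the dictionary keys, and the use of a normalised weighted sum for the comprehensive judgment value are all assumptions of this sketch and are not taken from the claims.

```python
from typing import Dict, Sequence, Tuple

# Assumed grade division thresholds, highest first: (lower bound, grade).
DEFAULT_THRESHOLDS: Tuple[Tuple[float, str], ...] = (
    (0.8, "A"),
    (0.6, "B"),
    (0.4, "C"),
)


def to_grade(
    value: float,
    thresholds: Sequence[Tuple[float, str]] = DEFAULT_THRESHOLDS,
    lowest: str = "D",
) -> str:
    """Convert one target judgment value into grade information."""
    for lower_bound, grade in thresholds:
        if value >= lower_bound:
            return grade
    return lowest


def target_judgment_result(
    values: Dict[str, float],
    weights: Dict[str, float],
) -> Dict[str, object]:
    """Combine per-dimension grades with a weighted comprehensive value."""
    grades = {name: to_grade(value) for name, value in values.items()}
    total_weight = sum(weights[name] for name in values)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    # Assumed combination: weighted sum normalised by the total weight.
    comprehensive = (
        sum(values[name] * weights[name] for name in values) / total_weight
    )
    return {"grades": grades, "comprehensive": comprehensive}
```

Given judgment values of 0.9, 0.5 and 0.7 with weights of 0.5, 0.3 and 0.2, the sketch assigns the grades A, C and B and a comprehensive judgment value of approximately 0.74.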

17. The data processing method of claim 1, wherein the data processing method further comprises:

sending the target judgment result to the client, so that the client displays the target judgment result;

or,

and generating a result file according to the target judgment result and the client information of the client, and sending the result file to the client.

18. A data processing apparatus, comprising: a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor implements the data processing method according to any one of claims 1 to 17 when executing the computer program.

19. A computer-readable storage medium storing computer-executable instructions for performing the data processing method of any one of claims 1 to 17.