CN117786430A - Data processing method and device, electronic equipment and computer readable storage medium - Google Patents


Info

Publication number: CN117786430A
Application number: CN202211272085.XA
Authority: CN (China)
Legal status: Pending (the status listed is an assumption, not a legal conclusion)
Other languages: Chinese (zh)
Inventors: 朱曦阳, 冯云龙
Original and current assignee: Shuhang Technology Beijing Co ltd
Priority: CN202211272085.XA
Prior art keywords: data, data set, category, feature

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application discloses a data processing method and apparatus, an electronic device, and a computer readable storage medium. The method includes the following steps: acquiring a first feature data set of a data set to be annotated and a second feature data set of an annotated data set; performing clustering processing on the first feature data set and the second feature data set to obtain a clustering result; and obtaining a second category of the data set to be annotated according to the clustering result and a first category of the annotated data set.

Description

Data processing method and device, electronic equipment and computer readable storage medium
Technical Field
The present disclosure relates to the field of artificial intelligence technologies, and in particular, to a data processing method and apparatus, an electronic device, and a computer readable storage medium.
Background
With the development of artificial intelligence technology, its applications have become increasingly broad, including the use of artificial intelligence models to classify data. Before an artificial intelligence model is used for classification, it needs to be trained with training data that has categories in order to improve its classification accuracy, and the larger the amount of training data, the better the training effect. How to obtain data with categories is therefore of great importance for improving the training effect of artificial intelligence models.
Disclosure of Invention
The application provides a data processing method and device, electronic equipment and a computer readable storage medium.
In a first aspect, a data processing method is provided, the method comprising:
acquiring a first feature data set of a data set to be annotated and a second feature data set of an annotated data set;
performing clustering processing on the first feature data set and the second feature data set to obtain a clustering result;
and obtaining a second category of the data set to be annotated according to the clustering result and a first category of the annotated data set.
In this embodiment, when the data processing apparatus has acquired the first feature data set of the data set to be annotated and the second feature data set of the annotated data set, it performs clustering processing on the two sets to obtain a clustering result over the feature data in the first feature data set and the feature data in the second feature data set. It then obtains the second category of the data set to be annotated according to the clustering result and the first category of the annotated data set, thereby determining the category of unlabeled data by means of the annotated data.
In combination with any one of the embodiments of the present application, the performing clustering processing on the first feature data set and the second feature data set to obtain a clustering result includes:
reducing the dimension of the feature data in the first feature data set to obtain a third feature data set;
and performing clustering processing on the second feature data set and the third feature data set to obtain the clustering result.
In this embodiment, since the dimension of the feature data in the third feature data set is lower than that of the feature data in the first feature data set, performing the clustering processing on the second feature data set and the third feature data set reduces the amount of computation incurred by the clustering processing.
In combination with any one of the embodiments of the present application, the performing clustering processing on the second feature data set and the third feature data set to obtain the clustering result includes:
performing clustering processing on the second feature data set and the third feature data set to obtain at least two fourth feature data sets, where the at least two fourth feature data sets include a fifth feature data set and a sixth feature data set;
merging the fifth feature data set and the sixth feature data set to obtain a seventh feature data set when the proportion of matching data pairs among the feature data pairs is greater than or equal to a first threshold, where a feature data pair includes one feature data in the fifth feature data set and one feature data in the sixth feature data set, and a matching data pair is a feature data pair whose two feature data match each other;
and obtaining the clustering result according to the seventh feature data set and the feature data sets, among the at least two fourth feature data sets, other than the fifth feature data set and the sixth feature data set.
In this embodiment, having obtained at least two fourth feature data sets by performing clustering processing on the second feature data set and the third feature data set, the data processing apparatus merges the fifth feature data set and the sixth feature data set into a seventh feature data set based on the proportion of matching data pairs among the feature data pairs. It then obtains the clustering result according to the seventh feature data set and the feature data sets, among the at least two fourth feature data sets, other than the fifth and sixth feature data sets. This amounts to correcting the clusters produced by the clustering processing based on the matching-pair proportion, thereby improving the accuracy of the clustering result.
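The merging rule above can be sketched as follows. The cosine-similarity matching criterion and the threshold values are assumptions for illustration only, since the patent does not fix how two feature data are judged to match:

```python
# Sketch of the cluster-merging step: two clusters (the "fifth" and "sixth"
# feature data sets) are merged into a "seventh" set when the proportion of
# matching pairs among all cross-cluster feature pairs reaches the first
# threshold. The matching rule below (cosine similarity) is an assumption.
import numpy as np

def match(x, y, sim_threshold=0.9):
    """Two feature vectors 'match' if their cosine similarity is high enough."""
    return float(np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))) >= sim_threshold

def maybe_merge(cluster_a, cluster_b, first_threshold=0.5, sim_threshold=0.9):
    """Return the merged cluster if the matching-pair proportion reaches
    first_threshold, otherwise None (the clusters stay separate)."""
    pairs = [(x, y) for x in cluster_a for y in cluster_b]
    matched = sum(1 for x, y in pairs if match(x, y, sim_threshold))
    if matched / len(pairs) >= first_threshold:
        return cluster_a + cluster_b  # the seventh feature data set
    return None
```

In this sketch, clusters whose cross-pair matching proportion falls below the first threshold are left untouched, matching the patent's description of keeping the remaining fourth feature data sets as they are.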
In combination with any one of the embodiments of the present application, the method further includes:
and when a modification instruction for the second category is detected, modifying the second category according to the modification instruction to obtain a third category of the data set to be annotated.
In this embodiment, after the data processing apparatus obtains the second category of the data set to be annotated, a user who determines that the second category is wrong may change the category of the data set to be annotated to the third category by inputting a modification instruction to the data processing apparatus, thereby improving the accuracy of the category of the data set to be annotated.
In combination with any one of the embodiments of the present application, the obtaining the second feature dataset of the labeled dataset includes:
acquiring a data set to be confirmed and a classification model; the data set to be confirmed comprises first data;
identifying a fourth category of the first data using the classification model;
taking the data set to be confirmed as the marked data set under the condition that the fourth category is the same as the category indicated by the label of the first data;
removing the first data in the data set to be confirmed to obtain the marked data set under the condition that the fourth category is different from the category indicated by the label of the first data;
and carrying out feature extraction processing on the marked data set to obtain the second feature data set.
In this embodiment, when the data processing apparatus identifies the fourth category of the first data using the classification model, it determines, based on the fourth category, whether the category indicated by the label of the first data is correct. This improves the accuracy of the data in the annotated data set and, in turn, the accuracy of the categories associated with the second feature data set obtained by performing feature extraction processing on the annotated data set.
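The construction of the annotated data set described above can be sketched as follows; `classify` is a hypothetical stand-in for the classification model, which the patent does not specify:

```python
# Sketch of building the annotated data set from the data set to be
# confirmed: a sample whose label disagrees with the classification model's
# prediction (the "fourth category") is removed; the rest are kept.
# `classify` is an assumed callable standing in for the classification model.
def build_annotated_dataset(dataset_to_confirm, classify):
    """Keep only samples whose label matches the model's predicted category."""
    annotated = []
    for data, label in dataset_to_confirm:
        fourth_category = classify(data)   # category identified by the model
        if fourth_category == label:       # label confirmed, keep the sample
            annotated.append((data, label))
        # otherwise the sample is removed from the data set to be confirmed
    return annotated
```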
In combination with any one of the embodiments of the present application, the data in the to-be-annotated data set and the data in the annotated data set are both images.
In combination with any one of the embodiments of the present application, the first category and the second category are both commodity categories.
Based on this implementation, the data processing apparatus can obtain the commodity category of the data in the data set to be annotated. For example, if the data in the data set to be annotated and the data in the annotated data set are both images, the first category of an image in the annotated data set is the category of the commodity in that image. With this technical solution, the category of the commodity in an image in the data set to be annotated can be determined based on the first category of the annotated data set, yielding the second category.
In combination with any one of the embodiments of the present application, the method further includes:
obtaining a model to be trained;
obtaining a training data set according to the second category, the data set to be marked and the marked data set;
and training the model to be trained by using the training data set to obtain a trained model.
In this embodiment, once the second category of the data set to be annotated has been obtained, the data processing apparatus may use the data set to be annotated as training data. Obtaining the training data set according to the second category, the data set to be annotated and the annotated data set is therefore equivalent to expanding the amount of training data, so training the model to be trained with this training data set improves the training effect.
In combination with any one of the embodiments of the present application, the training data set includes second data and third data, and a category of the second data is the same as a category of the third data;
the training the model to be trained by using the training data set to obtain a trained model, comprising:
determining a first similarity of the second data and the third data;
obtaining a first loss according to the first similarity; the first similarity is inversely related to the first loss;
and updating parameters of the model to be trained according to the first loss to obtain the trained model.
In this embodiment, the second data and the third data have the same category, so the greater their similarity, the better. The data processing apparatus updates the parameters of the model to be trained according to the first loss; in other words, the first loss characterizes the optimization direction of those parameters. Obtaining the first loss from the first similarity, with the two negatively correlated, is therefore equivalent to using the first similarity to set the optimization direction. Updating the parameters of the model to be trained according to the first loss thus increases the first similarity the model determines between the second data and the third data, that is, it reduces the distance between their feature data, which improves the recognition accuracy of the model to be trained.
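A minimal sketch of such a first loss, assuming cosine similarity and a simple `1 - similarity` form; the patent only requires that the loss be negatively correlated with the first similarity, so this exact form is an assumption:

```python
# Sketch of a first loss that is negatively correlated with the first
# similarity between the second and third data (same category): as the
# similarity grows, the loss falls, so minimising it pulls the two feature
# data closer together. Cosine similarity is an assumed choice.
import numpy as np

def cosine_similarity(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def first_loss(feat_second, feat_third):
    first_similarity = cosine_similarity(feat_second, feat_third)
    return 1.0 - first_similarity  # decreases as the similarity grows
```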
In combination with any of the embodiments of the present application, the training data set further includes fourth data, and the category of the second data is different from the category of the fourth data;
before the obtaining the first loss according to the first similarity, the method further includes:
determining a second similarity of the second data and the fourth data;
and the obtaining a first loss according to the first similarity includes:
obtaining the first loss according to the first similarity and the second similarity; the first loss is positively correlated with the second similarity.
In this embodiment, the second data and the fourth data have different categories, so the smaller their similarity, the better. Since the first loss characterizes the optimization direction of the parameters of the model to be trained, obtaining the first loss from the first similarity and the second similarity, with the first loss negatively correlated with the first similarity and positively correlated with the second similarity, is equivalent to using both similarities to set the optimization direction. Updating the parameters according to the first loss thus increases the first similarity between the second data and the third data and decreases the second similarity between the second data and the fourth data; that is, it reduces the distance between the feature data of the second data and the feature data of the third data while increasing the distance between the feature data of the second data and the feature data of the fourth data, which improves the recognition accuracy of the model to be trained.
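A minimal sketch of a first loss built from both similarities; the linear combination and cosine similarity are assumptions, as the patent only fixes the signs of the two correlations:

```python
# Sketch of a first loss that falls as the first similarity (same-category
# pair: second and third data) rises, and grows with the second similarity
# (different-category pair: second and fourth data). The linear form is an
# assumed illustration of the required correlations.
import numpy as np

def cosine_similarity(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def first_loss(feat_second, feat_third, feat_fourth):
    first_similarity = cosine_similarity(feat_second, feat_third)    # same category
    second_similarity = cosine_similarity(feat_second, feat_fourth)  # different category
    return (1.0 - first_similarity) + second_similarity
```

Minimising this loss simultaneously pulls same-category feature data together and pushes different-category feature data apart, matching the behaviour the embodiment describes.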
In a second aspect, there is provided a data processing apparatus, the apparatus comprising:
the acquisition unit is used for acquiring a first characteristic data set of the data set to be marked and a second characteristic data set of the marked data set;
the first processing unit is used for obtaining a clustering result by carrying out clustering processing on the first characteristic data set and the second characteristic data set;
and the second processing unit is used for obtaining a second category of the data set to be marked according to the clustering result and the first category of the marked data set.
In combination with any one of the embodiments of the present application, the first processing unit is configured to:
reducing the dimension of the feature data in the first feature data set to obtain a third feature data set;
and clustering the second characteristic data set and the third characteristic data set to obtain the clustering result.
In combination with any one of the embodiments of the present application, the first processing unit is configured to:
clustering the second characteristic data set and the third characteristic data set to obtain at least two fourth characteristic data sets; the at least two fourth feature data sets include a fifth feature data set and a sixth feature data set;
merging the fifth feature data set and the sixth feature data set to obtain a seventh feature data set when the proportion of matching data pairs among the feature data pairs is greater than or equal to a first threshold, where a feature data pair includes one feature data in the fifth feature data set and one feature data in the sixth feature data set, and a matching data pair is a feature data pair whose two feature data match each other;
and obtaining the clustering result according to the seventh feature data set and the feature data sets, among the at least two fourth feature data sets, other than the fifth feature data set and the sixth feature data set.
In combination with any one of the embodiments of the present application, the second processing unit is further configured to:
and under the condition that the modification instruction aiming at the second category is detected, modifying the second category according to the modification instruction to obtain a third category of the data set to be annotated.
In combination with any one of the embodiments of the present application, the obtaining unit is configured to:
acquiring a data set to be confirmed and a classification model; the data set to be confirmed comprises first data;
identifying a fourth category of the first data using the classification model;
taking the data set to be confirmed as the marked data set under the condition that the fourth category is the same as the category indicated by the label of the first data;
removing the first data in the data set to be confirmed to obtain the marked data set under the condition that the fourth category is different from the category indicated by the label of the first data;
and carrying out feature extraction processing on the marked data set to obtain the second feature data set.
In combination with any one of the embodiments of the present application, the data in the to-be-annotated data set and the data in the annotated data set are both images.
In combination with any one of the embodiments of the present application, the first category and the second category are both commodity categories.
With reference to any embodiment of the present application, the obtaining unit is further configured to obtain a model to be trained;
the second processing unit is further configured to:
obtaining a training data set according to the second category, the data set to be marked and the marked data set;
and training the model to be trained by using the training data set to obtain a trained model.
In combination with any one of the embodiments of the present application, the training data set includes second data and third data, and a category of the second data is the same as a category of the third data;
The second processing unit is used for:
determining a first similarity of the second data and the third data;
obtaining a first loss according to the first similarity; the first similarity is inversely related to the first loss;
and updating parameters of the model to be trained according to the first loss to obtain the trained model.
In combination with any of the embodiments of the present application, the training data set further includes fourth data, and the category of the second data is different from the category of the fourth data;
the second processing unit is further configured to:
determining a second similarity of the second data and the fourth data;
obtaining the first loss according to the first similarity and the second similarity; the first loss is positively correlated with the second similarity.
In a third aspect, an electronic device is provided, including: a processor and a memory for storing computer program code comprising computer instructions which, when executed by the processor, cause the electronic device to perform a method as described in the first aspect and any one of its possible implementations.
In a fourth aspect, there is provided another electronic device comprising: a processor, transmission means, input means, output means and memory for storing computer program code comprising computer instructions which, when executed by the processor, cause the electronic device to carry out the method as described in the first aspect and any one of its possible implementations.
In a fifth aspect, there is provided a computer readable storage medium having stored therein a computer program comprising program instructions which, when executed by a processor, cause the processor to carry out a method as in the first aspect and any one of its possible implementations.
In a sixth aspect, a computer program product is provided, the computer program product comprising a computer program or instructions which, when run on a computer, cause the computer to perform the method of the first aspect and any one of the possible implementations thereof.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the application.
Drawings
In order to more clearly describe the technical solutions in the embodiments or the background of the present application, the following description will describe the drawings that are required to be used in the embodiments or the background of the present application.
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the application and, together with the description, serve to explain the technical aspects of the application.
FIG. 1 is a schematic flowchart of a data processing method according to an embodiment of the present application;
FIG. 2 is a schematic structural diagram of a data processing apparatus according to an embodiment of the present application;
FIG. 3 is a schematic diagram of a hardware structure of an electronic device according to an embodiment of the present application.
Detailed Description
In order to make the present application solution better understood by those skilled in the art, the following description will clearly and completely describe the technical solution in the embodiments of the present application with reference to the accompanying drawings in the embodiments of the present application, and it is apparent that the described embodiments are only some embodiments of the present application, not all embodiments. All other embodiments, which can be made by one of ordinary skill in the art without undue burden from the present disclosure, are within the scope of the present disclosure.
The terms first, second and the like in the description and in the claims of the present application and in the above-described figures, are used for distinguishing between different objects and not for describing a particular sequential order. Furthermore, the terms "comprise" and "have," as well as any variations thereof, are intended to cover a non-exclusive inclusion. For example, a process, method, system, article, or apparatus that comprises a list of steps or elements is not limited to only those listed steps or elements but may include other steps or elements not listed or inherent to such process, method, article, or apparatus.
It should be understood that, in the present application, "at least one (item)" means one or more, "a plurality" means two or more, and "at least two (items)" means two or more. The term "and/or" describes an association relationship between associated objects and covers three relationships; for example, "A and/or B" may mean: only A, only B, or both A and B, where A and B may be singular or plural. "At least one of the following (items)" or similar expressions means any combination of these items, including any combination of single items or plural items; for example, at least one of a, b or c may represent: a, b, c, "a and b", "a and c", "b and c", or "a and b and c", where a, b and c may be single or plural. The character "/" generally indicates that the associated objects are in an "or" relationship; it may also denote division in mathematical expressions, for example, a/b = a divided by b, 6/3 = 2.
Reference herein to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment may be included in at least one embodiment of the present application. The appearances of such phrases in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. Those of skill in the art will explicitly and implicitly appreciate that the embodiments described herein may be combined with other embodiments.
The execution body of the embodiments of the present application is a data processing apparatus, which may be any electronic device capable of executing the technical solutions disclosed in the method embodiments of the present application. Optionally, the data processing apparatus may be one of the following: a mobile phone, a computer, a tablet computer, or a wearable smart device.
It should be understood that the method embodiments of the present application may also be implemented by way of a processor executing computer program code. Embodiments of the present application are described below with reference to the accompanying drawings in the embodiments of the present application. Referring to fig. 1, fig. 1 is a flow chart of a data processing method according to an embodiment of the present application.
101. A first characteristic data set of the data set to be marked and a second characteristic data set of the marked data set are obtained.
In this embodiment of the present application, the data in the data set to be annotated may be of any type; for example, the data in the data set to be annotated are images, or speech, or text. All data in the data set to be annotated are data to be annotated, where data to be annotated refers to data without a category, that is, data whose category is unknown.
The first feature data set includes the feature data of the data to be annotated in the data set to be annotated. For example, if the data set to be annotated includes data a to be annotated and data b to be annotated, the first feature data set includes the feature data of data a and the feature data of data b. The feature data of a piece of data to be annotated carries the features of that data; optionally, the feature data is a feature vector.
In the embodiment of the present application, the type of the data in the annotated data set is the same as the type of the data in the data set to be annotated. For example, the data in both data sets are images; for another example, the data in both data sets are speech; for another example, the data in both data sets are text. The data in the annotated data set are annotated data, where annotated data refers to data having a category, that is, data whose category is known.
The second feature data set includes the feature data of the annotated data in the annotated data set. For example, if the annotated data set includes annotated data a and annotated data b, the second feature data set includes the feature data of annotated data a and the feature data of annotated data b. The feature data of a piece of annotated data carries the features of that data; optionally, the feature data is a feature vector.
In one implementation of acquiring a first feature data set of a data set to be annotated, a data processing apparatus receives a first feature data set input by a user through an input component. The input assembly includes at least one of: keyboard, mouse, touch screen, touch pad, audio input device.
In another implementation manner of acquiring the first characteristic data set of the data set to be marked, the data processing device receives the first characteristic data set sent by the terminal. The terminal may be any of the following: cell phone, computer, panel computer, server.
In yet another implementation of acquiring the first feature data set of the data set to be annotated, the data processing apparatus acquires the data set to be annotated and obtains the first feature data set by performing feature extraction processing on the data to be annotated in it.
In one implementation of acquiring a second feature data set of the annotated data set, the data processing apparatus receives the second feature data set entered by a user via the input component.
In another implementation of obtaining the second feature data set of the annotated data set, the data processing device receives the second feature data set sent by the terminal.
In yet another implementation of acquiring the second feature data set of the annotated data set, the data processing apparatus acquires the annotated data set and obtains the second feature data set by performing feature extraction processing on the annotated data in it.
It can be understood that the data processing apparatus may perform the step of acquiring the first feature data set and the step of acquiring the second feature data set simultaneously or separately; this application does not limit the order.
102. And clustering the first characteristic data set and the second characteristic data set to obtain a clustering result.
In this embodiment, the data processing apparatus performs clustering processing on the first feature data set and the second feature data set, that is, on the feature data in the first feature data set together with the feature data in the second feature data set. The clustering processing yields at least one cluster; in other words, the clustering result includes at least one cluster, where each cluster includes at least one feature data and the feature data within one cluster belong to the same category.
103. And obtaining a second category of the data set to be marked according to the clustering result and the first category of the marked data set.
In this embodiment, the category of the annotated data set is the first category, and the category of the data set to be annotated obtained by executing step 103 is the second category. Since the annotated data set has categories, the feature data in the second feature data set have categories, and the clustering result indicates which feature data in the first feature data set share a category with feature data in the second feature data set. The data processing apparatus can therefore obtain the categories of the feature data in the first feature data set according to the feature data in the second feature data set and the clustering result, and thereby obtain the category of the data set to be annotated, namely the second category.
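Step 103 can be sketched as follows, assuming a majority vote over the annotated members of each cluster; the patent does not specify how the first categories within a cluster are aggregated, so the voting rule is an assumption:

```python
# Sketch of step 103: each cluster that contains annotated feature data
# passes the (majority) first category of those members on to the
# unannotated feature data in the same cluster, yielding the second category.
from collections import Counter

def assign_second_category(clusters, labelled_categories):
    """clusters: list of lists of feature-data ids;
    labelled_categories: id -> first category (only annotated ids appear)."""
    second_category = {}
    for cluster in clusters:
        labels = [labelled_categories[i] for i in cluster if i in labelled_categories]
        if not labels:
            continue  # a cluster without annotated members stays unlabelled
        majority = Counter(labels).most_common(1)[0][0]
        for i in cluster:
            if i not in labelled_categories:
                second_category[i] = majority
    return second_category
```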
In the embodiment of the application, when the first feature data set of the data set to be annotated and the second feature data set of the annotated data set have been acquired, the data processing apparatus performs clustering processing on the two sets to obtain a clustering result over the feature data in the first feature data set and the feature data in the second feature data set. It then obtains the second category of the data set to be annotated according to the clustering result and the first category of the annotated data set, thereby determining the category of unlabeled data by means of the annotated data.
As an alternative embodiment, the data processing apparatus performs the following steps in performing step 102:
201. and reducing the dimension of the feature data in the first feature data set to obtain a third feature data set.
The data processing apparatus may reduce the dimension of any one feature data in the first feature data set by performing step 201. Optionally, the data processing apparatus obtains the third feature data set by reducing the dimension of all feature data in the first feature data set.
In one possible implementation, the data processing apparatus reduces the dimension of the feature data in the first feature data set by principal component analysis (PCA), resulting in the third feature data set.
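The PCA dimension reduction described above can be sketched as follows. This is an illustrative example only, not the implementation claimed by this application; the function name `pca_reduce` and the sample values are hypothetical.

```python
import numpy as np

def pca_reduce(features: np.ndarray, n_components: int) -> np.ndarray:
    """Reduce feature vectors to n_components dimensions via PCA.

    features: (n_samples, n_dims) matrix, one row per feature vector.
    """
    centered = features - features.mean(axis=0)
    # SVD of the centered data gives the principal axes in the rows of vt.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    # Project onto the top n_components principal axes.
    return centered @ vt[:n_components].T

# First feature data set: 4 samples in 3 dimensions, reduced to 2.
first_set = np.array([[1.0, 2.0, 3.0],
                      [2.0, 4.1, 6.0],
                      [3.0, 6.0, 9.2],
                      [4.0, 8.1, 12.0]])
third_set = pca_reduce(first_set, 2)
print(third_set.shape)  # (4, 2)
```

The third feature data set keeps the directions of greatest variance while dropping the remaining dimensions.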
202. And clustering the second characteristic data set and the third characteristic data set to obtain the clustering result.
Alternatively, the clustering in step 202 is implemented by density-based spatial clustering of applications with noise (DBSCAN). In this case, the data processing apparatus first reduces the dimension of the feature data in the first feature data set while retaining the information carried by that feature data, obtaining the third feature data set. In this way, performing the clustering process on the second feature data set and the third feature data set reduces the amount of data processing generated by the clustering. Moreover, because DBSCAN is a density-based clustering algorithm and the dimension of the feature data in the third feature data set is lower than that of the feature data in the first feature data set, obtaining the clustering result by clustering the second feature data set and the third feature data set yields higher clustering accuracy than clustering the first feature data set and the second feature data set.
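A minimal sketch of density-based clustering in the spirit of DBSCAN is given below. It is illustrative only; the `dbscan` function, its parameters `eps` and `min_pts`, and the toy feature points are assumptions rather than the application's actual implementation.

```python
import math

def dbscan(points, eps, min_pts):
    """Minimal DBSCAN: returns a cluster label per point (-1 = noise)."""
    n = len(points)
    labels = [None] * n
    # Precompute the eps-neighborhood of every point (self included).
    neighbors = [
        [j for j in range(n) if math.dist(points[i], points[j]) <= eps]
        for i in range(n)
    ]
    cluster = 0
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(neighbors[i]) < min_pts:
            labels[i] = -1  # noise (may be re-labelled as a border point later)
            continue
        labels[i] = cluster
        queue = list(neighbors[i])
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # former noise becomes a border point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            if len(neighbors[j]) >= min_pts:
                queue.extend(neighbors[j])  # expand from core points only
        cluster += 1
    return labels

# Second + third feature sets pooled together: two dense groups.
features = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1),
            (5.0, 5.0), (5.1, 5.0), (5.0, 5.1)]
labels = dbscan(features, eps=0.5, min_pts=2)
print(labels)  # [0, 0, 0, 1, 1, 1]
```

Each resulting label corresponds to one cluster of the clustering result; points labelled -1 belong to no cluster.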
As an alternative embodiment, the data processing apparatus performs the following steps in performing step 202:
301. and clustering the second characteristic data set and the third characteristic data set to obtain at least two fourth characteristic data sets.
In this embodiment of the present application, a fourth feature data set is a feature data set obtained by the clustering process. In other words, a fourth feature data set may include only feature data from the first feature data set, only feature data from the second feature data set, or feature data from both, and the categories of the feature data within a fourth feature data set are the same.
In this embodiment of the present application, the at least two fourth feature data sets include a fifth feature data set and a sixth feature data set, that is, the fifth feature data set and the sixth feature data set are any two of the at least two fourth feature data sets.
302. And combining the fifth feature data set and the sixth feature data set to obtain a seventh feature data set under the condition that the proportion of matching data pairs among the feature data pairs is greater than or equal to a first threshold.
In the embodiment of the present application, the pair of feature data includes one feature data in the fifth feature data set and one feature data in the sixth feature data set. For example, the fifth feature data set includes feature data a and feature data b, and the sixth feature data set includes feature data c, and then the feature data a and the feature data c may form a feature data pair, and the feature data b and the feature data c may also form a feature data pair.
In this embodiment of the present application, the matching data pair is a feature data pair including two feature data that match each other, that is, the two feature data in the matching data pair are matched with each other. For example, the feature data pair 1 includes feature data a and feature data b, and the feature data pair 2 includes feature data a and feature data c. If the feature data a matches the feature data b, the feature data pair 1 is a matching data pair. If the feature data a matches the feature data c, the feature data pair 2 is a matching data pair.
The two feature data are matched with each other, which means that the probability of the two feature data being the same category is high. Optionally, in a case where the similarity between the two feature data is greater than or equal to the second threshold, the two feature data match each other. In the case where the similarity between the two feature data is smaller than the second threshold value, the two feature data do not match.
A high proportion of matching data pairs among the feature data pairs means that the probability that the feature data in the fifth feature data set and the feature data in the sixth feature data set belong to the same category is high, and a low proportion means that this probability is low. Therefore, combining the fifth feature data set and the sixth feature data set when the proportion of matching data pairs among the feature data pairs is high can improve the accuracy of the clustering result.
In this embodiment of the present application, the data processing apparatus judges, based on the first threshold, whether the proportion of matching data pairs among the feature data pairs is high or low. Specifically, a proportion greater than or equal to the first threshold indicates that the proportion is high, and a proportion smaller than the first threshold indicates that the proportion is low. Thus, the data processing apparatus combines the fifth feature data set and the sixth feature data set to obtain the seventh feature data set when the proportion of matching data pairs among the feature data pairs is greater than or equal to the first threshold.
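The merging rule of step 302 can be sketched as follows. The names `match_ratio` and `maybe_merge`, the toy one-dimensional similarity, and the threshold values are hypothetical, chosen for illustration; the application does not prescribe this code.

```python
def match_ratio(set_a, set_b, similarity, second_threshold):
    """Fraction of cross-set feature pairs whose similarity reaches the threshold."""
    pairs = [(a, b) for a in set_a for b in set_b]
    matched = sum(1 for a, b in pairs if similarity(a, b) >= second_threshold)
    return matched / len(pairs)

def maybe_merge(set_a, set_b, similarity, first_threshold, second_threshold):
    """Merge two clusters when the matching-pair proportion reaches the first threshold."""
    if match_ratio(set_a, set_b, similarity, second_threshold) >= first_threshold:
        return set_a + set_b  # the "seventh feature data set"
    return None  # proportion too low: keep the clusters separate

cos_like = lambda a, b: 1.0 - abs(a - b)  # toy 1-D stand-in for a similarity measure
fifth = [0.10, 0.12]
sixth = [0.11, 0.13]
merged = maybe_merge(fifth, sixth, cos_like,
                     first_threshold=0.8, second_threshold=0.9)
print(merged)  # [0.1, 0.12, 0.11, 0.13]
```

All four cross-set pairs here match (similarity at least 0.9), so the proportion 1.0 exceeds the first threshold and the two sets merge.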
303. And obtaining the clustering result according to the seventh characteristic data set and the characteristic data sets except the fifth characteristic data set and the sixth characteristic data set in the at least two fourth characteristic data sets.
For convenience of description, the feature data sets other than the above-described fifth feature data set and the above-described sixth feature data set among the at least two fourth feature data sets are hereinafter referred to as feature data sets to be processed.
The data processing apparatus performs the merging of the fifth feature data set and the sixth feature data set to obtain the seventh feature data set by performing step 302, and further determines whether there is a combinable feature data set in the seventh feature data set and the feature data set to be processed by performing step 303. And taking the seventh feature data set and the feature data set to be processed as a clustering result when the seventh feature data set and the feature data set to be processed are determined to have no combinable feature data set. And under the condition that the seventh feature data set and the feature data set to be processed have combinable feature data sets, combining the combinable feature data sets until no combinable feature data sets exist, and taking all feature data sets as clustering results.
In this embodiment, after obtaining the at least two fourth feature data sets by performing the clustering process on the second feature data set and the third feature data set, the data processing apparatus further combines the fifth feature data set and the sixth feature data set based on the proportion of matching data pairs among the feature data pairs to obtain the seventh feature data set, and obtains the clustering result according to the seventh feature data set and the feature data sets other than the fifth and sixth feature data sets among the at least two fourth feature data sets. That is, the clustering result is obtained by correcting, based on the proportion of matching data pairs, the at least two fourth feature data sets produced by the clustering process, thereby improving the accuracy of the clustering result.
As an alternative embodiment, the data processing device further performs the steps of:
401. and under the condition that the modification instruction aiming at the second category is detected, modifying the second category according to the modification instruction to obtain a third category of the data set to be marked.
In the embodiment of the present application, the modification instruction carries a modification operation on the second category. Optionally, the data processing device may display the unlabeled data and the second class when the second class is obtained, and the user may input a modification instruction for the second class to the data processing device when it is determined that the class of the unlabeled data is not the second class. And under the condition that the data processing device detects the modification instruction, modifying the second class according to the modification instruction to obtain a third class of the data set to be marked.
For example, the data set to be annotated includes an image a, and the second category of image a is pear. If, after viewing image a, the user determines that its category should be apple, the user inputs a modification instruction for the second category of image a to the data processing apparatus so as to modify the category of image a to a third category, namely apple.
In one possible implementation, the data processing apparatus receives a modification instruction input by the user through an input component. For example, after obtaining the second category, the data processing apparatus displays the second category and outputs a prompt asking whether the second category needs to be modified. In response to the prompt, the user inputs a modification instruction for the second category to the data processing apparatus through the touch display screen.
In another possible implementation, the data processing apparatus receives a modification instruction sent by a terminal. For example, after obtaining the second category, the data processing apparatus displays the second category and outputs a prompt asking whether the second category needs to be modified. In response to the prompt, the user inputs a modification instruction for the second category to the data processing apparatus through a mobile phone.
In this embodiment, after the data processing device obtains the second class of the data set to be annotated, the user may modify the class of the data set to be annotated into the third class by inputting a modification instruction to the data processing device under the condition that the second class is determined to be wrong, so as to improve the accuracy of the class of the data set to be annotated.
As an alternative embodiment, the data processing apparatus obtains the second feature data set of the annotated data set by performing the steps of:
501. and acquiring the data set to be confirmed and the classification model.
In this embodiment of the present application, the data in the data set to be confirmed has a category, and the data set to be confirmed includes first data, that is, the first data is any one data in the data set to be confirmed.
In the embodiment of the application, the classification model is a model for classifying data. For example, the type of data is an image and the classification model may be an image classification neural network. It should be understood that the classification model is a trained model, that is, the recognition accuracy of the classification model meets the expectation, in other words, the category of the data given by the classification model can be regarded as the correct category, and for any data, the classification model recognizes that the obtained category can be regarded as the category of the data.
For example, if the expected accuracy is 98%, the recognition accuracy of the classification model is greater than or equal to 98%. If the classification model identifies the category of image a as pear, the category of image a can be regarded as pear.
In one implementation of acquiring the data set to be confirmed, the data processing apparatus receives the data set to be confirmed input by a user through an input component.
In another implementation manner of obtaining the data set to be confirmed, the data processing device receives the data set to be confirmed sent by the terminal.
In yet another implementation of obtaining the data set to be confirmed, the data processing device downloads data with categories from an open source data set website to obtain the data set to be confirmed.
Optionally, the data processing device downloads data with categories from an open source dataset website to obtain a downloaded dataset. And removing the repeated data in the downloaded data set to obtain the data set to be confirmed. In an implementation manner of removing duplicate data, a data processing device calculates a third similarity between any two data in a downloaded data set, and determines that two data corresponding to the third similarity are duplicate data when the third similarity is greater than or equal to a duplicate threshold, so as to remove any one of the two data. In the data set to be confirmed obtained based on the implementation mode, the third similarity between any two data is smaller than the repetition threshold.
In another implementation of removing duplicate data, the data processing apparatus calculates a message-digest algorithm 5 (MD5) value of each data in the downloaded data set, treats at least two data in the downloaded data set having the same MD5 value as duplicate data, and retains any one of the duplicate data. In the data set to be confirmed obtained based on this implementation, the MD5 values of any two data are different.
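The MD5-based deduplication described above can be sketched with Python's standard hashlib module. The function name `deduplicate_by_md5` and the byte payloads are illustrative assumptions.

```python
import hashlib

def deduplicate_by_md5(records):
    """Keep the first record for each distinct MD5 value; drop the rest."""
    seen = set()
    kept = []
    for payload in records:
        digest = hashlib.md5(payload).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(payload)
    return kept

# Downloaded data set with one exact duplicate.
downloaded = [b"image-bytes-1", b"image-bytes-2", b"image-bytes-1"]
to_confirm = deduplicate_by_md5(downloaded)
print(len(to_confirm))  # 2
```

After this pass, no two records in the resulting data set to be confirmed share an MD5 value.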
In one implementation of obtaining a classification model, a data processing apparatus receives a classification model input by a user through an input component.
In another implementation of obtaining the classification model, the data processing apparatus receives the classification model sent by the terminal.
It should be understood that the data processing apparatus may perform the step of acquiring the data set to be confirmed and the step of acquiring the classification model simultaneously or separately, which is not limited in this application.
502. And identifying a fourth category of the first data by using the classification model.
In this embodiment of the present application, the fourth category is a category of the first data given by the classification model, that is, the correct category of the first data is the fourth category.
503. And when the fourth category is the same as the category indicated by the tag of the first data, the data set to be confirmed is taken as the marked data set.
504. And removing the first data in the data set to be confirmed to obtain the marked data set under the condition that the fourth category is different from the category indicated by the label of the first data.
In this embodiment of the present application, the data in the to-be-confirmed data set includes a tag, and the tag indicates a class of the corresponding data. Because the fourth category is the correct category of the first data, whether the category indicated by the tag of the first data is correct or not can be judged according to the fourth category.
Therefore, the data processing apparatus uses the data set to be confirmed as the marked data set in the case that the fourth category is the same as the category indicated by the tag of the first data, and the accuracy of the category of the data in the marked data set can be improved. The fourth category is different from the category indicated by the tag of the first data, which indicates that the category indicated by the tag of the first data is wrong, so that the data processing device removes the first data in the data set to be confirmed to obtain the marked data set under the condition that the fourth category is different from the category indicated by the tag of the first data, namely, removes the first data from the data set to be confirmed to obtain the marked data set, and the accuracy of the category of the data in the marked data set can be improved.
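Steps 502 to 504 amount to keeping only the data whose label the trained classifier confirms. A minimal sketch, assuming a toy classifier and hypothetical labels (the real classifier would be a trained neural network):

```python
def filter_confirmed(dataset, classify):
    """Keep only samples whose label matches the class predicted by the trained classifier."""
    return [(x, label) for x, label in dataset if classify(x) == label]

# Toy classifier: even feature -> "pear", odd feature -> "apple" (hypothetical rule).
classify = lambda x: "pear" if x % 2 == 0 else "apple"
to_confirm = [(2, "pear"), (3, "apple"), (4, "apple")]  # last label is wrong
annotated = filter_confirmed(to_confirm, classify)
print(annotated)  # [(2, 'pear'), (3, 'apple')]
```

The sample whose label disagrees with the classifier's prediction is removed, leaving the marked data set.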
505. And carrying out feature extraction processing on the marked data set to obtain the second feature data set.
In this embodiment, when the data processing apparatus determines the fourth category of the first data using the classification model, it determines whether the category indicated by the tag of the first data is correct based on the fourth category, so that accuracy of data in the labeled data set can be improved, and accuracy of the category of the second feature data set can be improved when the second feature data set is obtained by performing feature extraction processing on the labeled data set.
It should be understood that the first data in this embodiment is only a description object selected for briefly describing the technical solution, and should not be understood that the data set to be confirmed includes only the first data, or that the classification model is only used to determine whether the category indicated by the label of the first data is correct. In practical applications, the data set to be confirmed may include n data, where n is greater than or equal to 1, and the data processing apparatus may also determine whether the category indicated by the tag of each data in the data set to be confirmed is correct, respectively, using the classification model.
As an alternative embodiment, the data in the set of data to be annotated and the data in the annotated set of data are both images.
As an alternative embodiment, the first category and the second category are both merchandise categories. Based on the implementation mode, the data processing device can obtain the commodity category of the data in the data set to be marked. For example, the data in the set of data to be annotated and the data in the annotated set of data are both images, and the first category of the images in the annotated set of data is the category of the merchandise in the image. Through the technical scheme, the category of the commodity in the image in the data set to be marked can be determined based on the first category of the marked data set, and the second category is obtained.
As an alternative embodiment, the data processing device further performs the steps of:
601. and obtaining a model to be trained.
In the embodiment of the application, the model to be trained can be any neural network. For example, the model to be trained may be composed of a stack of at least one of the following network layers: a convolution layer, a pooling layer, a normalization layer, a fully connected layer, a downsampling layer, an upsampling layer, and a classifier. The embodiment of the application does not limit the structure of the model to be trained.
In one implementation of obtaining a model to be trained, a data processing apparatus receives a model to be trained input by a user through an input component.
In another implementation manner of obtaining the model to be trained, the data processing device receives the model to be trained sent by the terminal.
602. And obtaining a training data set according to the second category, the data set to be marked and the marked data set.
In this embodiment of the present application, the training data set includes data in the to-be-labeled data set and data in the labeled data set, where a category indicated by a label of the to-be-labeled data set is a second category, and a category indicated by a label of the labeled data set is a first category.
603. And training the model to be trained by using the training data set to obtain a trained model.
In this embodiment, once the second category of the data set to be annotated is obtained, the data processing apparatus may take the data set to be annotated as training data. Therefore, obtaining the training data set according to the second category, the data set to be annotated, and the annotated data set is equivalent to expanding the quantity of training data. Training the model to be trained with this training data set can thus improve the training effect.
As an alternative embodiment, the training data set comprises second data and third data, wherein the category of the second data is the same as the category of the third data. For example, the second data and the third data are pear. The data processing apparatus performs the following steps in the process of performing step 603:
701. And determining the first similarity between the second data and the third data.
In one possible implementation manner, the second data and the third data are images, the category of the second data and the category of the third data are categories of objects in the images, and the first similarity represents similarity of the objects in the second data and the objects in the third data. For example, the object is a commodity, and the first similarity characterizes a similarity of the commodity in the second data and the commodity in the third data.
Optionally, the data processing device calculates cosine similarity between the feature data of the second data and the feature data of the third data, so as to obtain the first similarity.
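The cosine-similarity computation mentioned above can be sketched as follows; the feature vectors are toy values chosen for illustration.

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

feat_second = [1.0, 0.0]  # feature data of the second data
feat_third = [1.0, 1.0]   # feature data of the third data
s1 = cosine_similarity(feat_second, feat_third)
print(round(s1, 4))  # 0.7071
```

The result lies in [-1, 1]; the closer to 1, the more similar the two feature data are taken to be.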
In another possible implementation manner, the second data and the third data are text, the category of the second data and the category of the third data are categories of objects described by the text, and the first similarity characterizes similarity of semantics of the second data and semantics of the third data.
702. And obtaining a first loss according to the first similarity.
In this embodiment, the first similarity is inversely related to the first loss. In one possible implementation, the first loss is l1, the first similarity is s1, and l1 and s1 satisfy the following formula:

l1 = k1/s1 … formula (1)

where k1 is a positive number.

In another possible implementation, the first loss is l1, the first similarity is s1, and l1 and s1 satisfy the following formula:

l1 = k1/s1 + c1 … formula (2)

where k1 is a positive number and c1 is a constant.

In yet another possible implementation, the first loss is l1, the first similarity is s1, and l1 and s1 satisfy formula (3), where k1 is a positive number.
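Formulas (1) and (2) can be written directly in code; the function names and the parameter values below are illustrative, not part of the claimed method.

```python
def first_loss_v1(s1, k1):
    """Formula (1): loss inversely related to the same-class similarity."""
    return k1 / s1

def first_loss_v2(s1, k1, c1):
    """Formula (2): formula (1) shifted by a constant."""
    return k1 / s1 + c1

# Lower similarity between same-class samples -> larger loss.
print(first_loss_v1(0.5, k1=1.0))          # 2.0
print(first_loss_v2(0.5, k1=1.0, c1=0.1))  # 2.1
```

Both variants grow as s1 shrinks, so minimising the loss pushes same-class similarity upward.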
Optionally, the data processing apparatus uses the model to be trained to identify the category of data in the training dataset, resulting in a fifth category, before performing step 703. A difference is determined between the fifth category and the category of data in the training dataset. And obtaining a first loss according to the difference and the first similarity, wherein the difference is positively correlated with the first loss, and the first similarity is negatively correlated with the first loss. After the first penalty is obtained, step 703 is performed.
703. And updating parameters of the model to be trained according to the first loss to obtain the trained model.
In one possible implementation, the data processing apparatus updates the parameters of the model to be trained according to the first loss until the first loss converges, thereby completing the training of the model to be trained and obtaining the trained model.
In this embodiment, the second data and the third data have the same category, so the similarity between them should be as high as possible. The data processing apparatus updates the parameters of the model to be trained according to the first loss; that is, the first loss characterizes the optimization direction of the parameters of the model to be trained. Therefore, by obtaining the first loss according to the first similarity under the condition that the first similarity and the first loss are negatively correlated, the data processing apparatus in effect takes increasing the first similarity as the optimization direction of the parameters. Thus, updating the parameters according to the first loss increases the first similarity between the second data and the third data as determined by the model, that is, reduces the distance between the feature data of the second data and the feature data of the third data, which improves the recognition accuracy of the model to be trained.
Optionally, the data processing device calculates the similarity of any two data with the same category in the training data set respectively, and then sums all the similarities to obtain the first similarity sum. And obtaining a first loss according to the first similarity sum, wherein the first loss and the first similarity sum are in negative correlation. At this time, according to the first loss, the parameters of the model to be trained are updated, so that the similarity of the data with the same category determined by the model to be trained can be improved, namely, the distance of the feature data with the same category is reduced, and therefore, the recognition accuracy of the model to be trained can be improved.
Optionally, the third data is the data with the lowest similarity to the second data in the homogeneous data set, where the category of the data in the homogeneous data set is the same as the category of the second data. For example, the training data set includes data a, data b, data c, and the second data, where the category of data a, the category of data c, and the category of the second data are the same. In this case, the homogeneous data set includes data a and data c. If the similarity between data a and the second data is lower than the similarity between data c and the second data, the third data is data a.
When the third data is the data with the lowest similarity to the second data in the homogeneous data set, the feature data of the third data is, within the homogeneous data set, the farthest from the feature data of the second data; that is, the third data is the data in the homogeneous data set most likely to be misidentified. In this case, obtaining the first loss according to the first similarity, under the condition that the first similarity and the first loss are negatively correlated, is equivalent to taking the first similarity as the optimization direction of the parameters of the model to be trained. Thus, updating the parameters according to the first loss increases the first similarity between the second data and the third data as determined by the model, that is, reduces the distance between the feature data of the second data and the feature data of the sample most likely to be misidentified in the homogeneous data set, which improves the recognition accuracy of the model to be trained.
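The selection of the hardest same-category sample described above can be sketched as follows; `hardest_positive`, the toy one-dimensional similarity, and the sample values are hypothetical.

```python
def hardest_positive(anchor, dataset, similarity):
    """Pick, from samples sharing the anchor's category, the one least similar to it."""
    same_class = [(x, c) for x, c in dataset if c == anchor[1] and x != anchor[0]]
    return min(same_class, key=lambda item: similarity(anchor[0], item[0]))

sim = lambda a, b: 1.0 - abs(a - b)  # toy 1-D stand-in for a similarity measure
train = [(0.1, "pear"), (0.9, "pear"), (0.2, "apple")]
second = (0.15, "pear")  # the second data (anchor)
third = hardest_positive(second, train, sim)
print(third)  # (0.9, 'pear') -- the same-category sample farthest from the anchor
```

Training against this hardest positive pulls together the feature data most at risk of being misidentified within the category.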
As an alternative embodiment, the training data set further includes fourth data, where the category of the second data is different from the category of the fourth data. For example, the second data is a pear and the fourth data is an apple. The data processing apparatus further performs the following steps:
801. and determining a second similarity between the second data and the fourth data.
The implementation of this step may be referred to as the implementation of determining the first similarity in step 701, which will not be described herein.
In case a second similarity is obtained, the data processing device performs the following steps in performing step 702:
802. and obtaining the first loss according to the first similarity and the second similarity.
In this embodiment, the first similarity is inversely related to the first loss, and the second similarity is positively related to the first loss. In one possible implementation, the first loss is l1, the first similarity is s1, the second similarity is s2, and l1, s1, and s2 satisfy the following formula:

l1 = k1/s1 + k2 × s2 … formula (4)

where k1 and k2 are positive numbers.

In another possible implementation, the first loss is l1, the first similarity is s1, the second similarity is s2, and l1, s1, and s2 satisfy the following formula:

l1 = k1/s1 + k2 × s2 + c2 … formula (5)

where k1 and k2 are positive numbers and c2 is a constant.

In yet another possible implementation, the first loss is l1, the first similarity is s1, the second similarity is s2, and l1, s1, and s2 satisfy formula (6), where k1 and k2 are positive numbers and c2 is a constant.
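Formulas (4) and (5) can likewise be written as a single function, where c2 = 0 recovers formula (4); the names and values below are illustrative only.

```python
def first_loss_pos_neg(s1, s2, k1, k2, c2=0.0):
    """Formulas (4)/(5): penalise low same-category similarity s1
    and high cross-category similarity s2 (c2 = 0 gives formula (4))."""
    return k1 / s1 + k2 * s2 + c2

# s1: similarity to the third data (same category, should be high);
# s2: similarity to the fourth data (different category, should be low).
loss = first_loss_pos_neg(s1=0.5, s2=0.25, k1=1.0, k2=2.0)
print(loss)  # 2.5
```

Minimising this loss simultaneously raises s1 and lowers s2, matching the stated negative and positive correlations.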
Optionally, the data processing apparatus uses the model to be trained to identify a class of data in the training dataset, resulting in a fifth class, before performing step 802. A difference is determined between the fifth category and the category of data in the training dataset. And obtaining a first loss according to the difference, the first similarity and the second similarity, wherein the difference is positively correlated with the first loss, the first similarity is negatively correlated with the first loss, and the second similarity is positively correlated with the first loss.
In this embodiment, the second data and the fourth data have different categories, so the similarity between them should be as low as possible. The data processing apparatus updates the parameters of the model to be trained according to the first loss; that is, the first loss characterizes the optimization direction of the parameters of the model to be trained. Therefore, by obtaining the first loss according to the first similarity and the second similarity, under the condition that the first similarity is negatively correlated with the first loss and the second similarity is positively correlated with the first loss, the data processing apparatus in effect takes the first similarity and the second similarity as the optimization direction of the parameters. Thus, updating the parameters according to the first loss increases the first similarity between the second data and the third data and reduces the second similarity between the second data and the fourth data as determined by the model, that is, reduces the distance between the feature data of the second data and the feature data of the third data while increasing the distance between the feature data of the second data and the feature data of the fourth data, which improves the recognition accuracy of the model to be trained.
Optionally, the data processing apparatus calculates the similarity of every two data with different categories in the training data set, and then sums all the similarities to obtain a second similarity sum. The first loss is obtained according to the second similarity sum, where the first loss and the second similarity sum are positively correlated. In this case, updating the parameters of the model to be trained according to the first loss reduces the similarity, as determined by the model, of data with different categories, that is, increases the distance between feature data of different categories, which improves the recognition accuracy of the model to be trained.
Optionally, the fourth data is the data with the highest similarity to the second data in the heterogeneous data set, wherein the category of the data in the heterogeneous data set is different from the category of the second data. For example, the training data set includes data a, data b, data c, data d, and the second data, wherein the category of data b and the category of data d are different from the category of the second data. At this time, the heterogeneous data set includes data b and data d. If the similarity between data b and the second data is lower than the similarity between data d and the second data, the fourth data is data d.
And under the condition that the fourth data is the data with the highest similarity to the second data in the heterogeneous data set, the characteristic data of the fourth data is the data closest to the characteristic data of the second data in the heterogeneous data set, that is, the fourth data is the data most likely to be misidentified in the heterogeneous data set. At this time, the data processing apparatus obtains the first loss according to the first similarity and the second similarity under the condition that the first similarity is negatively correlated with the first loss and the second similarity is positively correlated with the first loss, which is equivalent to taking increasing the first similarity and decreasing the second similarity as the optimization direction of the parameters of the model to be trained. Thus, updating the parameters of the model to be trained according to the first loss can increase the first similarity between the second data and the third data determined by the model to be trained and reduce the second similarity between the second data and the fourth data determined by the model to be trained, that is, reduce the distance between the characteristic data of the second data and the characteristic data of the data most likely to be misidentified in the similar data set, and increase the distance between the characteristic data of the second data and the characteristic data of the data most likely to be misidentified in the heterogeneous data set, so that the identification accuracy of the model to be trained can be improved.
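Selecting the fourth data as the highest-similarity item in the heterogeneous data set (a hard negative) can be sketched as follows; the `(feature, category)` tuple layout of the data set is a hypothetical convention, and cosine similarity is again an assumed measure:

```python
def cosine_similarity(a, b):
    # Cosine similarity of two nonzero feature vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5
    nb = sum(x * x for x in b) ** 0.5
    return dot / (na * nb)

def hardest_negative(second_feat, second_cat, dataset):
    # dataset: list of (feature, category) pairs; the heterogeneous data set
    # holds the entries whose category differs from that of the second data.
    hetero = [(f, c) for f, c in dataset if c != second_cat]
    # The fourth data is the heterogeneous entry most similar to the second
    # data, i.e. the one most likely to be misidentified.
    return max(hetero, key=lambda fc: cosine_similarity(second_feat, fc[0]))
```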
It will be appreciated by those skilled in the art that, in the methods of the specific embodiments above, the written order of the steps does not imply a strict execution order; the execution order of the steps should be determined by their functions and possible internal logic.
If the technical solution of the present application involves personal information, a product applying the technical solution of the present application clearly informs the individual of the personal information processing rules and obtains the individual's independent consent before processing the personal information. If the technical solution of the present application involves sensitive personal information, a product applying the technical solution of the present application obtains the individual's separate consent before processing the sensitive personal information and, at the same time, meets the requirement of "explicit consent". For example, a clear and conspicuous sign is set up at a personal information collection device such as a camera to inform that the personal information collection range has been entered and that personal information will be collected; if an individual voluntarily enters the collection range, this is deemed consent to the collection of his or her personal information. Alternatively, on a device that processes personal information, where the personal information processing rules are communicated by obvious signs or notices, personal authorization is obtained by means of a pop-up message, by requesting the individual to upload his or her personal information, or the like. The personal information processing rules may include information such as the personal information processor, the purpose of the personal information processing, the processing method, and the types of personal information to be processed.
The foregoing details the method of embodiments of the present application, and the apparatus of embodiments of the present application is provided below.
Referring to fig. 2, fig. 2 is a schematic structural diagram of a data processing apparatus according to an embodiment of the present application, where the data processing apparatus 1 includes: the acquiring unit 11, the first processing unit 12, the second processing unit 13, specifically:
an obtaining unit 11, configured to obtain a first feature data set of a data set to be annotated and a second feature data set of an annotated data set;
a first processing unit 12, configured to obtain a clustering result by performing clustering processing on the first feature data set and the second feature data set;
and the second processing unit 13 is configured to obtain a second category of the data set to be labeled according to the clustering result and the first category of the labeled data set.
In combination with any one of the embodiments of the present application, the first processing unit 12 is configured to:
reducing the dimension of the feature data in the first feature data set to obtain a third feature data set;
and clustering the second characteristic data set and the third characteristic data set to obtain the clustering result.
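The first processing unit's two steps — reduce the dimension of the feature data in the first feature data set, then cluster the combined features — might be sketched as below. Truncating components stands in for a real reduction such as PCA, and the greedy threshold clustering is only one possible clustering method; neither is specified by the source:

```python
def cosine_similarity(a, b):
    # Cosine similarity of two nonzero feature vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5
    nb = sum(x * x for x in b) ** 0.5
    return dot / (na * nb)

def reduce_dim(feats, k):
    # Placeholder for a real reduction (e.g. PCA): truncate each feature
    # vector to its first k components so all features share one dimension.
    return [f[:k] for f in feats]

def greedy_cluster(feats, threshold=0.9):
    # Greedy single-pass clustering: a feature joins the first cluster whose
    # founding member is similar enough; otherwise it founds a new cluster.
    centers, labels = [], []
    for f in feats:
        for cid, c in enumerate(centers):
            if cosine_similarity(f, c) >= threshold:
                labels.append(cid)
                break
        else:
            centers.append(f)
            labels.append(len(centers) - 1)
    return labels
```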
In combination with any one of the embodiments of the present application, the first processing unit 12 is configured to:
Clustering the second characteristic data set and the third characteristic data set to obtain at least two fourth characteristic data sets; the at least two fourth feature data sets include a fifth feature data set and a sixth feature data set;
combining the fifth feature data set and the sixth feature data set under the condition that the proportion of matched data pairs among the feature data pairs is greater than or equal to a first threshold value, so as to obtain a seventh feature data set; the pair of feature data includes one feature data in the fifth feature data set and one feature data in the sixth feature data set; the matched data pair is the feature data pair comprising two mutually matched feature data;
and obtaining the clustering result according to the seventh characteristic data set and the characteristic data sets except the fifth characteristic data set and the sixth characteristic data set in the at least two fourth characteristic data sets.
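The merge rule — combine two clusters when the proportion of matched cross-pairs reaches the first threshold — could look like this sketch, where the matching predicate and the default threshold value are assumptions:

```python
def merge_if_matched(fifth, sixth, matched, first_threshold=0.5):
    # A feature data pair takes one element from the fifth set and one from
    # the sixth; count how many pairs the matching predicate accepts.
    pairs = [(x, y) for x in fifth for y in sixth]
    ratio = sum(1 for x, y in pairs if matched(x, y)) / len(pairs)
    if ratio >= first_threshold:
        return [fifth + sixth]   # merged into a seventh feature data set
    return [fifth, sixth]        # proportion below threshold: keep separate
```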
In combination with any one of the embodiments of the present application, the second processing unit 13 is further configured to:
and under the condition that the modification instruction aiming at the second category is detected, modifying the second category according to the modification instruction to obtain a third category of the data set to be annotated.
In combination with any one of the embodiments of the present application, the obtaining unit 11 is configured to:
acquiring a data set to be confirmed and a classification model; the data set to be confirmed comprises first data;
identifying a fourth category of the first data using the classification model;
taking the data set to be confirmed as the marked data set under the condition that the fourth category is the same as the category indicated by the label of the first data;
removing the first data in the data set to be confirmed to obtain the marked data set under the condition that the fourth category is different from the category indicated by the label of the first data;
and carrying out feature extraction processing on the marked data set to obtain the second feature data set.
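The acquiring unit's verification step — keep a sample only when the classification model's fourth category agrees with the category indicated by its label, and remove mismatches — can be sketched as follows; the `(sample, category)` pair layout and the `classify` callable are assumptions:

```python
def build_annotated_set(to_confirm, classify):
    # to_confirm: list of (sample, labeled_category); classify returns the
    # model's predicted (fourth) category. Samples whose prediction disagrees
    # with the category indicated by their label are removed.
    return [(x, cat) for x, cat in to_confirm if classify(x) == cat]
```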
In combination with any one of the embodiments of the present application, the data in the to-be-annotated data set and the data in the annotated data set are both images.
In combination with any one of the embodiments of the present application, the first category and the second category are both commodity categories.
In combination with any embodiment of the present application, the obtaining unit 11 is further configured to obtain a model to be trained;
the second processing unit 13 is further configured to:
obtaining a training data set according to the second category, the data set to be marked and the marked data set;
And training the model to be trained by using the training data set to obtain a trained model.
In combination with any one of the embodiments of the present application, the training data set includes second data and third data, and a category of the second data is the same as a category of the third data;
the second processing unit 13 is configured to:
determining a first similarity of the second data and the third data;
obtaining a first loss according to the first similarity; the first similarity is inversely related to the first loss;
and updating parameters of the model to be trained according to the first loss to obtain the trained model.
In combination with any of the embodiments of the present application, the training data set further includes fourth data, and the category of the second data is different from the category of the fourth data;
the second processing unit 13 is further configured to:
determining a second similarity of the second data and the fourth data;
obtaining the first loss according to the first similarity and the second similarity; the first loss is positively correlated with the second similarity.
In the embodiment of the application, under the condition that the first characteristic data set of the data set to be marked and the second characteristic data set of the marked data set are obtained, the data processing device performs clustering processing on the first characteristic data set and the second characteristic data set to obtain a clustering result of the characteristic data in the first characteristic data set and the characteristic data in the second characteristic data set. And then, obtaining a second class of the data set to be marked according to the clustering result and the first class of the marked data set, thereby determining the class of the unlabeled data through the marked data set.
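Putting the embodiment together, the way a second category could be derived from the clustering result and the first category might be sketched as below; the majority vote within each cluster is an assumed policy not stated in the source, and the clustering function is supplied by the caller:

```python
from collections import Counter

def propagate_labels(unlabeled_feats, labeled_feats, labeled_cats, cluster_fn):
    # Cluster the unlabeled and labeled features together, then assign each
    # unlabeled item the majority category of the labeled items that fall in
    # the same cluster (None when a cluster holds no labeled item).
    all_feats = unlabeled_feats + labeled_feats
    cids = cluster_fn(all_feats)
    n = len(unlabeled_feats)
    by_cluster = {}
    for cid, cat in zip(cids[n:], labeled_cats):
        by_cluster.setdefault(cid, []).append(cat)
    second_cats = []
    for cid in cids[:n]:
        cats = by_cluster.get(cid)
        second_cats.append(Counter(cats).most_common(1)[0][0] if cats else None)
    return second_cats
```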
In some embodiments, functions or modules included in the apparatus provided in the embodiments of the present application may be used to perform the methods described in the foregoing method embodiments, and specific implementations thereof may refer to descriptions of the foregoing method embodiments, which are not repeated herein for brevity.
Fig. 3 is a schematic hardware structure of an electronic device according to an embodiment of the present application. The electronic device 2 comprises a processor 21 and a memory 22. Optionally, the electronic device 2 further comprises input means 23 and output means 24. The processor 21, memory 22, input device 23, and output device 24 are coupled by connectors, which include various interfaces, transmission lines, buses, etc., and which are not limited in this application. It should be understood that in the various embodiments of the present application, coupled means interconnected in a particular manner, including directly or indirectly through other devices, e.g., through various interfaces, transmission lines, buses, etc.
The processor 21 may comprise one or more processors, for example one or more central processing units (central processing unit, CPU), which in the case of a CPU may be a single core CPU or a multi core CPU. Alternatively, the processor 21 may be a processor group formed by a plurality of GPUs, and the plurality of processors are coupled to each other through one or more buses. In the alternative, the processor may be another type of processor, and the embodiment of the present application is not limited.
Memory 22 may be used to store computer program instructions as well as various types of computer program code for performing aspects of the present application. Optionally, the memory includes, but is not limited to, a random access memory (random access memory, RAM), a read-only memory (ROM), an erasable programmable read-only memory (erasable programmable read only memory, EPROM), or a portable read-only memory (compact disc read-only memory, CD-ROM) for associated instructions and data.
The input means 23 are for inputting data and/or signals and the output means 24 are for outputting data and/or signals. The input device 23 and the output device 24 may be separate devices or may be an integral device.
It will be appreciated that in the embodiment of the present application, the memory 22 may be used to store not only related instructions, but also related data, for example, the memory 22 may be used to store a first feature data set of a to-be-annotated data set and a second feature data set of an annotated data set obtained through the input device 23, or the memory 22 may also be used to store a second class of to-be-annotated data sets obtained through the processor 21, etc., and the embodiment of the present application is not limited to the data specifically stored in the memory.
It will be appreciated that fig. 3 shows only a simplified design of an electronic device. In practical applications, the electronic device may further include other necessary elements, including but not limited to any number of input/output devices, processors, memories, etc., and all electronic devices that may implement the embodiments of the present application are within the scope of protection of the present application.
Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
It will be clear to those skilled in the art that, for convenience and brevity of description, specific working procedures of the above-described systems, apparatuses and units may refer to corresponding procedures in the foregoing method embodiments, and are not repeated herein. It will be further apparent to those skilled in the art that the descriptions of the various embodiments herein are provided with emphasis, and that the same or similar parts may not be explicitly described in different embodiments for the sake of convenience and brevity of description, and thus, parts not described in one embodiment or in detail may be referred to in the description of other embodiments.
In the several embodiments provided in this application, it should be understood that the disclosed systems, devices, and methods may be implemented in other manners. For example, the apparatus embodiments described above are merely illustrative, e.g., the division of the units is merely a logical function division, and there may be additional divisions when actually implemented, e.g., multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed with each other may be an indirect coupling or communication connection via some interfaces, devices or units, which may be in electrical, mechanical or other form.
The units described as separate units may or may not be physically separate, and units shown as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional unit in each embodiment of the present application may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit.
The above embodiments may be implemented in whole or in part by software, hardware, firmware, or any combination thereof. When implemented in software, they may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer instructions are loaded and executed on a computer, the processes or functions in accordance with the embodiments of the present application are produced in whole or in part. The computer may be a general purpose computer, a special purpose computer, a computer network, or other programmable apparatus. The computer instructions may be stored in or transmitted via a computer-readable storage medium. The computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center in a wired (e.g., coaxial cable, optical fiber, digital subscriber line (digital subscriber line, DSL)) or wireless (e.g., infrared, radio, microwave, etc.) manner. The computer-readable storage medium may be any available medium that can be accessed by a computer, or a data storage device such as a server or data center that integrates one or more available media. The available medium may be a magnetic medium (e.g., a floppy disk, a hard disk, a magnetic tape), an optical medium (e.g., a digital versatile disc (digital versatile disc, DVD)), a semiconductor medium (e.g., a solid state disk (SSD)), or the like.
Those of ordinary skill in the art will appreciate that implementing all or part of the above-described method embodiments may be accomplished by a computer program to instruct related hardware, the program may be stored in a computer readable storage medium, and the program may include the above-described method embodiments when executed. And the aforementioned storage medium includes: a read-only memory (ROM) or a random access memory (random access memory, RAM), a magnetic disk or an optical disk, or the like.

Claims (13)

1. A method of data processing, the method comprising:
acquiring a first characteristic data set of a data set to be marked and a second characteristic data set of the marked data set;
clustering is carried out on the first characteristic data set and the second characteristic data set to obtain a clustering result;
and obtaining a second category of the data set to be marked according to the clustering result and the first category of the marked data set.
2. The method according to claim 1, wherein the clustering result is obtained by performing clustering processing on the first feature data set and the second feature data set, and the method comprises:
Reducing the dimension of the feature data in the first feature data set to obtain a third feature data set;
and clustering the second characteristic data set and the third characteristic data set to obtain the clustering result.
3. The method according to claim 2, wherein the clustering result is obtained by performing clustering processing on the second feature data set and the third feature data set, including:
clustering the second characteristic data set and the third characteristic data set to obtain at least two fourth characteristic data sets; the at least two fourth feature data sets include a fifth feature data set and a sixth feature data set;
combining the fifth feature data set and the sixth feature data set under the condition that the proportion of matched data pairs among the feature data pairs is greater than or equal to a first threshold value, so as to obtain a seventh feature data set; the pair of feature data includes one feature data in the fifth feature data set and one feature data in the sixth feature data set; the matched data pair is the feature data pair comprising two mutually matched feature data;
And obtaining the clustering result according to the seventh characteristic data set and the characteristic data sets except the fifth characteristic data set and the sixth characteristic data set in the at least two fourth characteristic data sets.
4. A method according to any one of claims 1 to 3, characterized in that the method further comprises:
and under the condition that the modification instruction aiming at the second category is detected, modifying the second category according to the modification instruction to obtain a third category of the data set to be annotated.
5. A method according to any one of claims 1 to 3, wherein said obtaining a second feature dataset of the annotated dataset comprises:
acquiring a data set to be confirmed and a classification model; the data set to be confirmed comprises first data;
identifying a fourth category of the first data using the classification model;
taking the data set to be confirmed as the marked data set under the condition that the fourth category is the same as the category indicated by the label of the first data;
removing the first data in the data set to be confirmed to obtain the marked data set under the condition that the fourth category is different from the category indicated by the label of the first data;
And carrying out feature extraction processing on the marked data set to obtain the second feature data set.
6. A method according to any one of claims 1 to 3, wherein the data in the set of data to be annotated and the data in the annotated set of data are both images.
7. The method of claim 6, wherein the first category and the second category are both merchandise categories.
8. A method according to any one of claims 1 to 3, characterized in that the method further comprises:
obtaining a model to be trained;
obtaining a training data set according to the second category, the data set to be marked and the marked data set;
and training the model to be trained by using the training data set to obtain a trained model.
9. The method of claim 8, wherein the training data set comprises second data and third data, the second data being of the same class as the third data;
the training the model to be trained by using the training data set to obtain a trained model, comprising:
determining a first similarity of the second data and the third data;
Obtaining a first loss according to the first similarity; the first similarity is inversely related to the first loss;
and updating parameters of the model to be trained according to the first loss to obtain the trained model.
10. The method of claim 9, wherein the training data set further comprises fourth data, the second data having a category different from a category of the fourth data;
before the obtaining the first loss according to the first similarity, the method further includes:
determining a second similarity of the second data and the fourth data;
and obtaining a first loss according to the first similarity, wherein the method comprises the following steps:
obtaining the first loss according to the first similarity and the second similarity; the first loss is positively correlated with the second similarity.
11. A data processing apparatus, the apparatus comprising:
the acquisition unit is used for acquiring a first characteristic data set of the data set to be marked and a second characteristic data set of the marked data set;
the first processing unit is used for obtaining a clustering result by carrying out clustering processing on the first characteristic data set and the second characteristic data set;
And the second processing unit is used for obtaining a second category of the data set to be marked according to the clustering result and the first category of the marked data set.
12. An electronic device, comprising: a processor and a memory for storing computer program code comprising computer instructions which, when executed by the processor, cause the electronic device to perform the method of any one of claims 1 to 10.
13. A computer readable storage medium, characterized in that the computer readable storage medium has stored therein a computer program comprising program instructions which, when executed by a processor, cause the processor to perform the method of any of claims 1 to 10.
CN202211272085.XA 2022-10-18 2022-10-18 Data processing method and device, electronic equipment and computer readable storage medium Pending CN117786430A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211272085.XA CN117786430A (en) 2022-10-18 2022-10-18 Data processing method and device, electronic equipment and computer readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211272085.XA CN117786430A (en) 2022-10-18 2022-10-18 Data processing method and device, electronic equipment and computer readable storage medium

Publications (1)

Publication Number Publication Date
CN117786430A true CN117786430A (en) 2024-03-29

Family

ID=90385733

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211272085.XA Pending CN117786430A (en) 2022-10-18 2022-10-18 Data processing method and device, electronic equipment and computer readable storage medium

Country Status (1)

Country Link
CN (1) CN117786430A (en)

Similar Documents

Publication Publication Date Title
CN110297935A (en) Image search method, device, medium and electronic equipment
US20150235142A1 (en) System and method for identification of multimedia content elements
CN108830329B (en) Picture processing method and device
EP2450808A2 (en) Semantic visual search engine
US20160012304A1 (en) Systems, methods, and devices for image matching and object recognition in images using minimal feature points
CN112330455B (en) Method, device, equipment and storage medium for pushing information
WO2021135455A1 (en) Semantic recall method, apparatus, computer device, and storage medium
WO2021042763A1 (en) Image searches based on word vectors and image vectors
WO2020147409A1 (en) Text classification method and apparatus, computer device, and storage medium
CN111291571A (en) Semantic error correction method, electronic device and storage medium
CN111325156A (en) Face recognition method, device, equipment and storage medium
CN112766284B (en) Image recognition method and device, storage medium and electronic equipment
CN112418320B (en) Enterprise association relation identification method, device and storage medium
CN110909817B (en) Distributed clustering method and system, processor, electronic device and storage medium
WO2021041342A1 (en) Semantic image retrieval for whole slide images
CN117786430A (en) Data processing method and device, electronic equipment and computer readable storage medium
CN112541357B (en) Entity identification method and device and intelligent equipment
CN114443864A (en) Cross-modal data matching method and device and computer program product
CN111428767B (en) Data processing method and device, processor, electronic equipment and storage medium
CN114528378A (en) Text classification method and device, electronic equipment and storage medium
CN113837307A (en) Data similarity calculation method and device, readable medium and electronic equipment
CN112418321A (en) Identification method and device of sign image
CN113011153A (en) Text correlation detection method, device, equipment and storage medium
CN111368083A (en) Text classification method, device and equipment based on intention confusion and storage medium
CN112906724A (en) Image processing device, method, medium and system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination