CN118152519A - Sample cleaning method and device, electronic equipment and storage medium - Google Patents

Sample cleaning method and device, electronic equipment and storage medium

Info

Publication number
CN118152519A
Authority
CN
China
Prior art keywords: sample, subset, samples, labeling, determining
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202410332295.6A
Other languages
Chinese (zh)
Inventor
张章伟
安鹏
沙爱晖
周斌
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Shizhuang Information Technology Co ltd
Original Assignee
Shanghai Shizhuang Information Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Shizhuang Information Technology Co ltd filed Critical Shanghai Shizhuang Information Technology Co ltd
Priority to CN202410332295.6A
Publication of CN118152519A
Legal status: Pending

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application discloses a sample cleaning method, a sample cleaning device, an electronic device and a storage medium, and relates to the technical field of artificial intelligence. The method comprises the following steps: performing semantic clustering on the samples in a corpus to obtain sample sets of a plurality of cluster categories; determining the label categories contained in the sample set of each cluster category, and dividing the sample set of each cluster category into a plurality of sample subsets based on the label categories; and, for the sample set of each cluster category, determining a target sample subset with labeling errors from the plurality of sample subsets, thereby accomplishing the sample cleaning task for the corpus. With the technical scheme provided by the application, mislabeled samples in the corpus can be located relatively quickly, the identified mislabeled samples have relatively high accuracy, and the cost of time and manpower is effectively reduced.

Description

Sample cleaning method and device, electronic equipment and storage medium
Technical Field
The present application relates to the field of artificial intelligence technologies, and in particular, to a method and apparatus for cleaning a sample, an electronic device, and a storage medium.
Background
The quality of text labels in a corpus has an important influence on the results of a text model: high-quality corpus data helps the text model achieve better business results after training, so text classification tasks generally require that the labels in the corpus be accurate and unique. In actual business, however, the texts in the corpus come from a wide range of sources and grow rapidly, labeling is inaccurate, and the same text may receive inconsistent labels at different times; such labeling results adversely affect the model after training. The corpus therefore needs to be cleaned to improve its data quality and, in turn, the business effect of the trained model.
The following methods are commonly used in the prior art to clean a corpus. First, re-inspect the quality of the labeled texts. This method can find mislabeled samples, but it amounts to labeling everything a second time, so the labor and time costs are high. Second, batch the labeled texts and perform random-sampling quality inspection on each batch. This avoids full quality inspection, but when the pass rate of each batch is low the work approaches a full re-labeling, and the cost is comparable to that of the first method. Third, assume that low-quality texts are prone to labeling errors and use a classification model that recognizes text quality to reject the low-quality texts in the corpus. This yields a corpus with fewer labeling errors, but it cannot judge whether a high-quality yet mislabeled text is wrong; moreover, low-quality texts have a certain business value, so rejecting them does not meet business requirements, and training an additional classification model to judge text quality is itself costly. Designing a sample cleaning method that is accurate, efficient, and reduces the investment of manpower and time is therefore a problem to be solved.
Disclosure of Invention
The application provides a sample cleaning method, a sample cleaning device, an electronic device and a storage medium, which can quickly locate mislabeled samples in a corpus; the identified mislabeled samples have high accuracy, so the cost of time and manpower is effectively reduced.
In a first aspect, the present application provides a method of cleaning a sample, the method comprising:
carrying out semantic clustering processing on samples in the corpus to obtain a sample set of a plurality of clustering categories;
determining a label category contained in a sample set of each cluster category, and dividing the sample set of each cluster category into a plurality of sample subsets based on the label category;
and determining, for the sample set of each cluster category, a target sample subset with labeling errors from the plurality of sample subsets, thereby realizing the sample cleaning task of the corpus.
Further, the semantic clustering processing is performed on the samples in the corpus to obtain a sample set of a plurality of clustering categories, including: determining a plurality of cluster features and weights corresponding to each cluster feature; carrying out semantic clustering processing on the samples based on the plurality of cluster features and the weights corresponding to each cluster feature to obtain sample sets of the plurality of cluster categories; wherein the cluster features include at least one of text edit distance, keyword semantic similarity, and text semantic similarity; the sample set comprises a plurality of samples and labeling labels corresponding to each sample.
Further, for the sample set of the current cluster category, the determining, from the plurality of sample subsets, the target sample subset with the labeling error includes: determining a sample number for each sample subset; if the sample numbers of the plurality of sample subsets are the same, determining that a target sample subset with wrong labeling does not exist in the sample set of the current cluster category; if the sample numbers of the plurality of sample subsets are different, determining the maximum sample number and the minimum sample number according to the sample numbers of the plurality of sample subsets; and sequentially calculating whether each sample subset is a target sample subset with wrong labeling based on the sample number of each sample subset, the maximum sample number and the minimum sample number, so as to obtain the target sample subset with wrong labeling in the sample set of the current cluster category.
Further, the sequentially calculating, based on the number of samples of each sample subset, the maximum number of samples, and the minimum number of samples, whether each sample subset is a target sample subset with a labeling error includes: calculating, for a current sample subset, a false labeling probability for the current sample subset based on a sample number of the current sample subset, the maximum sample number, and the minimum sample number; if the error labeling probability is larger than a first preset value, determining that the current sample subset is a target sample subset with labeling errors; and traversing the plurality of sample subsets to obtain the error labeling probability of each sample subset, thereby obtaining whether each sample subset is a target sample subset with labeling errors.
Further, calculating the error labeling probability of the current sample subset based on the sample number of the current sample subset, the maximum sample number and the minimum sample number through the following formula:
where S represents the probability of error labeling, i represents the index number of the current sample subset among the plurality of sample subsets, m_i represents the number of samples of the current sample subset, m1 represents the maximum number of samples, and m2 represents the minimum number of samples.
Further, the determining, from the plurality of sample subsets, the target sample subset having the labeling error includes: determining a sample number for each sample subset; if the number of the samples is smaller than a second preset value, determining that a sample subset corresponding to the number of the samples is a target sample subset with labeling errors.
Further, before the semantic clustering processing is performed on the samples in the corpus to obtain a sample set of a plurality of clustering categories, the method further includes: and carrying out format conversion on the samples in the corpus based on a preset format to obtain processed samples.
In a second aspect, the present application provides a sample cleaning device comprising:
The sample clustering module is used for carrying out semantic clustering processing on samples in the corpus to obtain sample sets of a plurality of clustering categories;
The sample dividing module is used for determining label categories contained in a sample set of each cluster category, and dividing the sample set of each cluster category into a plurality of sample subsets based on the label categories;
The sample cleaning module is used for determining, for the sample set of each cluster category, a target sample subset with labeling errors from the plurality of sample subsets, so as to realize the sample cleaning task of the corpus.
In a third aspect, the present application provides an electronic device comprising: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores a computer program executable by the at least one processor to enable the at least one processor to perform the method of cleaning a sample according to any of the embodiments of the present application.
In a fourth aspect, the present application provides a computer readable storage medium storing computer instructions for causing a processor to perform a method for cleaning a sample according to any embodiment of the present application.
In order to overcome the defects of the prior art described in the background, an embodiment of the application provides a sample cleaning method whose execution brings the following beneficial effects: the samples in the corpus are first clustered into a plurality of sample sets; each sample set is then divided into corresponding sample subsets according to the label categories it contains; finally, each sample subset is analyzed to determine whether it contains labeling errors. The method can quickly locate mislabeled samples in the corpus, the identified mislabeled samples have high accuracy, and the cost of time and manpower is effectively reduced.
It should be noted that the above-mentioned computer instructions may be stored in whole or in part on a computer-readable storage medium. The computer readable storage medium may be packaged together with the processor of the sample cleaning device, or may be packaged separately from the processor of the sample cleaning device, which is not limited in the present application.
The description of the second, third and fourth aspects of the present application may refer to the detailed description of the first aspect; moreover, the advantages described in the second aspect, the third aspect and the fourth aspect may refer to the analysis of the advantages of the first aspect, and are not described herein.
It should be understood that the description in this section is not intended to identify key or critical features of the embodiments of the application or to delineate the scope of the application. Other features of the present application will become apparent from the description that follows.
It can be understood that, before the technical solutions disclosed in the embodiments of the present application are used, the user should be informed of the type, scope of use and usage scenarios of the personal information involved in the present application, and the user's authorization should be obtained in an appropriate manner in accordance with relevant laws and regulations.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings required for the description of the embodiments will be briefly introduced below. It is apparent that the drawings in the following description are only some embodiments of the present application, and that other drawings may be obtained from them by a person skilled in the art without inventive effort.
FIG. 1 is a schematic flow chart of a method for cleaning a sample according to an embodiment of the present application;
FIG. 2 is a second schematic flow chart of a method for cleaning a sample according to an embodiment of the present application;
Fig. 3 is a schematic structural diagram of a sample cleaning device according to an embodiment of the present application;
Fig. 4 is a block diagram of an electronic device for implementing a method for cleaning a sample according to an embodiment of the present application.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the embodiments of the present application more apparent, the technical solutions of the embodiments of the present application will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present application, and it is apparent that the described embodiments are only some embodiments of the present application, not all embodiments of the present application. All other embodiments, which can be made by those skilled in the art based on the embodiments of the present application without making any inventive effort, shall fall within the scope of the present application.
It should be noted that the terms "first," "second," "target," "original," etc. in the description, the claims, and the above drawings are used to distinguish between similar objects and are not necessarily used to describe a particular sequence or chronological order. It is to be understood that data so used may be interchanged where appropriate, so that the embodiments of the application described herein can be implemented in orders other than those illustrated or described here. Furthermore, the terms "comprises," "comprising," and any variations thereof are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
Fig. 1 is a schematic flow chart of a sample cleaning method according to an embodiment of the present application, where the embodiment is applicable to a situation in which samples in a corpus are cleaned. The method for cleaning a sample provided in the present embodiment may be performed by the apparatus for cleaning a sample provided in the present embodiment, where the apparatus may be implemented in software and/or hardware, and integrated in an electronic device for performing the method.
Referring to fig. 1, the method of the present embodiment includes, but is not limited to, the following steps:
S110, carrying out semantic clustering processing on the samples in the corpus to obtain a sample set of a plurality of clustering categories.
The corpus here is a corpus that has already undergone initial labeling, and the sample set comprises a plurality of samples and the label corresponding to each sample.
Further, since the corpus may contain samples in different formats, before performing semantic clustering processing on the samples in the corpus to obtain a sample set of a plurality of clustering categories, the method further includes: converting the samples in the corpus into a unified format based on a preset format to obtain processed samples, so that the corpus becomes a set of samples to be clustered, each presented as a text together with its label.
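As an illustration only, the following sketch shows one way such a unified-format conversion could look, assuming raw records arrive either as JSON objects or as simple "text,label" CSV rows; the field names and input formats are assumptions, not part of the application.

```python
# Minimal sketch of the preprocessing step: normalize heterogeneous raw records
# into {"text": ..., "label": ...} samples. The record formats are assumed.
import csv
import json
from typing import Iterable


def normalize_corpus(raw_records: Iterable[str]) -> list[dict]:
    samples = []
    for record in raw_records:
        record = record.strip()
        if not record:
            continue
        if record.startswith("{"):                 # JSON-style record
            obj = json.loads(record)
            samples.append({"text": obj["text"], "label": obj["label"]})
        else:                                      # assumed "text,label" CSV row
            text, label = next(csv.reader([record]))
            samples.append({"text": text, "label": label})
    return samples
```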
Specifically, performing semantic clustering processing on samples in a corpus to obtain a sample set of a plurality of clustering categories, including: determining a plurality of cluster features and weights corresponding to each cluster feature; and carrying out semantic clustering processing on the samples based on the plurality of clustering features and the weight corresponding to each clustering feature to obtain sample sets of a plurality of clustering categories. Wherein the clustering features include at least one of text edit distance, keyword semantic similarity, and text semantic similarity; according to a preferred embodiment of the application, the clustering features may include, for example, text edit distance, keyword semantic similarity, and text semantic similarity.
In an embodiment for a content security service scenario, the weight of the text edit distance may be set to 0.3, the weight of the keyword semantic similarity to 0.3, and the weight of the text semantic similarity to 0.4. The clustering features and the corresponding weights of this embodiment can be adjusted for different service scenarios. Compared with clustering directly on text semantics alone, this clustering method has higher accuracy and is suitable for content security business scenarios.
Optionally, based on the text edit distance and the semantic similarity scores, the samples in the corpus may be clustered semantically using, for example, K-Nearest Neighbor (kNN) or another clustering method; a more accurate clustering method further narrows the range of samples in the corpus that need quality inspection, thereby reducing the quality-inspection cost.
For example, assume that the corpus contains 10,000 samples, each consisting of a text and its label, with the labels being "a, b, c, d, e, f". Different weights are assigned to the text edit distance, keyword semantic similarity and text semantic similarity, and semantic clustering produces sample sets for 100 cluster categories, numbered 0 to 99.
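The fragment below is a minimal sketch of how such weighted semantic clustering might be assembled, not the implementation claimed by the application: edit-distance similarity is approximated with difflib, keyword similarity with token-set overlap, text semantic similarity with TF-IDF cosine as a stand-in for an embedding model, and the combined distance is clustered with scikit-learn's agglomerative clustering (scikit-learn >= 1.2 is assumed for the `metric` parameter). The weights 0.3/0.3/0.4 follow the example above; every function and threshold here is illustrative.

```python
# Sketch of S110: combine three similarity signals with configurable weights,
# then cluster on the resulting distance matrix. O(n^2) pairwise loop: fine for
# a sketch, too slow for a 10,000-sample corpus without batching or pruning.
import difflib

import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


def cluster_samples(texts, w_edit=0.3, w_keyword=0.3, w_semantic=0.4, threshold=0.5):
    n = len(texts)
    tfidf = TfidfVectorizer().fit_transform(texts)
    semantic_sim = cosine_similarity(tfidf)            # text semantic similarity
    keyword_sets = [set(t.split()) for t in texts]     # crude keyword extraction
    sim = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            edit_sim = difflib.SequenceMatcher(None, texts[i], texts[j]).ratio()
            keyword_sim = (len(keyword_sets[i] & keyword_sets[j])
                           / max(len(keyword_sets[i] | keyword_sets[j]), 1))
            sim[i, j] = (w_edit * edit_sim
                         + w_keyword * keyword_sim
                         + w_semantic * semantic_sim[i, j])
    distance = np.clip(1.0 - sim, 0.0, None)           # weighted distance matrix
    clusterer = AgglomerativeClustering(n_clusters=None, metric="precomputed",
                                        linkage="average", distance_threshold=threshold)
    return clusterer.fit_predict(distance)             # one cluster id per sample
```

Calling cluster_samples(texts) on the texts of the example corpus would return one cluster id per sample, playing the role of the cluster categories numbered 0 to 99.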
S120, determining label categories contained in the sample set of each cluster category, and dividing the sample set of each cluster category into a plurality of sample subsets based on the label categories.
Wherein the number of sample subsets is the same as the number of tag categories contained in the sample set, one tag category corresponding to each sample subset.
Illustratively, suppose statistics show a total of 500 samples in the sample set numbered 0, with five label categories, "a, b, c, d, e", respectively. The 500 samples are then divided into five sample subsets by label category.
S130, determining, for the sample set of each cluster category, a target sample subset with labeling errors from the plurality of sample subsets, so as to realize the sample cleaning task of the corpus.
In the embodiment of the application, for the sample set of the current cluster category, a target sample subset with labeling errors is determined from the plurality of sample subsets of that sample set; the sample sets of all cluster categories are traversed to obtain the target sample subsets with labeling errors in each sample set, thereby accomplishing the sample cleaning task of the corpus. The target sample subsets are the samples in the corpus that need quality inspection.
Specifically, determining a target sample subset with labeling errors from a plurality of sample subsets includes: determining a sample number for each sample subset; if the number of the samples is smaller than a second preset value, determining that the sample subset corresponding to the number of the samples is a target sample subset with marking errors.
In the embodiment of the application, texts with similar semantics are generally considered in this field to carry the same label, so the samples in the corpus can be clustered and the distribution of sample numbers per label in each cluster can be counted. A label with a comparatively small number of samples in a cluster is, with high probability, attached to mislabeled texts, which therefore require quality inspection. The second preset value can be set according to actual application requirements.
Illustratively, assume that statistics show a total of 500 samples in the sample set numbered 0, with five label categories "a, b, c, d, e", where the subset for label a has 300 samples, the subset for label b has 100 samples, the subset for label c has 80 samples, the subset for label d has 10 samples, and the subset for label e has 10 samples. In this case, the samples corresponding to labels d and e in this cluster category are, with high probability, mislabeled and should undergo quality inspection. Traversing the sample sets of cluster categories 0 to 99 in the same way screens out the samples that are most likely mislabeled.
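The following sketch shows one way steps S120/S130 with the "second preset value" rule could be realized; the data layout and the threshold of 20 are assumptions made for illustration, not values from the application.

```python
# Sketch of S120/S130: split each cluster's sample set into per-label subsets,
# then flag subsets smaller than an assumed "second preset value".
from collections import defaultdict


def flag_small_subsets(samples, cluster_ids, min_count=20):
    """samples: list of {"text": ..., "label": ...}; cluster_ids: one cluster id per sample."""
    # S120: per-cluster, per-label subsets.
    clusters = defaultdict(lambda: defaultdict(list))
    for sample, cid in zip(samples, cluster_ids):
        clusters[cid][sample["label"]].append(sample)
    # S130 (second-preset-value variant): flag suspiciously small subsets.
    flagged = []
    for cid, subsets in clusters.items():
        for label, subset in subsets.items():
            if len(subset) < min_count:          # assumed "second preset value"
                flagged.append((cid, label, subset))
    return flagged, clusters
```

With the figures from the example above (a: 300, b: 100, c: 80, d: 10, e: 10) and a threshold of 20, the subsets for labels d and e in cluster 0 would be flagged for quality inspection.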
According to the technical scheme provided by this embodiment, sample sets of a plurality of cluster categories are obtained by performing semantic clustering on the samples in the corpus; the label categories contained in the sample set of each cluster category are determined, and the sample set of each cluster category is divided into a plurality of sample subsets based on the label categories; and, for the sample set of each cluster category, a target sample subset with labeling errors is determined from the plurality of sample subsets, thereby accomplishing the sample cleaning task for the corpus. The samples in the corpus are first clustered into a plurality of sample sets, each sample set is then divided into corresponding sample subsets according to the label categories it contains, and finally each sample subset is analyzed to determine whether it contains labeling errors. The method can quickly locate mislabeled samples in the corpus, and the identified mislabeled samples have high accuracy, effectively reducing the cost of time and manpower.
The method for cleaning a sample according to the embodiment of the present application is further described below; fig. 2 is a second schematic flow chart of the method for cleaning a sample according to an embodiment of the present application. This embodiment is optimized on the basis of the above embodiment, specifically as follows: for the sample set of the current cluster category, this embodiment explains in detail the process of determining the target sample subset with labeling errors from the plurality of sample subsets.
Referring to fig. 2, the method of the present embodiment includes, but is not limited to, the following steps:
S210, determining the sample number of each sample subset.
In the embodiment of the application, semantic clustering processing is carried out on samples in a corpus to obtain sample sets of a plurality of clustering categories, label categories contained in the sample set of each clustering category are determined, and the sample set of each clustering category is divided into a plurality of sample subsets based on the label categories. For each sample set of the cluster category, a number of samples for each sample subset is determined, and a determination is made as to whether the number of samples for each sample subset is the same.
S220, if the sample numbers of the plurality of sample subsets are the same, determining that no target sample subset with labeling errors exists in the sample set of the current cluster category.
In the embodiment of the application, when the sample numbers of the plurality of sample subsets are the same, the error labeling probability of the corresponding sample set is 0, indicating that the sample set contains no mislabeled samples.
S230, if the sample numbers of the plurality of sample subsets are different, determining the maximum sample number and the minimum sample number according to the sample numbers of the plurality of sample subsets.
In the embodiment of the application, when the sample numbers of the plurality of sample subsets differ, the label category whose sample subset contains the most samples is identified to obtain the maximum sample number, and the label category whose sample subset contains the fewest samples is identified to obtain the minimum sample number.
S240, for the current sample subset, calculating the error labeling probability of the current sample subset based on the sample number of the current sample subset, the maximum sample number and the minimum sample number.
Specifically, the error labeling probability of the current sample subset is calculated according to the following formula based on the sample number, the maximum sample number and the minimum sample number of the current sample subset:
where S represents the error labeling probability (corresponding to a label error rate score), a score normalized over the per-label sample numbers within the sample set of each cluster category; i represents the index of the current sample subset among the plurality of sample subsets; m_i represents the sample number of the current sample subset; m1 represents the maximum sample number; and m2 represents the minimum sample number.
The larger the value of S for a sample subset, the greater the probability that the sample subset has a false label (i.e., the sample subset is the target sample subset), and vice versa. The probability of false labeling for a sample subset is related to the number of label categories within the sample subset.
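Purely as an illustrative assumption (the specific formula is defined in the original publication and may differ), one form consistent with the properties just described — S is normalized, grows as the subset size shrinks, and depends on m_i, m1 and m2 — is:

S = (m1 − m_i) / (m1 − m2)

Under this assumed form, S equals 1 for the smallest subset, 0 for the largest, and is undefined when m1 = m2, which is exactly the case handled separately in S220.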
S250, if the error labeling probability is larger than a first preset value, determining that the current sample subset is a target sample subset with labeling errors.
The first preset value may be set according to the actual application scenario; according to a preferred embodiment of the application, in a traffic safety scenario the first preset value may be set to 0.8, for example. Experiments show that the range of samples to be inspected as determined by the traditional methods accounts for 50% to 100% of the total samples in the corpus. In the traffic safety scenario, quality inspection is performed only on the sample subsets whose error labeling probability is larger than 0.8, so that only a small portion of samples is finally inspected (about 10% of the total samples in the corpus, saving more than 50% of the labor cost compared with the traditional methods), while the corpus still reaches the same sample accuracy as the traditional methods (generally above 99.5%); the cost of cleaning the corpus is thus significantly reduced.
S260, traversing the plurality of sample subsets to obtain the error labeling probability of each sample subset, thereby obtaining the target sample subset with labeling errors in the sample set of the current cluster category.
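As a sketch tying S210–S260 together, the routine below scores every label subset of every cluster with the assumed normalization given earlier and flags subsets whose score exceeds the first preset value (0.8 in the scenario described above); the scoring formula and data layout are assumptions, not the application's published implementation.

```python
# Sketch of S210-S260 with an assumed scoring formula; `clusters` is the
# {cluster_id: {label: [samples]}} mapping built in the earlier sketch.
def flag_by_error_probability(clusters, first_preset=0.8):
    flagged = []
    for cid, subsets in clusters.items():
        counts = {label: len(subset) for label, subset in subsets.items()}
        m1, m2 = max(counts.values()), min(counts.values())
        if m1 == m2:                          # S220: equal subset sizes -> no target subset
            continue
        for label, m_i in counts.items():
            s = (m1 - m_i) / (m1 - m2)        # assumed normalized error-labeling score
            if s > first_preset:              # S250: compare with the first preset value
                flagged.append((cid, label, s, subsets[label]))
    return flagged
```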
In the embodiment corresponding to fig. 1, when the sample number of a sample subset is smaller than the second preset value, the sample subset is considered to contain labeling errors. Because the number of label categories differs across the sample sets of different cluster categories, it is difficult to set one second preset value that suits every sample set; in that case, the method of this embodiment for calculating the error labeling probability of a sample subset can also be used to judge whether the sample subset contains labeling errors.
According to the technical scheme provided by this embodiment, the sample number of each sample subset is determined; if the sample numbers of the plurality of sample subsets are the same, it is determined that no target sample subset with labeling errors exists in the sample set of the current cluster category; if the sample numbers of the plurality of sample subsets are different, the maximum sample number and the minimum sample number are determined from the sample numbers of the plurality of sample subsets; for the current sample subset, the error labeling probability of the current sample subset is calculated based on its sample number, the maximum sample number and the minimum sample number; if the error labeling probability is larger than the first preset value, the current sample subset is determined to be a target sample subset with labeling errors; and the plurality of sample subsets are traversed to obtain the error labeling probability of each sample subset, thereby obtaining the target sample subsets with labeling errors in the sample set of the current cluster category. By judging whether each sample subset contains labeling errors through its calculated error labeling probability, this method reduces the number of target sample subsets (i.e., samples to be inspected) compared with traditional sample cleaning methods; it can also quickly locate mislabeled samples in the corpus, and the identified mislabeled samples have high accuracy, effectively reducing the cost of time and manpower.
Fig. 3 is a schematic structural diagram of a sample cleaning device according to an embodiment of the present application, and as shown in fig. 3, the device 300 may include:
the sample clustering module 310 is configured to perform semantic clustering on samples in the corpus to obtain a sample set of a plurality of clustering categories;
A sample dividing module 320, configured to determine a label category included in a sample set of each cluster category, and divide the sample set of each cluster category into a plurality of sample subsets based on the label category;
The sample cleaning module 330 is configured to determine, for the sample set of each cluster category, a target sample subset with a labeling error from the plurality of sample subsets, so as to implement a sample cleaning task for the corpus.
Further, the sample clustering module 310 may be specifically configured to: determining a plurality of cluster features and weights corresponding to each cluster feature; carrying out semantic clustering processing on the samples based on the plurality of cluster features and the weights corresponding to each cluster feature to obtain sample sets of the plurality of cluster categories; wherein the cluster features include at least one of text edit distance, keyword semantic similarity, and text semantic similarity; the sample set comprises a plurality of samples and labeling labels corresponding to each sample.
Further, the sample cleaning module 330 may be specifically configured to: determining the number of samples of each sample subset for the sample set of the current cluster category; if the sample numbers of the plurality of sample subsets are the same, determining that a target sample subset with wrong labeling does not exist in the sample set of the current cluster category; if the sample numbers of the plurality of sample subsets are different, determining the maximum sample number and the minimum sample number according to the sample numbers of the plurality of sample subsets; and sequentially calculating whether each sample subset is a target sample subset with wrong labeling based on the sample number of each sample subset, the maximum sample number and the minimum sample number, so as to obtain the target sample subset with wrong labeling in the sample set of the current cluster category.
Further, the sample cleaning module 330 may be specifically configured to: calculating, for a current sample subset, a false labeling probability for the current sample subset based on a sample number of the current sample subset, the maximum sample number, and the minimum sample number; if the error labeling probability is larger than a first preset value, determining that the current sample subset is a target sample subset with labeling errors; and traversing the plurality of sample subsets to obtain the error labeling probability of each sample subset, thereby obtaining whether each sample subset is a target sample subset with labeling errors.
Further, the sample cleaning module 330 may be specifically configured to: calculating the error labeling probability of the current sample subset based on the sample number of the current sample subset, the maximum sample number and the minimum sample number through the following formula:
where S represents the probability of error labeling, i represents the index number of the current sample subset among the plurality of sample subsets, m_i represents the number of samples of the current sample subset, m1 represents the maximum number of samples, and m2 represents the minimum number of samples.
Further, the sample cleaning module 330 may be specifically configured to: determining a sample number for each sample subset; if the number of the samples is smaller than a second preset value, determining that a sample subset corresponding to the number of the samples is a target sample subset with labeling errors.
Further, the above-mentioned cleaning device for a sample may further include: a sample processing module;
The sample processing module is used for carrying out format conversion on the samples in the corpus based on a preset format before carrying out semantic clustering processing on the samples in the corpus to obtain sample sets of a plurality of clustering categories.
The sample cleaning device provided by the embodiment is applicable to the sample cleaning method provided by any embodiment, and has corresponding functions and beneficial effects.
Fig. 4 is a block diagram of an electronic device for implementing a method for cleaning a sample according to an embodiment of the present application. The electronic device 10 is intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. Electronic equipment may also represent various forms of mobile devices, such as personal digital processing, cellular telephones, smartphones, wearable devices (e.g., helmets, glasses, watches, etc.), and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be exemplary only, and are not meant to limit implementations of the applications described and/or claimed herein.
As shown in fig. 4, the electronic device 10 includes at least one processor 11, and a memory, such as a Read Only Memory (ROM) 12, a Random Access Memory (RAM) 13, etc., communicatively connected to the at least one processor 11, in which the memory stores a computer program executable by the at least one processor, and the processor 11 may perform various appropriate actions and processes according to the computer program stored in the Read Only Memory (ROM) 12 or the computer program loaded from the storage unit 18 into the Random Access Memory (RAM) 13. In the RAM 13, various programs and data required for the operation of the electronic device 10 may also be stored. The processor 11, the ROM 12 and the RAM 13 are connected to each other via a bus 14. An input/output (I/O) interface 15 is also connected to bus 14.
Various components in the electronic device 10 are connected to the I/O interface 15, including: an input unit 16 such as a keyboard, a mouse, etc.; an output unit 17 such as various types of displays, speakers, and the like; a storage unit 18 such as a magnetic disk, an optical disk, or the like; and a communication unit 19 such as a network card, modem, wireless communication transceiver, etc. The communication unit 19 allows the electronic device 10 to exchange information/data with other devices via a computer network, such as the internet, and/or various telecommunication networks.
The processor 11 may be a variety of general and/or special purpose processing components having processing and computing capabilities. Some examples of processor 11 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various specialized Artificial Intelligence (AI) computing chips, various processors running machine learning model algorithms, digital Signal Processors (DSPs), and any suitable processor, controller, microcontroller, etc. The processor 11 performs the various methods and processes described above, such as a method of cleaning a sample.
In some embodiments, the method of cleaning a sample may be implemented as a computer program tangibly embodied on a computer-readable storage medium, such as storage unit 18. In some embodiments, part or all of the computer program may be loaded and/or installed onto the electronic device 10 via the ROM 12 and/or the communication unit 19. When the computer program is loaded into RAM 13 and executed by processor 11, one or more steps of the sample washing method described above may be performed. Alternatively, in other embodiments, the processor 11 may be configured to perform the method of cleaning the sample in any other suitable way (e.g., by means of firmware).
Various implementations of the systems and techniques described above can be realized in digital electronic circuitry, integrated circuit systems, field programmable gate arrays (FPGAs), application specific integrated circuits (ASICs), application specific standard products (ASSPs), systems on chip (SOCs), complex programmable logic devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: implementation in one or more computer programs that may be executed and/or interpreted on a programmable system including at least one programmable processor, which may be a special purpose or general-purpose programmable processor that can receive data and instructions from, and transmit data and instructions to, a storage system, at least one input device, and at least one output device.
A computer program for carrying out methods of the present application may be written in any combination of one or more programming languages. These computer programs may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the computer programs, when executed by the processor, cause the functions/acts specified in the flowchart and/or block diagram block or blocks to be implemented. The computer program may execute entirely on the machine, partly on the machine, as a stand-alone software package, partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of the present application, a computer-readable storage medium may be a tangible medium that can contain, or store a computer program for use by or in connection with an instruction execution system, apparatus, or device. The computer readable storage medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. Alternatively, the computer readable storage medium may be a machine readable signal medium. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on an electronic device having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) through which a user can provide input to the electronic device. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user may be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic input, speech input, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a background component (e.g., as a data server) or that includes a middleware component (e.g., an application server) or that includes a front-end component through which a user can interact with an implementation of the systems and techniques described here, or any combination of such background, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), wide Area Networks (WANs), blockchain networks, and the internet.
The computing system may include clients and servers. A client and a server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server can be a cloud server, also called a cloud computing server or a cloud host, which is a host product in a cloud computing service system and overcomes the defects of difficult management and weak service expansibility found in traditional physical hosts and VPS (Virtual Private Server) services.
Note that the above is only a preferred embodiment of the present application and the technical principle applied. It will be understood by those skilled in the art that the present application is not limited to the particular embodiments described herein, but is capable of various obvious changes, rearrangements and substitutions as will now become apparent to those skilled in the art without departing from the scope of the application. For example, one skilled in the art may use the various forms of flow shown above to reorder, add, or delete steps; the steps recited in the present application may be performed in parallel, sequentially or in a different order, and are not limited herein as long as the desired results of the technical solution of the present application can be achieved.
The above embodiments do not limit the scope of the present application. It will be apparent to those skilled in the art that various modifications, combinations, sub-combinations and alternatives are possible, depending on design requirements and other factors. Any modifications, equivalent substitutions and improvements made within the spirit and principles of the present application should be included in the scope of the present application.

Claims (10)

1. A method of cleaning a sample, the method comprising:
carrying out semantic clustering processing on samples in the corpus to obtain a sample set of a plurality of clustering categories;
determining a label category contained in a sample set of each cluster category, and dividing the sample set of each cluster category into a plurality of sample subsets based on the label category;
and determining, for the sample set of each cluster category, a target sample subset with labeling errors from the plurality of sample subsets, thereby realizing the sample cleaning task of the corpus.
2. The method for cleaning samples according to claim 1, wherein the performing semantic clustering on the samples in the corpus to obtain a sample set of a plurality of clustering categories includes:
determining a plurality of cluster features and weights corresponding to each cluster feature;
carrying out semantic clustering processing on the samples based on the plurality of cluster features and the weights corresponding to each cluster feature to obtain sample sets of the plurality of cluster categories;
Wherein the cluster features include at least one of text edit distance, keyword semantic similarity, and text semantic similarity; the sample set comprises a plurality of samples and labeling labels corresponding to each sample.
3. The method for cleaning samples according to claim 1, wherein the determining, for the sample set of the current cluster category, the target sample subset having the labeling error from the plurality of sample subsets includes:
determining a sample number for each sample subset;
If the sample numbers of the plurality of sample subsets are the same, determining that a target sample subset with wrong labeling does not exist in the sample set of the current cluster category;
If the sample numbers of the plurality of sample subsets are different, determining the maximum sample number and the minimum sample number according to the sample numbers of the plurality of sample subsets;
And sequentially calculating whether each sample subset is a target sample subset with wrong labeling or not based on the sample number of each sample subset, the maximum sample number and the minimum sample number, so as to obtain the target sample subset with wrong labeling in the sample set of the current cluster category.
4. The method for cleaning samples according to claim 3, wherein sequentially calculating whether each sample subset is a target sample subset with a labeling error based on the sample number of each sample subset, the maximum sample number and the minimum sample number comprises:
calculating, for a current sample subset, a false labeling probability for the current sample subset based on a sample number of the current sample subset, the maximum sample number, and the minimum sample number;
If the error labeling probability is larger than a first preset value, determining that the current sample subset is a target sample subset with labeling errors;
and traversing the plurality of sample subsets to obtain the error labeling probability of each sample subset, thereby obtaining whether each sample subset is a target sample subset with labeling errors.
5. The method of cleaning samples according to claim 4, wherein the probability of false labeling of the current sample subset is calculated based on the number of samples of the current sample subset, the maximum number of samples, and the minimum number of samples by the following formula:
where S represents the probability of error labeling, i represents the index number of the current sample subset among the plurality of sample subsets, m_i represents the number of samples of the current sample subset, m1 represents the maximum number of samples, and m2 represents the minimum number of samples.
6. The method for cleaning samples according to claim 1, wherein determining a target sample subset having a labeling error from the plurality of sample subsets comprises:
determining a sample number for each sample subset;
if the number of the samples is smaller than a second preset value, determining that a sample subset corresponding to the number of the samples is a target sample subset with labeling errors.
7. The method for cleaning samples according to claim 1, further comprising, before the semantic clustering of the samples in the corpus to obtain a sample set of a plurality of cluster categories:
And carrying out format conversion on the samples in the corpus based on a preset format to obtain processed samples.
8. A device for cleaning samples in a corpus, the device comprising:
The sample clustering module is used for carrying out semantic clustering processing on samples in the corpus to obtain sample sets of a plurality of clustering categories;
The sample dividing module is used for determining label categories contained in a sample set of each cluster category, and dividing the sample set of each cluster category into a plurality of sample subsets based on the label categories;
The sample cleaning module is used for determining, for the sample set of each cluster category, a target sample subset with labeling errors from the plurality of sample subsets, so as to realize the sample cleaning task of the corpus.
9. An electronic device, the electronic device comprising:
at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores a computer program executable by the at least one processor to enable the at least one processor to perform the method of cleaning a sample as claimed in any one of claims 1 to 7.
10. A computer readable storage medium storing computer instructions for causing a processor to perform the method of cleaning a sample according to any one of claims 1 to 7.
CN202410332295.6A 2024-03-22 2024-03-22 Sample cleaning method and device, electronic equipment and storage medium Pending CN118152519A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410332295.6A CN118152519A (en) 2024-03-22 2024-03-22 Sample cleaning method and device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN118152519A true CN118152519A (en) 2024-06-07

Family

ID=91296652

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410332295.6A Pending CN118152519A (en) 2024-03-22 2024-03-22 Sample cleaning method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN118152519A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination