CN118013188A - Method, device, equipment and storage medium for processing noise data - Google Patents


Info

Publication number
CN118013188A
CN118013188A (application CN202211396523.3A)
Authority
CN
China
Prior art keywords
data
enterprise
training
noise data
label
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211396523.3A
Other languages
Chinese (zh)
Inventor
潘利星
余电
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
SF Technology Co Ltd
Original Assignee
SF Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by SF Technology Co Ltd filed Critical SF Technology Co Ltd
Priority to CN202211396523.3A priority Critical patent/CN118013188A/en
Publication of CN118013188A publication Critical patent/CN118013188A/en
Pending legal-status Critical Current


Landscapes

  • Complex Calculations (AREA)

Abstract

The application belongs to the technical field of data processing and provides a method, an apparatus, a device, and a storage medium for processing noise data, where the method comprises: acquiring noise data screened from a plurality of pieces of enterprise data; determining information error data and label error data in the noise data; and removing the information error data and updating the label error data to obtain target noise data, where the target noise data is used for training an enterprise classification model. In this embodiment, the information error data and the label error data in the noise data are determined, and the information error data, which cannot be corrected, is removed, so the finally obtained target noise data is free of interference from information error data, which improves model accuracy when the enterprise classification model is trained with the target noise data. The label error data is updated rather than discarded, so the updated label error data can still be used for training the enterprise classification model, which safeguards the number of samples needed for subsequently training the enterprise classification model.

Description

Method, device, equipment and storage medium for processing noise data
Technical Field
The present application relates to the field of data processing technologies, and in particular, to a method, an apparatus, a device, and a storage medium for processing noise data.
Background
Accurately identifying the industry category of an enterprise is an important part of business analysis and investment decision-making. In essence, industry category identification classifies data related to an enterprise and determines the enterprise's industry category from the classification result.
At present, enterprise-related data is generally classified by a trained classification model to obtain the enterprise's industry category. When training the classification model, the quality of the training data is critical to the model's learning effect.
A high-performance classification model relies on a large amount of high-quality labeled training data, and the quality of that data depends heavily on manual labeling. The higher the required labeling quality, the greater the labeling difficulty. As a result, large datasets typically contain a large amount of noise data, such as samples with label errors.
In the prior art, noise data in training data is screened and filtered through confidence learning, but the noise data screened in this way is inaccurate.
Disclosure of Invention
In view of the above, embodiments of the present application provide a method, an apparatus, a device, and a storage medium for processing noise data, so as to solve the following problem in the prior art: the screened noise data is inaccurate, data that does not actually belong to noise data is judged to be noise data, and removing all of the noise data affects the quantity and quality of the training data, ultimately making the trained classification model inaccurate.
A first aspect of an embodiment of the present application provides a method of processing noise data, the method comprising:
acquiring noise data screened from a plurality of pieces of enterprise data, where the noise data carries a label, and the label is used to identify the industry category of an enterprise;
determining information error data and label error data in the noise data, where the information error data is data whose confidence is smaller than a preset threshold, and the label error data is the data in the noise data other than the information error data;
and, for the noise data, removing the information error data and updating the label error data to obtain target noise data, where the target noise data is used for training an enterprise classification model.
In this scheme, on the basis of the screened noise data, the information error data and the label error data in the noise data are further determined, and the information error data, which cannot be corrected, is removed, so that the finally obtained target noise data is free of such data. Because the target noise data has no interference from information error data, the accuracy of the enterprise classification model is improved when the model is trained with the target noise data. The prior art removes noise data outright; however, data such as label error data is generally large in volume, and removing it greatly reduces the number of samples available for subsequently training the enterprise classification model. In this embodiment, the label error data is updated rather than removed directly, so the updated label error data can still be used for training the enterprise classification model, which guarantees the number of samples needed when the enterprise classification model is later trained with the target noise data.
In this embodiment, by further determining the information error data and the label error data within the screened noise data, the noise data is analyzed at a deeper level, and the removed information error data is genuine noise. The label error data is not genuine noise: after being updated it can be reused, realizing reasonable utilization of the data, so that a large number of new samples need not be collected when the enterprise classification model is later trained, thereby saving resources and training cost.
Optionally, determining the information error data and the tag error data in the noise data includes: acquiring enterprise related information of each data in the noise data; according to the related information of each enterprise, calculating text information entropy, text similarity and/or vector distance of each data; determining the confidence coefficient of each data according to the text information entropy, the text similarity and the vector distance; determining the data with the confidence coefficient smaller than a preset threshold value as information error data; and determining the data with the confidence coefficient larger than or equal to a preset threshold value as the label error data.
Optionally, removing the information error data and updating the tag error data to obtain target noise data, including: removing information error data from noise data; determining label updatable data and label non-updatable data in the label error data according to a preset updating strategy; updating the label of the label updatable data to obtain updated data; and determining target noise data according to the updating data and the label non-updating data.
Optionally, before acquiring noise data screened out from the plurality of enterprise data, the method further comprises: and screening noise data from the enterprise data by using confidence learning and a preset screening strategy.
Optionally, each enterprise data has an original tag, and filtering noise data from the plurality of enterprise data by using confidence learning and a preset filtering strategy includes: processing a plurality of enterprise data by using the constructed classification model to obtain the prediction probability of each enterprise data; predicting the real label of each enterprise data according to the prediction probability of each enterprise data; estimating joint probability distribution of the original tag and the real tag according to the original tag and the real tag of each enterprise data; and screening enterprise data conforming to a screening strategy from the plurality of enterprise data based on the joint probability distribution to obtain noise data, wherein the screening strategy comprises a noise rate screening strategy and/or an industry category screening strategy.
Optionally, after removing the information error data and updating the label error data to obtain the target noise data, the method further includes: performing M rounds of training on basic models by using the target noise data to obtain an enterprise classification model set, where M is a positive integer, the enterprise classification model set includes M enterprise classification models, and the basic model adopted in each round of training is different.
Optionally, performing M rounds of training on basic models by using the target noise data to obtain the enterprise classification model set includes: determining a training sample set corresponding to the i-th round of training, where i is a positive integer that increases sequentially, i ≤ M, and the training sample set adopted in each round of training is different; determining a basic model corresponding to the i-th round of training; training the basic model corresponding to the i-th round according to the training sample set corresponding to the i-th round to obtain an enterprise classification model corresponding to the i-th round; and forming the enterprise classification model set from the enterprise classification models obtained in all rounds of training.
Optionally, determining the training sample set corresponding to the i-th round of training includes: acquiring non-noise data in the plurality of enterprise data when i = 1, and forming a training sample set according to the non-noise data and the target noise data; when i ≠ 1, determining non-noise data corresponding to the i-th round of training, determining target noise data corresponding to the i-th round of training, and forming the training sample set corresponding to the i-th round of training according to the non-noise data corresponding to the i-th round of training and the target noise data corresponding to the i-th round of training.
Optionally, determining the target noise data corresponding to the i-th round of training includes: determining noise data in the training sample set corresponding to the (i-1)-th round of training; determining information error data and label error data in the noise data corresponding to the (i-1)-th round of training; and removing the information error data in the noise data corresponding to the i-th round of training, and updating the label error data in the noise data corresponding to the i-th round of training, to obtain the target noise data corresponding to the i-th round of training.
Optionally, the method further comprises: in the i-th round of training, calculating the joint distribution probability of the training sample set used in the i-th round; and adjusting the weight of each piece of data in that training sample set according to the joint distribution probability of the i-th round, where the adjusted weights are used to weight the loss function in the (i+1)-th round of training.
Optionally, after performing M-round training on the basic model by using the target noise data to obtain the enterprise classification model set, the method further includes: acquiring data to be classified of an enterprise; and inputting the data to be classified into an enterprise classification model set for processing to obtain an enterprise industry classification result.
Optionally, inputting the data to be classified into the enterprise classification model set for processing to obtain the enterprise industry classification result includes: predicting the data to be classified through each enterprise classification model in the enterprise classification model set to obtain a plurality of prediction results; obtaining a model weight corresponding to each enterprise classification model; and determining the enterprise's industry classification result according to the plurality of prediction results and the weight of each model.
A second aspect of an embodiment of the present application provides an apparatus for processing noise data, including:
The acquisition unit is used for acquiring noise data screened from the enterprise data, wherein the noise data is provided with a label, and the label is used for identifying the industry category of the enterprise;
the determining unit is used for determining information error data and label error data in the noise data, wherein the information error data is data with confidence coefficient smaller than a preset threshold value, and the label error data is data except the information error data in the noise data;
The processing unit is used for eliminating information error data and updating label error data aiming at the noise data to obtain target noise data, wherein the target noise data is used for training an enterprise classification model.
A third aspect of an embodiment of the application provides an apparatus for processing noise data, comprising a memory, a processor and a computer program stored in the memory and executable on the processor, characterized in that the processor implements the steps of the method for processing noise data as described in the first aspect above when executing the computer program.
A fourth aspect of the embodiments of the present application provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements the steps of the method of processing noise data as described in the first aspect above.
A fifth aspect of an embodiment of the application provides a computer program product for causing a device for processing noise data to carry out the steps of the method for processing noise data as described in the first aspect above when the computer program product is run on the device.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings that are needed in the embodiments or the description of the prior art will be briefly described below, it being obvious that the drawings in the following description are only some embodiments of the present application, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a schematic flow chart of a method of processing noise data provided by an exemplary embodiment of the application;
Fig. 2 is a specific flowchart of step S102 of a method of processing noise data according to another exemplary embodiment of the present application;
fig. 3 is a specific flowchart of step S103 of a method of processing noise data according to still another exemplary embodiment of the present application;
FIG. 4 is a specific flow chart of a method of processing noise data according to yet another exemplary embodiment of the present application;
FIG. 5 is a particular flow chart of a method of processing noise data according to yet another exemplary embodiment of the application;
FIG. 6 is a schematic diagram of an apparatus for processing noise data according to an embodiment of the present application;
Fig. 7 is a schematic diagram of an apparatus for processing noise data according to another embodiment of the present application.
Detailed Description
The present application will be described in further detail with reference to the drawings and examples, in order to make the objects, technical solutions and advantages of the present application more apparent. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the application.
Accurately identifying the industry category of an enterprise is an important part of business analysis and investment decision-making. In essence, industry category identification classifies data related to an enterprise and determines the enterprise's industry category from the classification result.
At present, enterprise-related data is generally classified by a trained classification model to obtain the enterprise's industry category. When training the classification model, the quality of the training data is critical to the model's learning effect.
A high-performance classification model relies on a large amount of high-quality labeled training data, and the quality of that data depends heavily on manual labeling. The higher the required labeling quality, the greater the labeling difficulty. As a result, large datasets typically contain a large amount of noise data, such as samples with label errors.
In the prior art, noise data in training data is typically screened out through confidence learning and removed from the training data. However, the noise data screened through confidence learning is inaccurate: data that is not actually noise is judged to be noise, and removing all of it affects both the quantity and the quality of the training data, ultimately making the trained classification model inaccurate.
In view of this, an embodiment of the present application provides a method for processing noise data: acquiring noise data screened from a plurality of pieces of enterprise data; determining information error data and label error data in the noise data; and removing the information error data and updating the label error data to obtain target noise data, where the target noise data is used for training an enterprise classification model. In this embodiment, on the basis of the screened noise data, the information error data and the label error data in the noise data are further determined, and the information error data, which cannot be corrected, is removed, so that the finally obtained target noise data is free of such data. Because the target noise data has no interference from information error data, the accuracy of the enterprise classification model is improved when the model is trained with the target noise data. The prior art removes noise data outright; however, data such as label error data is generally large in volume, and removing it greatly reduces the number of samples available for subsequently training the enterprise classification model. In this embodiment, the label error data is updated rather than removed directly, so the updated label error data can still be used for training the enterprise classification model, which guarantees the number of samples needed when the enterprise classification model is later trained with the target noise data.
In this embodiment, by further determining the information error data and the label error data within the screened noise data, the noise data is analyzed at a deeper level, and the removed information error data is genuine noise. The label error data is not genuine noise: after being updated it can be reused, realizing reasonable utilization of the data, so that a large number of new samples need not be collected when the enterprise classification model is later trained, thereby saving resources and training cost.
Referring to fig. 1, fig. 1 is a schematic flowchart of a method for processing noise data according to an exemplary embodiment of the application. The execution body of the method provided by the application is a device for processing noise data, including but not limited to a vehicle-mounted computer, a tablet computer, a desktop computer, a smart wearable device, a personal digital assistant (PDA), and the like, and may also include various types of servers. For example, the server may be a stand-alone server, or may be a cloud server that provides basic cloud computing services such as cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communications, middleware services, domain name services, security services, content delivery networks (Content Delivery Network, CDN), big data, and artificial intelligence platforms.
The method of processing noise data shown in fig. 1 may include steps S101 to S103, which are specifically as follows:
S101: noise data screened from a plurality of enterprise data is obtained.
By way of example, enterprise data refers to data related to an enterprise, i.e., various data used to describe the enterprise. For example, the enterprise data of one enterprise may include its name, business scope, main products, keywords for describing the enterprise, enterprise profile, enterprise scale, product information, research results, and the like.
Each piece of enterprise data may be noise data or non-noise data. Whether noise or not, each piece carries a label (the enterprise data may also be understood as labeled) that identifies the industry category of the enterprise (e.g., forestry, automotive, electronic information, etc.). Noise data is data whose label is wrong and/or whose enterprise description is wrong; non-noise data is data whose label is correct and whose enterprise description is correct.
Specifically, a label error means that the industry category indicated by the label is inconsistent with the industry category to which the enterprise actually belongs, and a correct label means that the two are consistent. An enterprise description error means that the description of the enterprise's business scope, main products, enterprise profile, enterprise scale, product information, research results, and the like is wrong, and a correct enterprise description means that such description is correct.
Illustratively, a plurality of pieces of enterprise data are acquired in advance, where the plurality of pieces of enterprise data are the enterprise data of at least two different enterprises. For example, a pre-stored database table in which the enterprise data of different enterprises is stored may be obtained from a database. Data such as each enterprise's name, business scope, main products, and keywords for describing the enterprise may also be extracted from the database table, and the extracted data is used as the enterprise data of that enterprise.
Optionally, an enterprise main-product table may also be obtained. Using the enterprise name as the association key, the data in the database table and the enterprise main-product table are cross-merged, and each enterprise's name, business scope, main products, keywords for describing the enterprise, and the like are extracted from the cross-merged result as the enterprise data of that enterprise. Because the enterprise data is acquired by cross-merging two sources, the finally acquired enterprise data is more comprehensive and accurate.
For example, confidence learning may be used to screen noise data from the acquired plurality of enterprise data; manual screening may also be adopted. Confident learning (CL) is an emerging, principled framework, applicable to machine learning and deep learning, for characterizing label noise and learning with noisy labels.
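The confident-learning screening mentioned above estimates a joint distribution between the original labels and the labels implied by the model's predictions, and treats off-diagonal mass as candidate noise. The sketch below follows the general confident-learning recipe under stated assumptions: per-class thresholds are the mean predicted probability of each class over examples originally labeled with it, and all function and variable names are illustrative rather than the patent's.

```python
import numpy as np

def estimate_joint(pred_probs, given_labels):
    """Estimate the joint distribution Q[given, implied-true] between the
    original labels and confidently predicted labels (simplified sketch)."""
    n, k = pred_probs.shape
    # Per-class confidence threshold: mean predicted probability of class j
    # over the examples originally labeled j.
    thresholds = np.array([
        pred_probs[given_labels == j, j].mean() if np.any(given_labels == j) else 1.0
        for j in range(k)
    ])
    counts = np.zeros((k, k))
    for probs, y in zip(pred_probs, given_labels):
        # Candidate "true" classes: those whose probability clears the threshold.
        candidates = np.where(probs >= thresholds)[0]
        if len(candidates):
            y_star = candidates[np.argmax(probs[candidates])]
            counts[y, y_star] += 1
    total = counts.sum()
    return counts / total if total else counts  # normalized joint distribution
```

Off-diagonal entries of the returned matrix correspond to examples whose confidently implied class differs from their given label, i.e., the noise candidates the patent screens out.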
Alternatively, the amount of enterprise data to acquire may be determined based on the number of industry categories. The method for processing noise data provided by the application is applicable to enterprise data at different levels of industry categories. For example, industry categories are generally divided into levels, such as primary, secondary, tertiary, and quaternary classifications; the lower the level, the finer the categories, with the primary level being the highest and the quaternary level the lowest.
For example, there are generally more than 400 industry categories at the tertiary level, and the proportions of the categories differ greatly. To avoid imbalance among the industry categories of the selected enterprise data, or an excessive number of categories, either of which would hinder subsequent processing, this embodiment selects the categories with the largest counts in the database table as head industries and groups the remaining categories as non-head industries, referred to here as other industries. K samples (each piece of enterprise data is one sample) are randomly extracted from the head industries and the other industries among the enterprises shared by the database table and the enterprise main-product table, yielding enterprise data for a plurality of industry categories. These enterprise data constitute a balanced sample dataset.
For example, K pieces of enterprise data are randomly extracted for each of 100 head industries and for 1 other-industry category among the enterprises shared by the database table and the enterprise main-product table. That is, K pieces of enterprise data are extracted under each of the 100 head-industry categories and K pieces under the single other-industry category, yielding a large amount of enterprise data under 101 categories (K pieces per category). These enterprise data constitute a balanced sample dataset.
Optionally, in order to improve the speed and accuracy of noise-data screening, the acquired enterprise data may be preprocessed before screening. The preprocessing may include any one or any combination of invalid-information cleaning, word segmentation, stop-word filtering, word-vector mapping, and the like. For example, the word-vector mapping may use externally trained embedding vectors (e.g., 300-dimensional), which may result in 1200-dimensional sample features (e.g., 300 dimensions for each of the four text fields).
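Under the assumption that the four text fields (enterprise name, business scope, main products, keywords) are each mapped to an averaged 300-dimensional word vector and then concatenated, the 1200-dimensional sample features could be assembled as follows. The embedding source and the mean-pooling choice are assumptions; the patent only states the dimensions.

```python
import numpy as np

DIM = 300  # per-field embedding dimension assumed from the text

def field_vector(tokens, embeddings):
    """Average the word vectors of one field's tokens (zeros if none are known)."""
    vecs = [embeddings[t] for t in tokens if t in embeddings]
    return np.mean(vecs, axis=0) if vecs else np.zeros(DIM)

def sample_features(name, scope, product, keywords, embeddings):
    """Concatenate the four per-field vectors: 4 x 300 = 1200-dimensional features."""
    fields = [name, scope, product, keywords]
    return np.concatenate([field_vector(f, embeddings) for f in fields])
```

Each field contributes one fixed-length block, so a missing or fully unknown field degrades gracefully to a zero block instead of changing the feature length.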
S102: information error data and tag error data in the noise data are determined.
Information error data refers to enterprise data whose description of the enterprise is wrong: for example, the business-scope content is too broad, the descriptions of the same enterprise's business content differ greatly between the two tables (the database table and the enterprise main-product table), or the business-scope content is unrelated to the main products. Such data causes classification errors and cannot be corrected in subsequent classification, so it is called information error data.
In one embodiment, to enable the device for processing noise data to accurately identify information error data within the noise data, data whose confidence is smaller than a preset threshold is determined to be information error data. Confidence here refers to reliability and represents the degree to which the data can be trusted.
For example, the confidence of each piece of the noise data is calculated, each confidence is compared with the preset threshold, and the data whose confidence is smaller than the threshold is determined to be information error data. The remaining data in the noise data, other than the information error data, is determined to be label error data.
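The confidence-threshold partition described above can be sketched as follows; the threshold value and the list-based data representation are illustrative.

```python
def split_noise(noise_data, confidences, threshold=0.5):
    """Partition screened noise data by confidence: below-threshold items are
    treated as information error data (to be removed), the rest as label
    error data (to be relabeled). The threshold value is illustrative.
    """
    info_error, label_error = [], []
    for item, conf in zip(noise_data, confidences):
        (info_error if conf < threshold else label_error).append(item)
    return info_error, label_error
```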
S103: and eliminating the information error data and updating the label error data to obtain target noise data.
Information error data does not help enterprise industry classification and affects the accuracy of subsequent classification; because its description is wrong, it cannot be quickly corrected, so it needs to be removed. Illustratively, each piece of information error data may be removed as soon as it is identified; alternatively, each identified piece may first be marked, and all marked pieces removed together at the end.
Label error data, as the name implies, is data whose label is wrong. The prior art removes noise data outright; however, data such as label error data is generally large in volume, and removing it greatly reduces the number of samples for subsequently training the enterprise classification model. Since such data merely has a wrong label, this embodiment updates it instead. Updating can be understood as correcting the label of the label error data, i.e., replacing the wrong label with the correct one.
After the information error data in the noise data is removed and the label error data in the noise data is updated, the resulting data is called target noise data. The target noise data can then be used to train the enterprise classification model.
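Putting S103 together, a minimal sketch of producing target noise data might look like the following. The record fields (`confidence`, `label`, `corrected_label`) are illustrative stand-ins: the patent leaves the update strategy unspecified, and label error data that the strategy cannot fix is kept unchanged, consistent with the optional claim that target noise data is formed from both updated data and label-non-updatable data.

```python
def build_target_noise(noise_data, threshold=0.5):
    """Produce target noise data from screened noise data (sketch of S103).

    Each record is a dict with 'confidence', 'label' and, when the update
    strategy can fix it, a 'corrected_label'. Field names are illustrative.
    """
    target = []
    for rec in noise_data:
        if rec["confidence"] < threshold:
            continue                        # information error data: discard
        if "corrected_label" in rec:        # label-updatable: rewrite the label
            rec = {**rec, "label": rec["corrected_label"]}
        target.append(rec)                  # non-updatable label data is kept as-is
    return target
```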
In this embodiment, on the basis of the noise data thus screened, information error data and tag error data in the noise data are further determined, and the information error data which cannot be corrected is removed, so that the target noise data finally obtained is free of such data. Because the finally obtained target noise data has no interference of information error data, the accuracy of the enterprise classification model is improved when the enterprise classification model is trained by utilizing the target noise data. In the prior art, noise data can be removed, but the data volume of data such as general label error data is large, and the removal of the data can greatly influence the sample number of a subsequent training enterprise classification model. In the embodiment, the label error data is not directly removed but updated, so that the updated label error data can be used for training the enterprise classification model, and the number of samples required for training the enterprise classification model by using the target noise data later is ensured.
In this embodiment, on the basis of the screened noise data, the information error data and the label error data in the noise data are further determined, so that the noise data is analyzed at a deeper level and the rejected information error data is real noise data. The label error data is not true noise data and can be reused after being updated, realizing reasonable utilization of the data; a large number of new samples need not be acquired when the enterprise classification model is later trained, thereby saving resources and training cost.
Referring to fig. 2, fig. 2 is a specific flowchart of step S102 of a method for processing noise data according to another exemplary embodiment of the present application, where S102 may include S1021 to S1025.
S1021: enterprise-related information for each of the noise data is obtained.
The enterprise-related information may include the enterprise name, business scope, main products, keywords used to describe the enterprise, and the like. For example, the enterprise name, business scope, main products, keywords used to describe the enterprise, and similar information may be extracted from each data (i.e., enterprise data).
S1022: and calculating text information entropy, text similarity and/or vector distance of each data according to each enterprise related information.
Illustratively, text information entropy (entropy) refers to the entropy of the word set obtained after segmenting the four variables: the enterprise name, the business scope, the main products, and the keywords used to describe the enterprise. The larger the text information entropy, the more content the enterprise-related information carries; the smaller the text information entropy, the less content it carries. The text information entropy can be calculated by the following expression (1):

$\mathrm{entropy}^{*} = \mathrm{scale}\Big(-\sum_{i} p_i \log p_i\Big)$    (1)

In the above formula (1), entropy* represents the text information entropy, scale() is a mathematical function for data normalization, and p_i represents the frequency of occurrence of the i-th word in the enterprise-related information.
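As a non-authoritative sketch (the function and variable names are illustrative, not from the original), the text information entropy of a segmented word set can be computed as follows:

```python
import math
from collections import Counter

def text_entropy(tokens):
    """Shannon entropy -sum(p_i * log p_i) of a segmented word list,
    where p_i is the frequency of the i-th distinct word."""
    counts = Counter(tokens)
    total = len(tokens)
    return -sum((c / total) * math.log(c / total) for c in counts.values())
```

A repetitive word set yields entropy 0, while more varied enterprise-related text yields a larger value; the scale() normalization step described above would then map the raw entropies into a common range.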
Illustratively, text similarity (jaccard) refers to the similarity between two variables, the business scope and keywords used to describe the business. After the two variables of the operation range and the keywords for describing the enterprise are subjected to word segmentation, the value obtained by dividing the intersection of the two word sets by the union is the text similarity. The larger the text similarity is, the more similar the operation scope is represented with the keywords used for describing the enterprise; the smaller the text similarity, the less similar the scope of business and keywords used to describe the business. The text similarity can be calculated by the following expression (2).
$\mathrm{jaccard}^{*} = \mathrm{scale}\left(\dfrac{\mathrm{intersection}(X, Y)}{\mathrm{union}(X, Y)}\right)$    (2)

In the above formula (2), jaccard* represents the text similarity, scale() is a mathematical function for data normalization, X and Y respectively represent the two word sets, intersection(X, Y) represents their intersection, and union(X, Y) represents their union.
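A minimal sketch of the text similarity computation (illustrative names; the scale() normalization over all data is omitted here):

```python
def jaccard_similarity(x_tokens, y_tokens):
    """Intersection over union of two segmented word sets, e.g. the
    business scope and the keywords used to describe the enterprise."""
    X, Y = set(x_tokens), set(y_tokens)
    union = X | Y
    if not union:
        return 0.0
    return len(X & Y) / len(union)
```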
Illustratively, the vector distance (dist) is also used to represent the similarity between the two variables, the business scope and the keywords used to describe the enterprise. The larger the vector distance, the less similar the business scope and the keywords used to describe the enterprise; the smaller the vector distance, the more similar they are. The vector distance can be obtained by calculating the Euclidean distance between the word vectors of the two variables. Specifically, the vector distance can be calculated by the following expression (3):

$\mathrm{dist}^{*} = \mathrm{scale}\left(\sqrt{\sum_{i}(x_i - y_i)^2}\right)$    (3)

In the above expression (3), dist* denotes the vector distance, scale() is a mathematical function for data normalization, and x_i and y_i refer to the i-th components of the two text word vectors respectively.
S1023: and determining the confidence degree of each data according to the text information entropy, the text similarity and/or the vector distance.
By way of example, the confidence level of each data may be determined based on any one or any combination of text information entropy, text similarity, and vector distance. For example, the value of any one of the text information entropy, the text similarity, and the vector distance may be used as the confidence level of the data, or an average value of the text information entropy, the text similarity, and the vector distance may be calculated, and the confidence level of the data may be expressed by the average value.
Alternatively, in one possible implementation, the confidence level of each data may be determined using text information entropy, text similarity, and vector distance. Illustratively, when the confidence of each data is determined by employing three of text information entropy, text similarity, and vector distance together, this can be achieved by the following expression (4).
In the above expression (4), sample_confident represents the confidence of the data, entropy* represents the text information entropy, jaccard* represents the text similarity, and dist* represents the vector distance.
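The exact form of expression (4) is not reproduced here; as one hedged illustration of the averaging variant mentioned above, the three normalized signals could be combined as follows (the min-max scale() and the equal weighting are assumptions):

```python
def scale(values):
    """Min-max normalization to [0, 1]; constant input maps to 0."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

def confidence(entropies, jaccards, dists):
    """Average of normalized entropy, similarity, and inverted distance.
    A larger distance means less similar, so it is subtracted from 1."""
    e, j, d = scale(entropies), scale(jaccards), scale(dists)
    return [(ei + ji + (1.0 - di)) / 3.0 for ei, ji, di in zip(e, j, d)]
```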
S1024: and determining the data with the confidence coefficient smaller than the preset threshold value as information error data.
After the confidence of each data is calculated, it is compared with a preset threshold. The preset threshold is set by the user according to actual conditions and is not limited here. Whether the data is determined to be information error data or label error data depends on the comparison result.
When the comparison result is that the confidence is smaller than the preset threshold, step S1024 is executed, i.e., the data whose confidence is smaller than the preset threshold is determined to be information error data. In this embodiment, S1024 and S1025 are parallel steps: S1024 or S1025 is selectively executed according to the comparison result, rather than S1025 following S1024.
S1025: and determining the data with the confidence coefficient larger than or equal to a preset threshold value as the label error data.
When the comparison result is that the confidence coefficient is greater than or equal to the preset threshold value, step S1025 is performed, namely, determining the data with the confidence coefficient greater than or equal to the preset threshold value as the label error data.
Alternatively, in a possible implementation, after calculating the confidence of each data, the data are sorted by confidence, and a number of data with low confidence (e.g., 1% of the total data) are selected as information error data, with the remaining data treated as label error data.
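A minimal sketch of both splitting variants, the threshold comparison of S1024/S1025 and the lowest-fraction selection (names are illustrative):

```python
def split_by_threshold(confidences, threshold):
    """Below threshold -> information error data, else label error data."""
    info_error = [i for i, c in enumerate(confidences) if c < threshold]
    label_error = [i for i, c in enumerate(confidences) if c >= threshold]
    return info_error, label_error

def split_by_fraction(confidences, fraction=0.01):
    """Alternative: take the lowest-confidence fraction as information
    error data and treat the remainder as label error data."""
    order = sorted(range(len(confidences)), key=lambda i: confidences[i])
    k = max(1, int(len(confidences) * fraction))
    return order[:k], order[k:]
```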
In this embodiment, the confidence of each data is determined based on the text information entropy, text similarity and/or vector distance of that data; the confidence measures the credibility of the enterprise-related information, and the data are divided into information error data and label error data according to the comparison of the confidence with the preset threshold. The information error data and the label error data are thus accurately divided, providing a guarantee for subsequently rejecting the information error data and updating the label error data: the information error data removed later is guaranteed to be real noise data, while the label error data can be reasonably utilized later.
Referring to fig. 3, fig. 3 is a specific flowchart of step S103 of a method for processing noise data according to another exemplary embodiment of the present application, where S103 may include S1031 to S1034.
S1031: and eliminating information error data from the noise data.
Illustratively, when determining information error data in noise data, determining one information error data eliminates one information error data; or when the information error data in the noise data is determined, the determined information error data is marked, and finally, the information error data is removed uniformly.
S1032: and determining the label updatable data and the label non-updatable data in the label error data according to a preset updating strategy.
Label updatable data refers to data whose label can be updated, and label non-updatable data refers to data whose label cannot be updated or does not need to be updated. For example, among all the label error data, some data may have a wrong label that is nevertheless similar to the correct label; to improve data processing efficiency and reduce the processing amount, the labels of such label error data may be left unchanged.
Optionally, in a possible implementation manner, the steps corresponding to S101 to S103 provided by the present application may be executed for one round, and the target noise data obtained after the execution of the round is used to train the enterprise classification model.
Alternatively, in another possible implementation manner, the steps corresponding to S101 to S103 provided in the present application may also be executed for multiple rounds. For example, the target noise data is obtained after one round of processing, in order to increase accuracy, the noise data in the target noise data is screened again, information error data and label error data in the noise data are determined again, the information error data at the moment is removed, and the label error data at the moment is updated, so that new target noise data is obtained. It should be noted that, in this implementation of performing multiple rounds, since the tags in some of the tag error data have been updated during the previous round of processing, the batch of data is determined to be tag non-updatable data, i.e., the tags of the batch of data are not updated.
When the data in the label error data meets a preset updating strategy, determining the data as label updatable data; when the data in the label error data does not meet the preset updating strategy, the data is determined to be label non-updatable data.
The preset updating policy may be that the data is noise data, the data is label error data, the label of the data has not been updated before, the softmax probability corresponding to the original label of the data is smaller than α, and the maximum softmax probability (the probability of the predicted industry category) is greater than β.
The preset updating strategy may also be that the label of the data has not been updated before, the softmax probability corresponding to the original label of the data is smaller than α, and the maximum softmax probability (the probability of the predicted industry category) is greater than β. The softmax probability is the probability obtained by classifying the data in the label error data with a pre-constructed classification model. α and β are formulated according to the rules in the table below.
TABLE 1
Prob_mean in Table 1 represents the predicted probability of the industry category evaluated over all enterprise data, avg represents the average of that predicted probability, and border represents its 0.8 quantile.
S1033: updating the tag can update the tag of the data to obtain the updated data.
S1034: and determining target noise data according to the updating data and the label non-updating data.
The data in the label error data is classified according to a pre-constructed classification model, so that probability corresponding to each data is obtained. The probability is used to represent the probability that the data belongs to each industry category. And updating the label of the label updatable data to the label with the highest probability to obtain updated data.
The update data and the tag non-updatable data together comprise target noise data that may be used to train the enterprise classification model.
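Under the stated strategy, a hedged sketch of the label update step (α, β, and the layout of the probability matrix are assumptions; examples not meeting the conditions are label non-updatable and kept unchanged):

```python
def update_labels(probs, original_labels, alpha, beta):
    """For each label-error example: if the softmax probability of its
    original label is below alpha and the maximum softmax probability
    exceeds beta, relabel it with the highest-probability category."""
    updated = list(original_labels)
    for i, row in enumerate(probs):
        best = max(range(len(row)), key=lambda j: row[j])
        if row[original_labels[i]] < alpha and row[best] > beta:
            updated[i] = best
    return updated
```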
In this embodiment, the labels of data satisfying the update policy are updated, and the labels of data not satisfying it are not. On the one hand, since not all labels are updated, processing efficiency is improved and resources are saved. On the other hand, updating the labels of the label updatable data ensures that the resulting target noise data is accurate, so that a highly accurate enterprise classification model can be trained from it. The correction of the data also has a positive influence on data mining work for enterprise-related labels, effectively avoiding the poor impression left on customers when obviously wrong information is displayed in products, and improving the image and credibility of the enterprise.
Referring to fig. 4, fig. 4 is a specific flowchart illustrating a method for processing noise data according to still another exemplary embodiment of the present application, the method for processing noise data shown in fig. 4 may include: s201 to S204. S202 to S204 in this embodiment are the same as S101 to S103 in the embodiment corresponding to fig. 1, and are not described in detail in this embodiment, and S201 is specifically as follows:
S201: and screening noise data from the enterprise data by using confidence learning and a preset screening strategy.
For example, a classification model may be pre-constructed and used to process the enterprise data to obtain a prediction probability for each enterprise data; the prediction probability may be used to predict the actual label of each enterprise data. It is understood that the prediction probability is used to predict the probability that the original label of the enterprise data (the label originally carried by the enterprise data) belongs to the true label (the true label corresponding to the enterprise data). Word vector conversion is performed on four variables in the enterprise data, namely the enterprise name, business scope, main products and keywords used to describe the enterprise, for example obtaining a 1200-dimensional vector, which is taken as the input of the classification model. The output of the classification model is a one-hot code of the industry category serial number, such as a 101-dimensional vector.
In constructing the classification model, neurons of the same dimension as the input vector may be set for the input layer, such as 1200 neurons. The input layer is sequentially connected with two fully connected layers (dense_layer), a batch normalization layer (BN_layer) and a Dropout layer (dropout_layer), and then with neurons of the same dimension as the output vector, such as 101 neurons. A fully connected layer with a softmax activation function serves as the output layer. The loss function of the classification model may be the following weighted cross-entropy loss function:

$\mathrm{Loss} = -\sum_{i} \mathrm{weight}_i \sum_{j} y^{*}_{ij} \log \hat{y}_{ij}$    (5)

In the above expression (5), Loss represents the loss function of the classification model, weight_i represents the weight of the i-th input data (initial denotes its value at initialization), $y^{*}_{ij}$ represents the true probability (e.g., 0 or 1) of the i-th input data under the j-th industry category, and $\hat{y}_{ij}$ represents the predicted probability of the i-th input data under the j-th industry category.
After the classification model is built, it can be trained by a K-fold cross-validation method. Specifically, the plurality of enterprise data is divided evenly into K parts, K-1 parts are used as the training data set, the remaining part is tested with the classification model, and this is repeated for K rounds. It should be noted that the enterprise data tested in each round is different, so that after repeating K rounds, a prediction probability corresponding to each enterprise data is obtained; this prediction probability may also be called the out-of-sample prediction probability.
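The K-fold procedure can be sketched as follows (the model is injected as a callback; all names are illustrative assumptions):

```python
import random

def out_of_fold_probs(X, y, fit_predict, k=5, seed=0):
    """Split the data into K folds; for each fold, train on the other
    K-1 folds and predict probabilities for the held-out fold, so every
    example receives an out-of-sample prediction probability.
    fit_predict(train_X, train_y, test_X) -> list of probability rows."""
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)
    folds = [idx[f::k] for f in range(k)]
    probs = [None] * len(X)
    for f in range(k):
        test = folds[f]
        train = [i for g in range(k) if g != f for i in folds[g]]
        rows = fit_predict([X[i] for i in train], [y[i] for i in train],
                           [X[i] for i in test])
        for i, row in zip(test, rows):
            probs[i] = row
    return probs
```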
In this embodiment, confidence learning refers to a process of predicting a true label of each enterprise data by a prediction probability of each enterprise data, and estimating a joint probability distribution of the original label and the true label from the original label and the true label of each enterprise data.
In the embodiment, on the basis of filtering noise data by using confidence learning, a filtering strategy is combined, so that the finally determined noise data is more accurate, and the subsequently determined target noise data is more accurate.
Optionally, in one possible implementation, S201 may include S2011 to S2014.
S2011: and processing the plurality of enterprise data by using the constructed classification model to obtain the prediction probability of each enterprise data.
For example, a plurality of enterprise data may be divided into K parts on average, K-1 parts are used as training data sets, and the remaining 1 parts are tested by using the constructed classification model, so that K rounds are repeatedly performed, and a prediction probability corresponding to each enterprise data may be obtained.
S2012: the true tags for each enterprise data are predicted based on the prediction probabilities for each enterprise data.
The label originally carried by the enterprise data is called the original label; the real label is the true label corresponding to the enterprise data itself, and can be understood as identifying the industry category to which the enterprise data truly belongs.
Illustratively, the dimension with the highest probability, where that probability is also greater than the industry-category threshold, may be taken as the estimate of the real label, specifically:

$t_j = \dfrac{1}{|\{i : \tilde{y}_i = j\}|} \sum_{i : \tilde{y}_i = j} P_{ij}, \qquad \hat{y}_i = \arg\max_{j : P_{ij} \ge t_j} P_{ij}$    (6)

In the above formula (6), $t_j$ represents the confidence threshold of the j-th industry category, i.e., the mean value of the prediction probabilities over the enterprise data whose original label $\tilde{y}_i$ is j, and $P_{ij}$ represents the predicted probability of the i-th input data under the j-th industry category.
S2013: based on the original tag and the real tag of each enterprise data, a joint probability distribution of the original tag and the real tag is estimated.
Illustratively, the counting matrix is calculated according to the original label and the predicted real label of each enterprise data, which can be realized by the following formula:

$C_{\tilde{y}=k,\, y^{*}=l} = \big|\{\, i : \tilde{y}_i = k,\ \hat{y}_i = l \,\}\big|$    (7)

In the above formula (7), C represents the count matrix, and $C_{\tilde{y}=k,\, y^{*}=l} = c$ means that there are c enterprise data whose original label is k and whose predicted label is l.
The counting matrix is standardized to obtain a standardized counting matrix $\tilde{C}$, which makes the sum of the counts equal to the total amount of enterprise data. This may be achieved by:

$\tilde{C}_{\tilde{y}=k,\, y^{*}=l} = \dfrac{C_{\tilde{y}=k,\, y^{*}=l}}{\sum_{l'} C_{\tilde{y}=k,\, y^{*}=l'}} \cdot \big|\{\, i : \tilde{y}_i = k \,\}\big|$    (8)

In the above formula (8), $\tilde{C}$ represents the normalized count matrix, $C_{\tilde{y}=k,\, y^{*}=l} = c$ means that there are c enterprise data whose original label is k and whose predicted label is l, and $|\{ i : \tilde{y}_i = k \}|$ represents the number of enterprise data whose original label is k.
Estimating the joint probability distribution of the original label and the real label can be realized by the following formula:

$Q_{\tilde{y}=k,\, y^{*}=l} = \dfrac{\tilde{C}_{\tilde{y}=k,\, y^{*}=l}}{\sum_{k',\, l'} \tilde{C}_{\tilde{y}=k',\, y^{*}=l'}}$    (9)

In the above expression (9), Q represents the joint probability distribution of the original label and the real label, and $Q_{\tilde{y}=k,\, y^{*}=l} = q$ means that the joint probability of the original label being k and the predicted label being l is q.
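Steps (6) through (9) together form the confidence-learning estimate; a hedged plain-Python sketch (list-based, with illustrative names):

```python
def estimate_joint(probs, labels, n_classes):
    """Count matrix -> normalized counts -> joint distribution Q of
    original labels (rows) and predicted real labels (columns)."""
    # per-class threshold: mean predicted probability among data labeled j
    thresholds = []
    for j in range(n_classes):
        vals = [p[j] for p, lab in zip(probs, labels) if lab == j]
        thresholds.append(sum(vals) / len(vals) if vals else 1.0)
    C = [[0.0] * n_classes for _ in range(n_classes)]
    for p, lab in zip(probs, labels):
        candidates = [j for j in range(n_classes) if p[j] >= thresholds[j]]
        if candidates:
            pred = max(candidates, key=lambda j: p[j])
            C[lab][pred] += 1
    # row-normalize so each row sums to the number of data with that label
    class_counts = [labels.count(k) for k in range(n_classes)]
    for k in range(n_classes):
        row_sum = sum(C[k])
        if row_sum:
            C[k] = [c / row_sum * class_counts[k] for c in C[k]]
    total = sum(sum(row) for row in C)
    return [[c / total for c in row] for row in C] if total else C
```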
S2014: and screening enterprise data conforming to a screening strategy from the plurality of enterprise data based on the joint probability distribution to obtain noise data.
The screening policies may include noise rate screening policies and/or industry class screening policies.
In one embodiment, when screening according to the noise rate screening strategy, the enterprise data whose original label is k and predicted label is l (with k not equal to l) are sorted by the predicted probability of their original label, and a number of the lowest-probability examples in the sorting result, proportional to the estimated joint probability $Q_{\tilde{y}=k,\, y^{*}=l}$, are selected as noise data.
In another embodiment, when screening according to the industry category screening policy, the enterprise data whose original label is k are sorted by the predicted probability of the original label, and a number of the lowest-probability examples in the sorting result are screened out as noise data.
In yet another embodiment, two screening strategies, i.e., a noise rate screening strategy and an industry class screening strategy, are combined, and the union of noise data screened by the two screening strategies is taken as final noise data.
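A hedged sketch of the noise-rate screening and the union combination (the number removed per cell is derived from the joint distribution; all names are assumptions):

```python
def prune_by_noise_rate(probs, labels, preds, Q):
    """For each off-diagonal cell (k, l), remove the n * Q[k][l] examples
    with original label k and predicted label l that have the lowest
    self-confidence (predicted probability of their original label)."""
    n, noisy = len(labels), set()
    n_classes = len(Q)
    for k in range(n_classes):
        for l in range(n_classes):
            if k == l:
                continue
            cell = [i for i in range(n) if labels[i] == k and preds[i] == l]
            remove = int(round(Q[k][l] * n))
            cell.sort(key=lambda i: probs[i][k])  # lowest self-confidence first
            noisy.update(cell[:remove])
    return noisy

def combine_screens(noise_a, noise_b):
    """Final noise data is the union of the two screening strategies."""
    return set(noise_a) | set(noise_b)
```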
In the embodiment, the noise data is screened through a plurality of screening strategies, so that the finally determined noise data is more accurate, and the subsequently determined target noise data is more accurate.
Referring to fig. 5, fig. 5 is a specific flowchart illustrating a method for processing noise data according to another exemplary embodiment of the present application, the method for processing noise data shown in fig. 5 may include: s301 to S304. S301 to S303 in this embodiment are the same as S101 to S103 in the embodiment corresponding to fig. 1, and are not described in detail in this embodiment, and S304 is specifically as follows:
S304: and performing M-round training on the basic model by utilizing the target noise data to obtain an enterprise classification model set.
M is a positive integer set by the user according to actual conditions. As M varies, the basic model also varies. When M is 1, the basic model is the pre-constructed classification model; when M is not 1, the basic model adopted by each round of training differs: it can be understood that each round of training after the first uses the enterprise classification model obtained in the previous round as its basic model.
Since the value of M may be set by the user according to the actual situation, optionally, in one possible implementation, when M is 1, one round of training is performed on the basic model by using the target noise data to obtain an enterprise classification model. In this case the enterprise classification model is the final training result; equivalently, the enterprise classification model set contains only this one model. The enterprise classification model can be used to process the data to be classified of an enterprise, so as to obtain the industry classification result of the enterprise.
Alternatively, in another possible implementation, when M is not 1, the basic model is trained with the target noise data for multiple rounds, each round producing an enterprise classification model. After the multiple rounds of training, the resulting enterprise classification models form an enterprise classification model set, which is the final training result. The enterprise classification model set can be used to process the data to be classified of an enterprise, so as to obtain the industry classification result of the enterprise.
In this embodiment, on the basis of the screened noise data, the information error data and label error data in the noise data are further determined, and the information error data, which cannot be corrected, is removed, so that the finally obtained target noise data is free of such data. Because the target noise data has no interference from information error data, the accuracy of the enterprise classification model is improved when it is trained with the target noise data. In the prior art, noise data can simply be removed, but the amount of label error data is generally large, and removing it would greatly reduce the number of samples for subsequently training the enterprise classification model. In this embodiment, the label error data is not directly removed but updated, so that the updated label error data can also be used for training the enterprise classification model. Therefore, when the enterprise classification model is trained based on the target noise data, the number of samples required for training is ensured. Moreover, because the labels in the target noise data have been corrected and their accuracy is guaranteed, training the enterprise classification model based on the target noise data can improve its accuracy.
Optionally, in one possible implementation, S304 may include S3041 to S3044.
S3041: and determining a training sample set corresponding to the ith training.
Illustratively, the values of i are different, as are the training sample sets required for this round of training. i is a positive integer and sequentially increases, and i is less than or equal to M. For example, when i=1, a plurality of pieces of enterprise data may be acquired, and these pieces of enterprise data may be used as a training sample set corresponding to the 1 st round of training.
When i is not equal to 1, the training sample set required by the training of the present round can be determined based on the training sample set used in the previous round of training. For example, noise data is screened again in the training sample set used in the previous training round. In the training sample set used in the previous training, the rest data are non-noise data except the noise data screened again. And then, determining information error data and label error data in the noise data screened at the moment, removing the information error data at the moment and updating the label error data at the moment to obtain new target noise data. The new target noise data and non-noise data constitute the training sample set required for this round of training. In each round of training process, the method can accurately filter out real noise data, and label correction is carried out on a part of noise data, so that the high-efficiency utilization of the noise data is realized, the number of samples in a training sample set is ensured, and the enterprise classification model finally obtained by training is accurate.
S3042: and determining a basic model corresponding to the ith training.
Illustratively, the values of i are different, and so is the basic model required for this round of training. i is a positive integer that increases sequentially, and i is less than or equal to M. For example, when i=1, a pre-constructed classification model may be used as the basic model; when i is not equal to 1, the enterprise classification model obtained in the previous round of training can be used as the basic model.
S3043: and training the basic model corresponding to the ith training according to the training sample set corresponding to the ith training to obtain the enterprise classification model corresponding to the ith training.
S3044: and forming an enterprise classification model set according to the enterprise classification model obtained by training each round.
Illustratively, the basic model corresponding to each wheel training is trained using the training sample set corresponding to that wheel training. If the loss function corresponding to the basic model is not converged, continuing training; if the loss function corresponding to the basic model converges, the result shows that the training of the round meets the requirement, and the trained model is used as the enterprise classification model corresponding to the training of the round.
Or dividing the training sample set into K parts, training the basic model by K-1 parts, and repeating the K rounds of test on the rest 1 parts, so as to obtain the prediction probability corresponding to each enterprise data in the training sample set, and taking the basic model at the moment as the enterprise classification model corresponding to the training of the round. Each round of training results in an enterprise classification model, which is used to construct an enterprise classification model set.
In this embodiment, the basic model corresponding to each round of training is trained based on the training sample set required for that round, so as to obtain a plurality of enterprise classification models. Since each subsequent round of training builds on the previous round, both for the training sample set and for the basic model, training efficiency and accuracy are ensured. The training sample set contains target noise data, so the number of samples required for training is guaranteed. The labels in the target noise data have been corrected and their accuracy is guaranteed, so training the enterprise classification model based on the target noise data can improve its accuracy. Industry category correction of new enterprise data is thereby achieved, bringing more accurate data support to customer acquisition decisions.
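The M-round procedure of S3041 to S3044 can be sketched as a driver loop with injected callbacks (all function names are illustrative assumptions, not from the original):

```python
def train_rounds(X, y, m, screen_noise, clean_noise, fit_model):
    """Each round: screen noise in the current sample set, remove the
    information error data, apply label updates, then train the basic
    model (the previous round's model after round 1) and collect it.
    screen_noise(X, labels, idx) -> indices judged noisy
    clean_noise(X, labels, noisy) -> (kept_noisy, {idx: new_label})
    fit_model(X, labels, idx, prev_model) -> trained model"""
    labels = list(y)
    idx = list(range(len(X)))
    models, prev = [], None
    for _ in range(m):
        noisy = screen_noise(X, labels, idx)
        kept, relabels = clean_noise(X, labels, noisy)
        labels = [relabels.get(i, lab) for i, lab in enumerate(labels)]
        removed = set(noisy) - set(kept)  # information error data
        idx = [i for i in idx if i not in removed]
        prev = fit_model(X, labels, idx, prev)
        models.append(prev)
    return models
```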
Alternatively, in one possible implementation, the above S3041 may include S30411 to S30415.
S30411: when i=1, non-noise data among the plurality of enterprise data is acquired.
S30412: a training sample set is constructed from the non-noise data and the target noise data.
Illustratively, when i=1, in screening noise data among the plurality of enterprise data, the remaining enterprise data is determined as non-noise data in addition to the noise data. And taking the non-noise data and the target noise data as training sample sets corresponding to the training of the 1 st round.
S30411: when i is not equal to 1, non-noise data corresponding to the ith training is determined.
S30414: and determining target noise data corresponding to the ith training.
Illustratively, when i is not equal to 1, determining the training sample set corresponding to the ith round of training requires the training sample set of the (i-1)th round of training. For example, the training sample set used in the (i-1)th round of training is obtained, and noise data is screened from it; the screening method may refer to the description in S201 and is not repeated here. In the training sample set used in the (i-1)th round of training, the data remaining after the newly screened noise data is excluded is the non-noise data, i.e., the non-noise data corresponding to the ith round of training.
And then, determining information error data and label error data in the noise data screened at the moment, removing the information error data at the moment and updating the label error data at the moment to obtain new target noise data. The new target noise data is the target noise data corresponding to the ith training.
In the embodiment, the non-noise data corresponding to the current training is determined on the basis of the previous training, so that the quality of the samples of the current training is ensured.
It should be noted that the specific processes of screening the information error data and the label error data in the noise data corresponding to the (i-1)-th round of training, removing that information error data, and updating that label error data may refer to the descriptions in S102 and S103 and are not repeated here.
In this embodiment, the target noise data corresponding to the current round of training are determined on the basis of the previous round of training, so that the quality of the samples in the current round of training is ensured.
S3049: a training sample set corresponding to the i-th round of training is formed according to the non-noise data corresponding to the i-th round of training and the target noise data corresponding to the i-th round of training.
Illustratively, the non-noise data corresponding to the i-th round of training and the target noise data corresponding to the i-th round of training are together taken as the training sample set corresponding to the i-th round of training.
That is, the non-noise data and the new target noise data obtained from the training sample set used in the previous round of training are taken as the training sample set of the current round of training.
In this embodiment, the training sample set required by the i-th round of training is determined on the basis of the training sample set corresponding to the (i-1)-th round of training, which ensures training efficiency and accuracy. Because the training sample set contains the target noise data, the number of samples required for training is ensured; and because the labels in the target noise data have been corrected, their accuracy is guaranteed, so training the enterprise classification model on the target noise data can improve the accuracy of the model.
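The round-to-round construction described above can be sketched as follows. This is a minimal illustration, not the claimed implementation: `screen_noise`, `split_noise` and `update_labels` are hypothetical stand-ins for the screening (S201), error-splitting (S102) and label-update (S103) steps.

```python
def build_round_training_set(prev_training_set, screen_noise, split_noise, update_labels):
    """Form the round-i training set from the round (i-1) set:
    screen noise, drop information-error data, update label-error data."""
    noise = screen_noise(prev_training_set)                # noise in the (i-1) set
    non_noise = [d for d in prev_training_set if d not in noise]
    info_err, label_err = split_noise(noise)               # split by confidence
    target_noise = update_labels(label_err)                # info_err is discarded
    return non_noise + target_noise                        # round-i training sample set
```

For example, with toy stand-in functions, screening {3, 4} as noise, discarding 4 as information-error data and relabeling 3 yields the next round's sample set.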
Optionally, in one possible implementation, during each round of training, the label error data in the training sample set may be updated, as may the weight of each data item in the training sample set. For example, in the i-th round of training, the joint distribution probability of the training sample set used for that round is calculated; according to this joint distribution probability, the weight of each data item in the training sample set used in the i-th round is adjusted, and the adjusted weights are used to weight the loss function in the (i+1)-th round of training.
Illustratively, during the i-th round of training, the joint distribution probability of the training sample set used for the i-th round is calculated; the specific calculation method may refer to the description in S2013 and is not repeated here. According to the joint distribution probability of the i-th round, the weight of each data item in the training sample set used in the i-th round is updated, which may specifically be realized by the following formula:
In the above formula (10), k denotes the original label of the i-th training sample and l its label predicted in the i-th round of training, and the remaining symbol represents the sample weight of the i-th training sample in the i-th round of training.
By selecting the training sample set in this way and estimating the joint distribution probability, the model's learning of correct samples in the next round of training is strengthened. When calculating the loss function, noise data are given a lower weight and correctly classified data a higher weight: a high joint distribution probability for a noise sample indicates a high probability that the sample is erroneous, so the opposite number of the joint distribution probability is used to reduce its weight; conversely, for correctly classified data, the joint distribution probability is used to increase the weight.
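Since the image of formula (10) is not reproduced in the text, the following is only a hypothetical reading of the described behavior: for a sample flagged as noise, its joint-distribution probability q lowers the weight (via the opposite number −q), while for correctly classified data q raises it.

```python
def adjust_sample_weights(weights, joint_probs, is_noise):
    """Hypothetical weight update: noise samples with high joint-distribution
    probability q are down-weighted; correct samples are up-weighted by q."""
    return [w * (1.0 - q) if noisy else w * (1.0 + q)
            for w, q, noisy in zip(weights, joint_probs, is_noise)]
```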
Optionally, so that the sum of the loss-function weights calculated at each iteration approximates the batch_size, the sample weights are converted as follows:
In the above formula (11), the left-hand symbol represents the converted sample weight of the i-th training sample.
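The image of formula (11) is likewise absent, but the stated intent, rescaling sample weights so that their sum over a batch approximates the batch_size, can be sketched directly:

```python
def rescale_to_batch(weights, batch_size):
    """Rescale sample weights so their sum equals batch_size, keeping the
    loss magnitude comparable across iterations (sketch of formula (11))."""
    total = sum(weights)
    return [w * batch_size / total for w in weights]
```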
Optionally, in a possible implementation, S305 and S306 may also be included after S304, which is specifically as follows.
S305: and obtaining data to be classified of the enterprise.
The data to be classified of the enterprise may have the same content as the enterprise data in S101; that is, they may also include the enterprise name, business scope, main products, keywords describing the enterprise, enterprise profile, enterprise scale, product information, research results, and the like. In this embodiment the trained enterprise classification model set is being applied rather than trained, so the data to be classified of the enterprise carry no label.
S306: and inputting the data to be classified into an enterprise classification model set for processing to obtain an enterprise industry classification result.
For example, when the enterprise classification model set only includes one enterprise classification model, the data to be classified is processed through the enterprise classification model, and the output result of the enterprise classification model is the industry classification result of the enterprise. When the enterprise classification model set comprises a plurality of enterprise classification models, processing data to be classified through each enterprise classification model in the enterprise classification model set to obtain a plurality of industry classification results, and determining a final industry classification result of the enterprise according to the plurality of industry classification results.
In this embodiment, since the trained enterprise classification model set is obtained through the training in S304, each enterprise classification model in the set is highly accurate, and processing the data to be classified through the enterprise classification model set improves the accuracy of the final industry classification result of the enterprise.
Alternatively, in one possible implementation, S306 may include S3061-S3063, as follows.
S3061: and predicting the data to be classified through each enterprise classification model in the enterprise classification model set to obtain a plurality of prediction results.
Illustratively, when the set of enterprise classification models includes a plurality of enterprise classification models, the data to be classified is processed by each enterprise classification model in the set of enterprise classification models, each enterprise classification model outputting a prediction result.
S3062: and obtaining the model weight corresponding to each enterprise classification model.
The model weight corresponding to each enterprise classification model may be calculated by the following formula.
In the above formula (12), α_M represents the weight of the enterprise classification model corresponding to the current round M, the sample-weight symbol represents the weight of the i-th training sample in that round, and e_M represents the weighted classification accuracy of the current round-M enterprise classification model.
S3063: and determining an industry classification result of the enterprise according to the plurality of prediction results and the weight of each model.
Illustratively, the final industry classification result of the enterprise is calculated from the plurality of prediction results and the model weights, which may specifically be realized by the following formula.
In the above formula (13), f(x) is a vector; the class with the maximum value of f(x) is taken as the final industry classification result of the enterprise, α_M represents the weight of the enterprise classification model corresponding to round M, and G_M(x) represents the prediction result of each round's model.
In this embodiment, the industry classification result of the enterprise is determined through a plurality of enterprise classification models in the enterprise classification model set, which further improves the accuracy of the industry classification result.
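The linear weighting of formula (13) can be sketched as follows; each G_M(x) is treated here as a model's class-probability vector for one sample, and the index of the maximum entry of f(x) is returned as the final class.

```python
def ensemble_predict(per_model_probs, alphas):
    """f(x) = sum over M of alpha_M * G_M(x); the argmax entry of f(x)
    is taken as the final industry class (sketch of formula (13))."""
    n_classes = len(per_model_probs[0])
    f = [sum(a * probs[c] for a, probs in zip(alphas, per_model_probs))
         for c in range(n_classes)]
    return max(range(n_classes), key=f.__getitem__)
```

For two models with weights 1.0 and 2.0, the second model's more confident vote dominates the weighted sum.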
Alternatively, in one possible implementation, a batch of enterprise data may first be acquired and processed, e.g., text vectorization and computing a confidence for each data item. A classification model is constructed, the weight of each data item is initialized, and the classification model is trained with the K-fold cross-validation method to obtain out-of-sample (out-of-fold) prediction probabilities for all enterprise data.
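A minimal sketch of out-of-fold prediction: each sample's class probabilities come from a model trained on the other K-1 folds, so no model scores a sample whose (possibly noisy) label it saw during training. `fit_predict(train, test)` is a hypothetical helper that trains on `train` pairs and returns probabilities for `test`.

```python
def out_of_fold_probs(samples, labels, k, fit_predict):
    """K-fold out-of-fold prediction probabilities for every sample."""
    n = len(samples)
    probs = [None] * n
    folds = [list(range(f, n, k)) for f in range(k)]   # simple striped folds
    for fold in folds:
        train = [(samples[i], labels[i]) for i in range(n) if i not in fold]
        preds = fit_predict(train, [samples[i] for i in fold])
        for i, p in zip(fold, preds):
            probs[i] = p
    return probs
```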
Noise samples are filtered based on confidence learning and the weight of each data item is updated. Specifically, the real label of each enterprise data item is predicted from its prediction probability; the joint probability distribution of the original label and the real label is estimated from the original and real labels of each data item; and noise data are screened from the plurality of enterprise data based on the joint probability distribution and a preset screening strategy. The weight of each data item is then updated according to the result of confidence learning to strengthen the learning of correct data in the next round of training: data whose predicted industry category is consistent with the original label (correct data) are given a larger sample weight in the next round, while data whose predicted category is inconsistent with the original label (noise data) are given a smaller weight.
The data are then updated according to the prediction results and the noise-screening outcome. Specifically, among the screened noise data, the data with low confidence are taken as information error data, the remaining data are taken as label error data, and the information error data are removed.
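The split of screened noise data into the two error classes follows the text directly: confidence below the preset threshold means information error data (to be removed), the rest is label error data (to be relabeled). A minimal sketch:

```python
def split_noise(noise_data, confidences, threshold):
    """Split noise data by confidence: below threshold -> information error
    data (removed); at or above threshold -> label error data (kept)."""
    info_err = [d for d, c in zip(noise_data, confidences) if c < threshold]
    label_err = [d for d, c in zip(noise_data, confidences) if c >= threshold]
    return info_err, label_err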
To enhance the effect of the next round of iterative training, the label error data are updated. A label-update strategy is formulated according to the prediction probability level of each label; label-updatable data and label-non-updatable data in the label error data are determined according to the strategy; the labels of the label-updatable data are updated, and the original labels of the label-non-updatable data are kept.
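One plausible form of such a label-update strategy (the text does not fix the exact rule, and the threshold `p_update` is an assumed parameter): a label-error sample whose top predicted-class probability is high enough is deemed updatable and takes the predicted class as its new label; otherwise its original label is kept.

```python
def update_labels(labels, probs, p_update=0.9):
    """Assumed label-update rule: replace the label with the predicted class
    only when the top class probability exceeds p_update."""
    new_labels = []
    for label, p in zip(labels, probs):
        best = max(range(len(p)), key=p.__getitem__)
        new_labels.append(best if p[best] > p_update else label)
    return new_labels
```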
Training and data cleaning are iterated until the growth rate of the number of recalled (updated) data converges, which can be understood as the number of noise data screened each time becoming smaller and gradually approaching a fixed value. The updated data and their weights are applied to the next round of training; once the growth rate of the recall count has converged, relatively correct enterprise data are obtained. A trusted enterprise classification model is then trained using this relatively correct enterprise data set.
Each enterprise classification model is given a different weight according to its classification accuracy, and the prediction results of the enterprise classification models are combined by linear weighting to obtain the final industry classification result.
In this implementation, the industry-related information of the enterprise is used as features and the label as the response variable; the joint probability distribution of the original label and the real label is estimated from the classification model's predictions on the data, noise data are screened, and a confidence index is constructed to screen out information error data. Operations such as removing information error data, updating label error data, and retaining correct data are performed on the original enterprise data, thereby cleaning the data. The data and their weights are updated each round; after several iterations the data-update rate converges, yielding relatively clean training data. Each enterprise classification model is then given a different weight according to its classification accuracy, and the predictions of the multiple enterprise classification models are combined by linear weighting to obtain the final industry classification result, so that the industry classification of the enterprise data can be corrected.
Referring to fig. 6, fig. 6 is a schematic diagram of an apparatus for processing noise data according to an embodiment of the application. The apparatus for processing noise data comprises units for performing the steps in the corresponding embodiments of figs. 1 to 5; for details, refer to the related descriptions in those embodiments. For convenience of explanation, only the portions related to the present embodiment are shown. Referring to fig. 6, the apparatus comprises:
an obtaining unit 410, configured to obtain noise data screened from a plurality of enterprise data, where the noise data has a tag, and the tag is used to identify an industry category of an enterprise;
A determining unit 420, configured to determine information error data and tag error data in the noise data, where the information error data is data with a confidence level less than a preset threshold, and the tag error data is data in the noise data except for the information error data;
The processing unit 430 is configured to reject the information error data and update the tag error data for the noise data, so as to obtain target noise data, where the target noise data is used to train the enterprise classification model.
Optionally, the determining unit 420 is specifically configured to: acquiring enterprise related information of each data in the noise data; according to each enterprise related information, calculating text information entropy, text similarity and/or vector distance of each data; determining the confidence coefficient of each data according to the text information entropy, the text similarity and the vector distance; determining the data with the confidence coefficient smaller than the preset threshold value as the information error data; and determining the data with the confidence coefficient larger than or equal to the preset threshold value as the label error data.
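The "text information entropy" term of the confidence score is not defined further in the text; one plausible realization, assumed here for illustration, is the Shannon entropy of the character distribution of the enterprise-related text (low entropy suggesting uninformative, e.g. repetitive, descriptions):

```python
import math
from collections import Counter

def text_entropy(text):
    """Shannon entropy (bits) of the character distribution of a text,
    as one possible 'text information entropy' feature."""
    counts = Counter(text)
    n = len(text)
    return -sum(c / n * math.log2(c / n) for c in counts.values())
```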
Optionally, the processing unit 430 is specifically configured to: removing the information error data from the noise data; determining label updatable data and label non-updatable data in the label error data according to a preset updating strategy; updating the label of the label updatable data to obtain updated data; and determining the target noise data according to the updating data and the label non-updating data.
Optionally, the apparatus further comprises: and the screening unit is used for screening the noise data from the enterprise data by using confidence learning and a preset screening strategy.
Optionally, each enterprise data carries an original tag, and the screening unit is specifically configured to: processing the plurality of enterprise data by using the constructed classification model to obtain the prediction probability of each enterprise data; predicting the real label of each enterprise data according to the prediction probability of each enterprise data; estimating joint probability distribution of the original tag and the real tag according to the original tag and the real tag of each enterprise data; and screening enterprise data conforming to the screening strategy from the plurality of enterprise data based on the joint probability distribution to obtain the noise data, wherein the screening strategy comprises a noise rate screening strategy and/or an industry category screening strategy.
Optionally, the apparatus further comprises: the training unit is used for performing M-turn training on the basic model by utilizing the target noise data to obtain an enterprise classification model set, M is a positive integer, the enterprise classification model set comprises M enterprise classification models, and the basic models adopted by each turn of training are different.
Optionally, the training unit is specifically configured to: determining a training sample set corresponding to the ith training, wherein i is a positive integer and is sequentially increased, i is less than or equal to M, and the training sample set adopted by each training is different; determining a basic model corresponding to the ith training; training the basic model corresponding to the ith training according to the training sample set corresponding to the ith training to obtain an enterprise classification model corresponding to the ith training; and forming the enterprise classification model set according to the enterprise classification model obtained by training each round.
Optionally, the training unit is further configured to: acquiring non-noise data of the plurality of enterprise data when i=1; constructing the training sample set according to the non-noise data and the target noise data; when i is not equal to 1, determining non-noise data corresponding to the ith training; determining target noise data corresponding to the ith training; and forming a training sample set corresponding to the ith wheel training according to the non-noise data corresponding to the ith wheel training and the target noise data corresponding to the ith wheel training.
Optionally, the training unit is further configured to: determining noise data in a training sample set corresponding to the i-1 th round of training; determining information error data and label error data in noise data corresponding to the i-1 th training; and eliminating information error data in the noise data corresponding to the i-1 th wheel training, and updating label error data in the noise data corresponding to the i-1 th wheel training to obtain target noise data corresponding to the i-1 th wheel training.
Optionally, the apparatus further comprises: the adjusting unit is used for calculating the joint distribution probability of the training sample set used for the ith round of training in the ith round of training process; according to the joint distribution probability of the ith training, the weight of each data in the training sample set used in the ith training is adjusted, and the adjusted weight is used for adjusting the weight of the loss function in the (i+1) th training.
Optionally, the apparatus further comprises: the using unit is used for acquiring data to be classified of the enterprise; and inputting the data to be classified into the enterprise classification model set for processing to obtain an industry classification result of the enterprise.
Optionally, the usage unit is further configured to: predicting the data to be classified through each enterprise classification model in the enterprise classification model set to obtain a plurality of prediction results; obtaining model weights corresponding to each enterprise classification model; and determining an industry classification result of the enterprise according to the plurality of prediction results and each model weight.
Referring to fig. 7, fig. 7 is a schematic diagram of an apparatus for processing noise data according to another embodiment of the present application. As shown in fig. 7, the apparatus 5 for processing noise data of this embodiment includes: a processor 50, a memory 51 and a computer program 52 stored in said memory 51 and executable on said processor 50. The processor 50, when executing the computer program 52, implements the steps of the respective method embodiments for processing noise data described above, such as S101 to S103 shown in fig. 1. Alternatively, when executing the computer program 52, the processor 50 implements the functions of the units in the embodiments described above, for example the units 410 to 430 shown in fig. 6.
Illustratively, the computer program 52 may be partitioned into one or more units that are stored in the memory 51 and executed by the processor 50 to complete the present application. The one or more units may be a series of computer instruction segments capable of performing a specific function for describing the execution of the computer program 52 in the device 5 for processing noise data. For example, the computer program 52 may be divided into an acquisition unit, a determination unit and a processing unit, each unit functioning specifically as described above.
The device may include, but is not limited to, a processor 50, a memory 51. It will be appreciated by those skilled in the art that fig. 7 is merely an example of a device 5 that processes noise data and is not limiting of the device, and may include more or fewer components than shown, or may combine certain components, or different components, e.g., the device may also include input-output devices, network access devices, buses, etc.
The processor 50 may be a central processing unit (CPU), another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor.
The memory 51 may be an internal storage unit of the device, such as a hard disk or memory of the device. The memory 51 may also be an external storage terminal of the device, such as a plug-in hard disk, a smart media card (SMC), a secure digital (SD) card, or a flash card provided on the device. Further, the memory 51 may also include both an internal storage unit and an external storage terminal of the device. The memory 51 is used for storing the computer instructions and other programs and data required by the terminal, and may also be used to temporarily store data that has been output or is to be output.
The embodiment of the application also provides a computer storage medium, which can be nonvolatile or volatile, and stores a computer program, and the computer program is executed by a processor to implement the steps in the above-mentioned method embodiments for processing noise data.
The application also provides a computer program product which, when run on a device, causes the device to perform the steps of the respective method embodiments described above for processing noise data.
The embodiment of the application also provides a chip or an integrated circuit, which comprises: and a processor for calling and running the computer program from the memory, so that the device on which the chip or the integrated circuit is mounted performs the steps in the respective method embodiments for processing noise data.
It will be apparent to those skilled in the art that, for convenience and brevity of description, only the above-described division of the functional units and modules is illustrated, and in practical application, the above-described functional distribution may be performed by different functional units and modules according to needs, i.e. the internal structure of the apparatus is divided into different functional units or modules to perform all or part of the above-described functions. The functional units and modules in the embodiment may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit, where the integrated units may be implemented in a form of hardware or a form of a software functional unit. In addition, the specific names of the functional units and modules are only for distinguishing from each other, and are not used for limiting the protection scope of the present application. The specific working process of the units and modules in the above system may refer to the corresponding process in the foregoing method embodiment, which is not described herein again.
In the foregoing embodiments, the descriptions of the embodiments are emphasized, and in part, not described or illustrated in any particular embodiment, reference is made to the related descriptions of other embodiments.
Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
The above embodiments are only for illustrating the technical solution of the present application, and not for limiting the same; although the application has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present application, and are intended to be included in the scope of the present application.

Claims (14)

1. A method of processing noise data, comprising:
acquiring noise data screened from a plurality of enterprise data, wherein the noise data is provided with a label, and the label is used for identifying the industry category of an enterprise;
Determining information error data and label error data in the noise data, wherein the information error data is data with confidence coefficient smaller than a preset threshold value, and the label error data is data except the information error data in the noise data;
And aiming at the noise data, eliminating the information error data and updating the label error data to obtain target noise data, wherein the target noise data is used for training an enterprise classification model.
2. The method of claim 1, wherein the determining information error data and tag error data in the noise data comprises:
Acquiring enterprise related information of each data in the noise data;
According to each enterprise related information, calculating text information entropy, text similarity and vector distance of each data;
determining the confidence coefficient of each data according to the text information entropy, the text similarity and/or the vector distance;
determining the data with the confidence coefficient smaller than the preset threshold value as the information error data;
And determining the data with the confidence coefficient larger than or equal to the preset threshold value as the label error data.
3. The method of claim 1, wherein said culling the information error data and updating the tag error data to obtain target noise data comprises:
Removing the information error data from the noise data;
Determining label updatable data and label non-updatable data in the label error data according to a preset updating strategy;
updating the label of the label updatable data to obtain updated data;
and determining the target noise data according to the updating data and the label non-updating data.
4. A method according to any one of claims 1 to 3, wherein prior to said obtaining noise data screened out of a plurality of enterprise data, the method further comprises:
and screening the noise data from the plurality of enterprise data by using confidence learning and a preset screening strategy.
5. The method of claim 4, wherein each enterprise data is provided with an original tag, wherein the screening the noise data from the plurality of enterprise data using confidence learning and a preset screening policy comprises:
processing the plurality of enterprise data by using the constructed classification model to obtain the prediction probability of each enterprise data;
predicting the real label of each enterprise data according to the prediction probability of each enterprise data;
Estimating joint probability distribution of the original tag and the real tag according to the original tag and the real tag of each enterprise data;
and screening enterprise data conforming to the screening strategy from the plurality of enterprise data based on the joint probability distribution to obtain the noise data, wherein the screening strategy comprises a noise rate screening strategy and/or an industry category screening strategy.
6. The method of any one of claims 1 to 5, further comprising:
And performing M rounds of training on the basic model by utilizing the target noise data to obtain an enterprise classification model set, wherein M is a positive integer, the enterprise classification model set comprises M enterprise classification models, and the basic model adopted in each round of training is different.
7. The method of claim 6, wherein the performing M rounds of training on the basic model by utilizing the target noise data to obtain the enterprise classification model set comprises:
determining a training sample set corresponding to the ith training, wherein i is a positive integer and is sequentially increased, i is less than or equal to M, and the training sample set adopted by each training is different;
determining a basic model corresponding to the ith training;
Training the basic model corresponding to the ith training according to the training sample set corresponding to the ith training to obtain an enterprise classification model corresponding to the ith training;
And forming the enterprise classification model set according to the enterprise classification model obtained by training each round.
8. The method of claim 7, wherein the determining the training sample set corresponding to the ith round of training comprises:
acquiring non-noise data of the plurality of enterprise data when i=1;
Constructing the training sample set according to the non-noise data and the target noise data;
When i is not equal to 1, determining non-noise data corresponding to the ith training;
determining target noise data corresponding to the ith training;
and forming a training sample set corresponding to the i-th round of training according to the non-noise data corresponding to the i-th round of training and the target noise data corresponding to the i-th round of training.
9. The method of claim 8, wherein the determining the target noise data for the ith training round comprises:
Determining noise data in a training sample set corresponding to the i-1 th round of training;
determining information error data and label error data in noise data corresponding to the i-1 th training;
And eliminating information error data in the noise data corresponding to the (i-1)-th round of training, and updating label error data in the noise data corresponding to the (i-1)-th round of training, to obtain target noise data corresponding to the i-th round of training.
10. The method of claim 6, wherein the method further comprises:
in the ith training process, calculating the joint distribution probability of a training sample set used in the ith training;
according to the joint distribution probability of the ith training, the weight of each data in the training sample set used in the ith training is adjusted, and the adjusted weight is used for adjusting the weight of the loss function in the (i+1) th training.
11. The method of claim 6, wherein after performing M-round training on the base model using the target noise data to obtain the set of enterprise classification models, the method further comprises:
Acquiring data to be classified of an enterprise;
And inputting the data to be classified into the enterprise classification model set for processing to obtain an industry classification result of the enterprise.
12. The method of claim 11, wherein the inputting the data to be classified into the enterprise classification model set for processing to obtain an industry classification result of the enterprise comprises:
predicting the data to be classified through each enterprise classification model in the enterprise classification model set to obtain a plurality of prediction results;
obtaining a model weight corresponding to each enterprise classification model;
and determining the industry classification result of the enterprise according to the plurality of prediction results and each model weight.
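Combining the M models' prediction results with per-model weights, as in claim 12, can be sketched as weighted soft voting. How the model weights are obtained is left open by the claim, so they are taken as given inputs here.

```python
import numpy as np

def ensemble_classify(pred_probs_list, model_weights):
    """Combine per-model class-probability vectors by weighted averaging
    and return the industry class with the highest combined score."""
    weights = np.asarray(model_weights, dtype=float)
    weights = weights / weights.sum()       # normalise the model weights
    stacked = np.stack(pred_probs_list)     # shape (M, n_classes)
    combined = (weights[:, None] * stacked).sum(axis=0)
    return int(combined.argmax())
```

For example, with two models weighted 0.7 and 0.3 predicting [0.6, 0.4] and [0.2, 0.8], the combined scores are [0.48, 0.52], so class 1 is returned.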
13. An apparatus for processing noise data, comprising:
an acquisition unit, configured to acquire noise data screened from a plurality of pieces of enterprise data, wherein the noise data carries a label, and the label is used for identifying an industry category of an enterprise;
a determining unit, configured to determine information error data and label error data in the noise data, wherein the information error data is data whose confidence is smaller than a preset threshold;
and a processing unit, configured to eliminate the information error data and update the label error data to obtain target noise data, wherein the target noise data is used for training an enterprise classification model.
14. A computer readable storage medium storing a computer program, which when executed by a processor implements the method of any one of claims 1 to 12.
CN202211396523.3A 2022-11-09 2022-11-09 Method, device, equipment and storage medium for processing noise data Pending CN118013188A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211396523.3A CN118013188A (en) 2022-11-09 2022-11-09 Method, device, equipment and storage medium for processing noise data


Publications (1)

Publication Number Publication Date
CN118013188A true CN118013188A (en) 2024-05-10

Family

ID=90953042

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211396523.3A Pending CN118013188A (en) 2022-11-09 2022-11-09 Method, device, equipment and storage medium for processing noise data

Country Status (1)

Country Link
CN (1) CN118013188A (en)

Similar Documents

Publication Publication Date Title
TWI769754B (en) Method and device for determining target business model based on privacy protection
AU2021232839B2 (en) Updating Attribute Data Structures to Indicate Trends in Attribute Data Provided to Automated Modelling Systems
US10013636B2 (en) Image object category recognition method and device
CN109948149B (en) Text classification method and device
CN107766929B (en) Model analysis method and device
CN111783875A (en) Abnormal user detection method, device, equipment and medium based on cluster analysis
CN111476256A (en) Model training method and device based on semi-supervised learning and electronic equipment
CN110046634B (en) Interpretation method and device of clustering result
CN110364185B (en) Emotion recognition method based on voice data, terminal equipment and medium
CN111507470A (en) Abnormal account identification method and device
CN110991474A (en) Machine learning modeling platform
CN110135681A (en) Risk subscribers recognition methods, device, readable storage medium storing program for executing and terminal device
CN111626821A (en) Product recommendation method and system for realizing customer classification based on integrated feature selection
CN112529638B (en) Service demand dynamic prediction method and system based on user classification and deep learning
CN112784031B (en) Method and system for classifying customer service conversation texts based on small sample learning
CN116596095B (en) Training method and device of carbon emission prediction model based on machine learning
WO2020024444A1 (en) Group performance grade recognition method and apparatus, and storage medium and computer device
CN114332500A (en) Image processing model training method and device, computer equipment and storage medium
CN112884570A (en) Method, device and equipment for determining model security
CN111783883A (en) Abnormal data detection method and device
CN116204647A (en) Method and device for establishing target comparison learning model and text clustering
CN112463964B (en) Text classification and model training method, device, equipment and storage medium
CN115641201A (en) Data anomaly detection method, system, terminal device and storage medium
CN115796635A (en) Bank digital transformation maturity evaluation system based on big data and machine learning
US20210241147A1 (en) Method and device for predicting pair of similar questions and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination