CN114429581A - Data processing method and device, equipment, medium and product

Data processing method and device, equipment, medium and product

Info

Publication number
CN114429581A
Authority
CN
China
Prior art keywords
sample data
data
noise
target
piece
Prior art date
Legal status
Pending
Application number
CN202210097295.3A
Other languages
Chinese (zh)
Inventor
李鑫
温圣召
张刚
冯浩城
Current Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN202210097295.3A
Publication of CN114429581A

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00: Pattern recognition
    • G06F 18/20: Analysing
    • G06F 18/22: Matching criteria, e.g. proximity measures

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Image Analysis (AREA)

Abstract

The present disclosure provides a data processing method, apparatus, device, medium, and product, which relate to the field of artificial intelligence technology, in particular to deep learning and computer vision technology, and can be applied to scenarios such as face recognition. The specific implementation scheme includes: performing feature extraction on sample data to obtain sample data features; determining whether the sample data is noise sample data according to the sample data features and preset noise data features; and, in response to determining that the sample data is non-noise sample data, performing deep network model training using the sample data.

Description

Data processing method and device, equipment, medium and product
Technical Field
The present disclosure relates to the field of artificial intelligence technology, and more particularly to the fields of deep learning and computer vision technology, and is applicable to scenarios such as face recognition.
Background
Screening noise sample data out of a sample data set can effectively improve the optimization efficiency and training effect of a deep network model. In some scenarios, however, noise sample screening suffers from low screening efficiency and high cost.
Disclosure of Invention
The present disclosure provides a data processing method and apparatus, device, medium and product.
According to an aspect of the present disclosure, there is provided a data processing method including: performing feature extraction on the sample data to obtain sample data features; determining whether the sample data is noise sample data or not according to the sample data characteristics and preset noise data characteristics; and in response to determining that the sample data is non-noise sample data, performing deep network model training using the sample data.
According to another aspect of the present disclosure, there is provided a data processing apparatus including: a first processing module configured to perform feature extraction on sample data to obtain sample data characteristics; a second processing module configured to determine whether the sample data is noise sample data according to the sample data characteristics and preset noise data characteristics; and a third processing module configured to perform deep network model training using the sample data in response to determining that the sample data is non-noise sample data.
According to another aspect of the present disclosure, there is provided an electronic device including: at least one processor and a memory communicatively coupled to the at least one processor. The memory stores instructions executable by the at least one processor, and the instructions, when executed by the at least one processor, enable the at least one processor to perform the data processing method described above.
According to another aspect of the present disclosure, there is provided a non-transitory computer-readable storage medium storing computer instructions for causing the computer to perform the above-described data processing method.
According to another aspect of the present disclosure, a computer program product is provided, comprising a computer program which, when executed by a processor, implements the data processing method described above.
It should be understood that the statements in this section do not necessarily identify key or critical features of the embodiments of the present disclosure, nor do they limit the scope of the present disclosure. Other features of the present disclosure will become apparent from the following description.
Drawings
The drawings are included to provide a better understanding of the present solution and are not to be construed as limiting the present disclosure. Wherein:
fig. 1 schematically shows a system architecture of a data processing method and apparatus according to an embodiment of the present disclosure;
FIG. 2 schematically shows a flow diagram of a data processing method according to an embodiment of the present disclosure;
FIG. 3 schematically illustrates a flow chart of a method of determining noise-like data characteristics according to an embodiment of the present disclosure;
FIG. 4 schematically shows a schematic diagram of a data processing procedure according to an embodiment of the present disclosure;
FIG. 5 schematically shows a block diagram of a data processing apparatus according to an embodiment of the present disclosure;
fig. 6 schematically shows a block diagram of an electronic device for performing data processing according to an embodiment of the present disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described below with reference to the accompanying drawings, in which various details of the embodiments of the disclosure are included to assist understanding, and which are to be considered as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the disclosure. The terms "comprises," "comprising," and the like, as used herein, specify the presence of stated features, steps, operations, and/or components, but do not preclude the presence or addition of one or more other features, steps, operations, or components.
All terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art unless otherwise defined. It is noted that the terms used herein should be interpreted as having a meaning that is consistent with the context of this specification and should not be interpreted in an idealized or overly formal sense.
Where a convention analogous to "at least one of A, B, and C, etc." is used, such a construction is generally intended in the sense one having skill in the art would understand the convention (e.g., "a system having at least one of A, B, and C" would include but not be limited to systems that have A alone, B alone, C alone, A and B together, A and C together, B and C together, and/or A, B, and C together, etc.).
The embodiments of the present disclosure provide a data processing method. The data processing method includes: performing feature extraction on sample data to obtain sample data features; determining whether the sample data is noise sample data according to the sample data features and preset noise data features; and, in response to determining that the sample data is non-noise sample data, performing deep network model training using the sample data.
Fig. 1 schematically shows a system architecture of a data processing method and apparatus according to an embodiment of the present disclosure. It should be noted that fig. 1 is only an example of a system architecture to which the embodiments of the present disclosure may be applied to help those skilled in the art understand the technical content of the present disclosure, and does not mean that the embodiments of the present disclosure may not be applied to other devices, systems, environments or scenarios.
The system architecture 100 according to this embodiment may include a data collection end 101, a network 102, and a server 103. The network 102 is the medium used to provide a communication link between the data collection end 101 and the server 103, and may include various connection types, such as wired or wireless communication links or fiber optic cables. The server 103 may be an independent physical server, a server cluster or a distributed system formed by a plurality of physical servers, or a cloud server providing basic cloud computing services such as cloud services, cloud computing, network services, and middleware services.
The data acquisition terminal 101 interacts with the server 103 through the network 102 to receive or transmit data and the like. The data collection end 101 may be configured to collect sample data for training the deep network model, where the sample data may be, for example, unlabeled data returned from the production environment.
The server 103 may be a server providing various services, for example, a background processing server (for example only) for filtering sample data provided by the data acquisition end 101. The background processing server can screen the received sample data set to remove noise sample data in the sample data set, so as to obtain normal sample data used for training the deep network model.
For example, the server 103 receives sample data from the data acquisition terminal 101, performs feature extraction on the sample data to obtain sample data features, determines whether the sample data is noise sample data according to the sample data features and preset noise data features, and performs deep network model training using the sample data in response to determining that the sample data is non-noise sample data.
It should be noted that the data processing method provided by the embodiment of the present disclosure may be executed by the server 103. Accordingly, the data processing apparatus provided by the embodiment of the present disclosure may be disposed in the server 103. The data processing method provided by the embodiment of the present disclosure may also be executed by a server or a server cluster that is different from the server 103 and can communicate with the data acquisition terminal 101 and/or the server 103. Correspondingly, the data processing apparatus provided by the embodiment of the present disclosure may also be disposed in a server or a server cluster that is different from the server 103 and is capable of communicating with the data acquisition terminal 101 and/or the server 103.
It should be understood that the number of data collection terminals, networks, and servers in fig. 1 is merely illustrative. There may be any number of data collection terminals, networks, and servers, as desired for implementation.
The embodiment of the present disclosure provides a data processing method, and a data processing method according to an exemplary embodiment of the present disclosure is described below with reference to fig. 2 to 4 in conjunction with the system architecture of fig. 1. The data processing method of the embodiment of the present disclosure may be executed by the server 103 shown in fig. 1, for example.
Fig. 2 schematically shows a flow chart of a data processing method according to an embodiment of the present disclosure.
As shown in fig. 2, the data processing method 200 of the embodiment of the present disclosure may include, for example, operations S210 to S230.
In operation S210, feature extraction is performed on the sample data to obtain sample data features.
In operation S220, it is determined whether the sample data is noise sample data according to the sample data characteristic and the preset noise data characteristic.
In operation S230, in response to determining that the sample data is non-noise sample data, deep network model training is performed using the sample data.
An exemplary flow of each operation of the data processing method of the present embodiment is illustrated below.
Sample data may be obtained in various public, legally compliant ways, for example from a public data set, or collected by the data collection end after obtaining the authorization of the users associated with the sample data. The sample data is not scene data targeted at any specific user and does not reflect the personal information of any specific user.
In response to receiving a sample data set, the execution entity of the data processing method performs feature extraction on the sample data in the sample data set to obtain sample data features. The sample data may be unlabeled data flowed back from the production environment, and the sample data set may include both normal sample data and noise sample data.
Illustratively, in response to receiving a sample image set, feature extraction is performed on the sample images in the set to obtain sample image features. The sample image features may include, for example, histogram features, texture features, or Haar features. The sample image set may include normal sample images whose face quality scores meet a preset requirement, as well as low-quality face sample images or non-face sample images whose face quality scores do not meet the requirement; the low-quality face sample images and the non-face sample images constitute the noise sample images.
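As a minimal illustration only (the OpenCV-based reading, the 128-bin grayscale histogram, and the helper name below are assumptions for the sketch, not the feature extraction prescribed by the disclosure), histogram features for a sample image could be computed roughly as follows:

```python
import cv2
import numpy as np

def extract_sample_image_feature(image_path: str) -> np.ndarray:
    """Extract a simple feature vector for one sample image.

    Histogram, texture, and Haar features are all mentioned as possible sample
    image features; this sketch only implements a normalized grayscale
    intensity histogram as one illustrative choice.
    """
    image = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
    if image is None:
        raise ValueError(f"cannot read image: {image_path}")
    hist = cv2.calcHist([image], [0], None, [128], [0, 256]).flatten()
    return hist / (hist.sum() + 1e-12)  # normalize so features are comparable across images
```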
The face quality score may be a quality score that evaluates factors such as face symmetry, sharpness, and occlusion. It may be determined in a number of ways; for example, the face quality score of a face sample image may be evaluated using a pre-trained neural network model.
After the sample data features associated with the sample data are obtained, whether the sample data is noise sample data is determined according to the sample data features and the preset noise data features. For example, the sample data may be determined to be noise sample data when its sample data features and the noise-like data features satisfy a preset similarity condition.
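As a minimal sketch of one possible similarity check (cosine similarity, the 0.75 threshold, and the helper name below are illustrative assumptions, not the preset similarity condition itself), the comparison against the noise-like data features could look like this:

```python
import numpy as np

def is_noise_sample(sample_feature: np.ndarray,
                    noise_class_features: np.ndarray,
                    threshold: float = 0.75) -> bool:
    """Return True if the sample feature is close enough to any noise-like data feature.

    sample_feature: (D,) feature vector extracted from one piece of sample data.
    noise_class_features: (K, D) matrix of preset noise-like (central) data features.
    threshold: assumed similarity threshold standing in for the preset condition.
    """
    # L2-normalize so that the dot product equals cosine similarity.
    s = sample_feature / (np.linalg.norm(sample_feature) + 1e-12)
    n = noise_class_features / (np.linalg.norm(noise_class_features, axis=1, keepdims=True) + 1e-12)
    similarities = n @ s  # (K,) cosine similarities against every noise-like feature
    return bool(np.max(similarities) >= threshold)
```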
In response to determining that the sample data is non-noise sample data, deep network model training is performed using the sample data. For example, when a face sample image is determined to be a non-noise sample image, that face sample image is used for deep network model training to obtain a face recognition model.
Determining whether sample data is noise sample data according to the similarity between its sample data features and the preset noise data features requires no additional manual labeling of the sample data, which effectively improves the screening efficiency of noise sample data and reduces the screening cost. Moreover, basing the decision on the similarity comparison result enables effective detection of, and early warning about, abnormal sample data.
In response to determining that the sample data is noise sample data, the noise sample data is removed from the sample data set and is not used for unsupervised or semi-supervised deep network model training. In addition, the noise data features can be updated with the sample data features of the noise sample data; continuously updating the noise data features effectively improves the filtering efficiency and filtering precision for noise sample data.
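One possible way to realize this update is a running average of the stored noise-like data feature, sketched below (the exponential moving average rule and the momentum value are assumptions; the disclosure does not fix a particular update rule):

```python
import numpy as np

def update_noise_class_feature(center: np.ndarray,
                               new_noise_feature: np.ndarray,
                               momentum: float = 0.9) -> np.ndarray:
    """Update one noise-like data feature with the feature of a newly identified noise sample."""
    # Blend the existing central feature with the new noise sample feature.
    return momentum * center + (1.0 - momentum) * new_noise_feature
```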
According to the embodiments of the present disclosure, feature extraction is performed on sample data to obtain sample data features, whether the sample data is noise sample data is determined according to the sample data features and preset noise data features, and deep network model training is performed using the sample data in response to determining that the sample data is non-noise sample data.
In this way, noise sample data can be screened out of the sample data set quickly and accurately according to the predetermined noise data features, without additionally labeling the sample data manually or additionally training a neural network model. This effectively reduces the cost and operational complexity of noise sample screening, improves the screening efficiency of noise sample data, and improves the optimization efficiency and training effect of the deep network model.
Fig. 3 schematically illustrates a flow chart of a method of determining noise-like data characteristics according to an embodiment of the present disclosure.
As shown in FIG. 3, method 300 may include operations S310-S340, for example.
In operation S310, feature extraction is performed on at least one piece of target sample data to obtain a target data feature associated with each piece of target sample data in the at least one piece of target sample data.
In operation S320, a distribution state parameter of the target data feature associated with each piece of target sample data in the feature space is determined.
In operation S330, at least one piece of noise sample data is determined according to the distribution state parameter of the target data feature associated with each piece of target sample data.
In operation S340, a noise-like data characteristic is determined according to a target data characteristic associated with at least one piece of noise sample data.
An exemplary flow of operations of the method of determining a characteristic of noise-like data of the present embodiment is illustrated below.
For example, the target sample data may include uncalibrated sample data whose evaluation index value does not meet a preset condition, and the evaluation index value may be obtained based on at least one of the following operations: sample data quality evaluation, target object alignment, target object detection and target object identification.
For example, when the sample data is a face sample image, the evaluation index value for the face sample image may be determined by detecting the face deflection angles in the image, which may include a yaw angle, a roll angle, and a pitch angle.
The uncalibrated sample data whose evaluation index value does not meet the preset condition is likely to be sample data that the deep network model cannot accurately recognize; determining the noise data features based on such sample data effectively improves the screening efficiency and screening precision of noise sample data.
The target sample data may also include calibrated misjudged sample data, blurred sample data, noise sample data, and the like. Using such sample data as seed data for the subsequent clustering of target sample data and determination of the noise data features helps to further improve the screening efficiency and screening precision of noise sample data.
Feature extraction is performed on the at least one piece of target sample data to obtain the target data feature associated with each piece of target sample data, and the distribution state parameter of each target data feature in the feature space is determined. The closer the distribution positions of target data features are in the feature space, the higher the similarity between the corresponding pieces of target sample data.
When determining the distribution state parameter of the target data feature associated with each piece of target sample data in the feature space, the at least one piece of target sample data may be clustered according to the target data features associated with each piece of target sample data to obtain at least one sample data cluster. The sample data cluster to which each piece of target sample data belongs constitutes the distribution state parameter of the target data feature associated with that target sample data.
Determining the distribution state parameters of the target data features in the feature space by clustering the at least one piece of target sample data, and then determining noise sample data based on those parameters, effectively improves the convenience and general applicability of noise sample screening and raises the screening efficiency of noise sample data.
When the at least one piece of target sample data is clustered, pieces of target sample data whose target data features are highly similar are grouped into the same sample data cluster. For example, an unsupervised clustering algorithm may be used, such as K-means clustering, hierarchical clustering, partitional clustering, single-linkage clustering, and the like.
For example, initial categories corresponding to a preset target category number may be created, and one piece of target sample data may be randomly allocated to each initial category. Each piece of target sample data is then allocated, according to its target data features and the central data features associated with each initial category, to the initial category with the maximum feature similarity, yielding a cluster sample set associated with each initial category.
After the cluster sample set associated with each initial category is obtained, the central data feature associated with each initial category is determined from the target data features of the target sample data in that cluster sample set. The similarity between any two initial categories is then calculated, for example as the similarity between their central data features. When two initial categories satisfy a preset similarity condition, their cluster sample sets are merged, and the merging continues for the merged categories until the similarity between any two categories in the merged clustering result is less than a preset threshold.
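The merge-based clustering just described could be sketched roughly as follows (random seeding, cosine similarity as the feature similarity, and the concrete similarity threshold are all illustrative assumptions):

```python
import numpy as np

def merge_cluster(target_features: np.ndarray, n_initial: int,
                  sim_threshold: float = 0.8):
    """Seed initial categories, assign samples by maximum feature similarity,
    then merge similar categories until all pairwise similarities fall below
    the preset threshold.

    target_features: (N, D) L2-normalized target data features.
    n_initial: preset target category number.
    Returns a list of index lists, one per resulting sample data cluster.
    """
    rng = np.random.default_rng(0)
    n = target_features.shape[0]
    # Randomly allocate one piece of target sample data to each initial category.
    seeds = rng.choice(n, size=n_initial, replace=False)
    centers = target_features[seeds]
    # Allocate every sample to the initial category with maximum feature similarity.
    assign = (target_features @ centers.T).argmax(axis=1)
    clusters = [list(np.where(assign == k)[0]) for k in range(n_initial)]
    clusters = [c for c in clusters if c]  # drop empty initial categories
    while len(clusters) > 1:
        # Recompute the central data feature associated with each category.
        centers = np.stack([target_features[c].mean(axis=0) for c in clusters])
        centers /= np.linalg.norm(centers, axis=1, keepdims=True) + 1e-12
        pair_sims = centers @ centers.T
        np.fill_diagonal(pair_sims, -1.0)
        i, j = np.unravel_index(pair_sims.argmax(), pair_sims.shape)
        if pair_sims[i, j] < sim_threshold:
            break  # every pairwise similarity is now below the preset threshold
        clusters[i] = clusters[i] + clusters[j]  # merge the two most similar categories
        del clusters[j]
    return clusters
```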
When determining at least one piece of noise sample data according to the distribution state parameters of the target data features, the number of target samples included in each of the at least one sample data cluster may be determined, and a sample number threshold may be determined from these numbers; the threshold is used to screen noise sample data clusters. A sample data cluster whose target sample number is greater than the sample number threshold is taken as a noise sample data cluster, yielding at least one noise sample data cluster, and the target sample data in each noise sample data cluster is taken as noise sample data.
Obtaining at least one sample data cluster by clustering the at least one piece of target sample data improves both the screening efficiency and the screening precision of noise sample data, and reduces the probability of mistakenly discarding normal sample data.
After a large amount of target sample data is clustered, the clustering result usually follows a long-tailed distribution. The target sample data at the head of the cluster distribution, that is, in the sample data clusters with the largest numbers of target samples, is typically noise sample data; for example, it typically consists of low-quality face sample images and non-face sample images.
The sample number threshold may be determined based on the number of target samples in each sample data cluster. For example, the quartiles of the target sample numbers can be calculated from the target sample number of each sample data cluster, and a sample data cluster whose target sample number is greater than Q3 + 1.5 IQR or Q3 + 3 IQR is taken as a noise sample data cluster.
A quartile, also called a quartile point, is obtained by sorting the target sample numbers of the sample data clusters from small to large and dividing them into four equal parts. The target sample numbers at the three split points, located at the 1/4, 2/4, and 3/4 positions, are the quartiles Q1, Q2, and Q3 respectively, and IQR is the difference between the target sample numbers at the Q3 and Q1 positions, that is, IQR = Q3 - Q1.
For example, the sample number threshold may also be directly set, and a sample data cluster including a target sample number greater than the sample number threshold may be used as the noise sample data cluster.
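A minimal sketch of the quartile-based screening (whether the 1.5 IQR or the 3 IQR multiplier is used is a design choice; the helper name and the default multiplier below are assumptions):

```python
import numpy as np

def select_noise_clusters(clusters, iqr_multiplier: float = 1.5):
    """Return the indices of the sample data clusters treated as noise sample data clusters.

    clusters: list of sample data clusters, each a list of sample indices.
    A cluster is flagged when its target sample number exceeds the sample
    number threshold Q3 + iqr_multiplier * IQR derived from the quartiles of
    the cluster sizes.
    """
    sizes = np.array([len(c) for c in clusters])
    q1, q3 = np.percentile(sizes, [25, 75])
    iqr = q3 - q1
    threshold = q3 + iqr_multiplier * iqr  # sample number threshold
    return [k for k, size in enumerate(sizes) if size > threshold]
```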
When determining the noise-like data features according to the target data features associated with at least one piece of noise sample data, the central data features of each cluster of noise sample data may be determined according to the target data features associated with at least one piece of target sample data in each cluster of noise sample data. The central data features associated with each cluster of noise sample data constitute noise-like data features.
As an example, the central data feature associated with a noise sample data cluster may be obtained by averaging, or weighted-averaging, the target data features associated with the target sample data in that cluster. For example, if a noise sample data cluster contains 100 noise sample images and the feature associated with each noise sample image is a 128-dimensional feature vector, the 100 128-dimensional feature vectors are averaged or weighted-averaged to obtain the central data feature associated with that noise sample data cluster.
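A sketch of this step is shown below (the optional per-sample weights are an assumption about how a weighted average might be supplied):

```python
import numpy as np

def noise_cluster_center(noise_features: np.ndarray, weights=None) -> np.ndarray:
    """Average, or weighted-average, the target data features of one noise sample data cluster.

    noise_features: (M, D) matrix, e.g. 100 noise sample images with a
                    128-dimensional feature vector each.
    weights: optional (M,) weights for a weighted average; None means a plain mean.
    Returns the (D,) central data feature associated with the cluster.
    """
    return np.average(noise_features, axis=0, weights=weights)
```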
For example, target sample data may be randomly selected from the noise sample data cluster, and a noise class tag associated with the noise sample data cluster may be determined according to a data class of the randomly selected target sample data. By determining the noise category label, the noise sample data can be preliminarily screened according to the noise category label, which is beneficial to further improving the screening efficiency of the noise sample data.
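One hedged way to realize this (drawing a small random subset and taking the majority category is an illustrative heuristic; the disclosure only requires randomly selecting target sample data and reading off its data category):

```python
import random

def noise_category_label(noise_cluster, data_categories, sample_size: int = 5):
    """Determine a noise category label for a noise sample data cluster.

    noise_cluster: list of sample indices belonging to the cluster.
    data_categories: mapping from sample index to its data category.
    """
    picked = random.sample(list(noise_cluster), min(sample_size, len(noise_cluster)))
    categories = [data_categories[i] for i in picked]
    return max(set(categories), key=categories.count)  # majority category of the random draw
```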
In summary, the at least one piece of target sample data is clustered according to the target data features associated with each piece of target sample data to obtain at least one sample data cluster; at least one noise sample data cluster is determined according to the number of target samples in each sample data cluster; the central data feature associated with each noise sample data cluster is determined according to the target data features of the target sample data in that cluster; and the central data features associated with the noise sample data clusters constitute at least one noise-like data feature.
Determining whether sample data is noise sample data according to its sample data features and the at least one noise-like data feature enables automatic screening of noise sample data without manually labeling the sample data, which effectively reduces the labor cost of cleaning the sample data. It also speeds up the screening of noise sample data, shortens the screening time, and improves the overall cleaning efficiency. Because the noise data features are determined based on the clustering result, the screening quality of noise sample data is effectively improved, the reliability of the sample data cleaning result is ensured, and the optimization efficiency and training effect of the deep network model are improved.
Fig. 4 schematically shows a schematic diagram of a data processing procedure according to an embodiment of the present disclosure.
As shown in fig. 4, in the data processing process 400, feature extraction is performed on at least one piece of target sample data in the target sample data set 401 to obtain the target data feature associated with each piece of target sample data. The at least one piece of target sample data is then clustered according to these target data features to obtain sample data clusters 402, 403, 404, 405 and 406.
According to the preset sample number threshold and the target sample number in the sample data clusters 402, 403, 404, 405 and 406, the sample data clusters 402, 403 and 404 with the target sample number larger than the sample number threshold are used as noise sample data clusters. According to the target data characteristics of at least one piece of target sample data in the noise sample data clusters 402, 403 and 404, central data characteristics respectively associated with the noise sample data clusters 402, 403 and 404 are determined. The central data features associated with the noise sample data clusters 402, 403, 404 constitute noise-like data features 4021, 4031, 4041, respectively.
When the sample data set 407 is cleaned, feature extraction is performed on at least one piece of sample data in the sample data set 407, so as to obtain sample data features associated with each piece of sample data. According to the sample data features and the noise data features 4021, 4031, 4041 associated with each piece of sample data, it is determined whether the corresponding sample data is noise sample data, so as to implement the division of the sample data set 407 into noise sample data 408 and non-noise sample data 409. Non-noise sample data 409 is used to train the deep network model and noise sample data 408 will be discarded and not used to train the deep network model.
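Putting the pieces together, the cleaning step illustrated in fig. 4 might be sketched as follows (cosine similarity and a fixed threshold are illustrative assumptions rather than requirements of the disclosure):

```python
import numpy as np

def clean_sample_set(sample_features: np.ndarray,
                     noise_class_features: np.ndarray,
                     threshold: float = 0.75):
    """Split a sample data set into noise sample data and non-noise sample data.

    sample_features: (N, D) features extracted from the sample data set 407.
    noise_class_features: (K, D) noise-like data features such as 4021, 4031, 4041.
    Returns (noise_indices, clean_indices); only the non-noise samples go on
    to deep network model training, while the noise samples are discarded.
    """
    s = sample_features / (np.linalg.norm(sample_features, axis=1, keepdims=True) + 1e-12)
    c = noise_class_features / (np.linalg.norm(noise_class_features, axis=1, keepdims=True) + 1e-12)
    max_sim = (s @ c.T).max(axis=1)  # best match against any noise-like data feature
    noise_mask = max_sim >= threshold
    return np.where(noise_mask)[0], np.where(~noise_mask)[0]
```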
This approach requires no manual labeling of the sample data and facilitates automatic cleaning of the sample data, so it is well suited to deep network model training with large training sets. Because the noise data features are determined based on the clustering result, the screening quality of noise sample data is effectively improved and the reliability of the sample data cleaning result is ensured, which makes it possible to use unlabeled sample data flowed back from the production environment for semi-supervised and unsupervised deep network model training and effectively improves the optimization efficiency and training effect of the deep network model.
Fig. 5 schematically shows a block diagram of a data processing device according to an embodiment of the present disclosure.
As shown in fig. 5, the data processing apparatus 500 of the embodiment of the present disclosure includes, for example, a first processing module 510, a second processing module 520, and a third processing module 530.
A first processing module 510, configured to perform feature extraction on sample data to obtain sample data features; the second processing module 520 is configured to determine whether the sample data is noise sample data according to the sample data characteristics and preset noise data characteristics; a third processing module 530, configured to perform deep network model training using the sample data in response to determining that the sample data is non-noise sample data.
According to the embodiments of the present disclosure, feature extraction is performed on sample data to obtain sample data features, whether the sample data is noise sample data is determined according to the sample data features and preset noise data features, and deep network model training is performed using the sample data in response to determining that the sample data is non-noise sample data.
In this way, noise sample data can be screened out of the sample data set quickly and accurately according to the predetermined noise data features, without additionally labeling the sample data manually or additionally training a neural network model. This effectively reduces the cost and operational complexity of noise sample screening, improves the screening efficiency of noise sample data, and improves the optimization efficiency and training effect of the deep network model.
According to an embodiment of the present disclosure, the apparatus further includes a fourth processing module, configured to determine a noise-like data characteristic, where the fourth processing module includes: the first processing submodule is used for extracting the characteristics of at least one piece of target sample data to obtain target data characteristics associated with each piece of target sample data in at least one piece of target sample data; the second processing submodule is used for determining the distribution state parameters of the target data characteristics associated with each piece of target sample data in the feature space; the third processing submodule is used for determining at least one piece of noise sample data according to the distribution state parameter of the target data characteristic associated with each piece of target sample data; and a fourth processing submodule, configured to determine a noise-like data feature according to a target data feature associated with at least one piece of noise sample data.
According to an embodiment of the present disclosure, the second processing submodule includes: the first processing unit is used for clustering at least one piece of target sample data according to the target data characteristics associated with each piece of target sample data to obtain at least one sample data cluster; and the second processing unit is used for forming the distribution state parameter of the target data characteristic associated with the corresponding target sample data in the sample data cluster to which each piece of target sample data belongs.
According to an embodiment of the present disclosure, the third processing submodule includes: the third processing unit is used for determining the number of target samples included in each sample data cluster in at least one sample data cluster; the fourth processing unit is used for determining a sample quantity threshold according to the target sample quantity included in each sample data cluster, and the sample quantity threshold is used for screening the noise sample data cluster; the fifth processing unit is used for taking the corresponding sample data cluster with the target sample number larger than the sample number threshold as a noise sample data cluster to obtain at least one noise sample data cluster; and a sixth processing unit, configured to use at least one target sample data in each noise sample data cluster as the noise sample data.
According to an embodiment of the present disclosure, the fourth processing submodule includes: the seventh processing unit is used for determining the central data characteristic of each noise sample data cluster according to the target data characteristic associated with at least one piece of target sample data in each noise sample data cluster; and an eighth processing unit, configured to construct noise-like data features from the central data features associated with each cluster of noise sample data.
According to the embodiment of the disclosure, the device further comprises a fifth processing module, configured to randomly select target sample data from the noise sample data cluster; and determining a noise category label associated with the noise sample data cluster according to the data category of the randomly selected target sample data.
According to the embodiment of the disclosure, the target sample data comprises sample data with an evaluation index value which does not meet a preset condition, and the evaluation index value is obtained based on at least one of the following operations: sample data quality evaluation, target object alignment, target object detection and target object identification.
According to an embodiment of the present disclosure, the second processing module includes: the fifth processing submodule is used for determining the sample data as the noise sample data under the condition that the characteristics of the sample data and the characteristics of the noise data meet the preset similarity condition; the device also includes: and the sixth processing module is used for responding to the fact that the sample data is determined to be the noise sample data, and updating the noise data characteristics by using the sample data characteristics.
It should be noted that, in the technical solutions of the present disclosure, the collection, storage, use, processing, transmission, provision, and disclosure of the information involved all comply with the relevant laws and regulations and do not violate public order and good customs.
The present disclosure also provides an electronic device, a readable storage medium, and a computer program product according to embodiments of the present disclosure.
Fig. 6 schematically shows a block diagram of an electronic device for performing a data processing method according to an embodiment of the present disclosure.
FIG. 6 illustrates a schematic block diagram of an example electronic device 600 that can be used to implement embodiments of the present disclosure. The electronic device 600 is intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital processing devices, cellular phones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions are meant to be examples only, and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in fig. 6, the apparatus 600 includes a computing unit 601, which can perform various appropriate actions and processes according to a computer program stored in a Read Only Memory (ROM)602 or a computer program loaded from a storage unit 608 into a Random Access Memory (RAM) 603. In the RAM 603, various programs and data required for the operation of the device 600 can also be stored. The calculation unit 601, the ROM 602, and the RAM 603 are connected to each other via a bus 604. An input/output (I/O) interface 605 is also connected to bus 604.
A number of components in the device 600 are connected to the I/O interface 605, including: an input unit 606 such as a keyboard, a mouse, or the like; an output unit 607 such as various types of displays, speakers, and the like; a storage unit 608, such as a magnetic disk, optical disk, or the like; and a communication unit 609 such as a network card, modem, wireless communication transceiver, etc. The communication unit 609 allows the device 600 to exchange information/data with other devices via a computer network such as the internet and/or various telecommunication networks.
The computing unit 601 may be a variety of general and/or special purpose processing components having processing and computing capabilities. Some examples of the computing unit 601 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various dedicated Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, and so forth. The calculation unit 601 performs the respective methods and processes described above, such as a data processing method. For example, in some embodiments, the data processing method may be implemented as a computer software program tangibly embodied in a machine-readable medium, such as storage unit 608. In some embodiments, part or all of the computer program may be loaded and/or installed onto the device 600 via the ROM 602 and/or the communication unit 609. When the computer program is loaded into the RAM 603 and executed by the computing unit 601, one or more steps of the data processing method described above may be performed. Alternatively, in other embodiments, the computing unit 601 may be configured to perform the data processing method by any other suitable means (e.g. by means of firmware).
Various implementations of the systems and techniques described here above may be implemented in digital electronic circuitry, integrated circuitry, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), system on a chip (SOCs), Complex Programmable Logic Devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: implemented in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, receiving data and instructions from, and transmitting data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for implementing the methods of the present disclosure may be written in any combination of one or more programming languages. These program codes may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the program codes, when executed by the processor or controller, cause the functions/operations specified in the flowchart and/or block diagram to be performed. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with an object, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to an object; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which objects can provide input to the computer. Other kinds of devices may also be used to provide for interaction with an object; for example, feedback provided to the subject can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the object may be received in any form, including acoustic, speech, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., an object computer having a graphical object interface or a web browser through which objects can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), Wide Area Networks (WANs), and the Internet.
The computer system may include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server may be a cloud server, a server of a distributed system, or a server with a combined blockchain.
It should be understood that various forms of the flows shown above may be used, with steps reordered, added, or deleted. For example, the steps described in the present disclosure may be executed in parallel or sequentially or in different orders, and are not limited herein as long as the desired results of the technical solutions disclosed in the present disclosure can be achieved.
The above detailed description should not be construed as limiting the scope of the disclosure. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made, depending on design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present disclosure should be included in the scope of protection of the present disclosure.

Claims (19)

1. A method of data processing, comprising:
performing feature extraction on the sample data to obtain sample data features;
determining whether the sample data is noise sample data or not according to the sample data characteristics and preset noise data characteristics; and
in response to determining that the sample data is non-noise sample data, performing deep network model training using the sample data.
2. The method of claim 1, wherein determining the noise-like data characteristic comprises:
performing feature extraction on at least one piece of target sample data to obtain target data features associated with each piece of target sample data in the at least one piece of target sample data;
determining a distribution state parameter of a target data characteristic associated with each piece of target sample data in a characteristic space;
determining at least one piece of noise sample data according to the distribution state parameter of the target data characteristic associated with each piece of target sample data; and
determining the noise-like data characteristics according to the target data characteristics associated with the at least one piece of noise sample data.
3. The method of claim 2, wherein said determining a distribution state parameter of a target data feature associated with said each piece of target sample data in a feature space comprises:
clustering the at least one piece of target sample data according to the target data characteristics associated with each piece of target sample data to obtain at least one sample data cluster; and
wherein the sample data cluster to which each piece of target sample data belongs forms a distribution state parameter of the target data characteristic associated with the corresponding target sample data.
4. The method of claim 3, wherein said determining at least one piece of noise sample data from distribution state parameters of target data features associated with said each piece of target sample data comprises:
determining the number of target samples included in each sample data cluster in the at least one sample data cluster;
determining a sample number threshold according to the target sample number included in each sample data cluster, wherein the sample number threshold is used for screening the noise sample data clusters;
taking the corresponding sample data cluster with the target sample number larger than the sample number threshold value as a noise sample data cluster to obtain at least one noise sample data cluster; and
taking at least one piece of target sample data in each noise sample data cluster as noise sample data.
5. The method of claim 4, wherein said determining the noise-like data characteristic from a target data characteristic associated with the at least one noise sample data comprises:
determining the central data characteristic of each noise sample data cluster according to the target data characteristic associated with at least one piece of target sample data in each noise sample data cluster; and
wherein the central data characteristic associated with each noise sample data cluster forms the noise-like data characteristic.
6. The method of claim 4, further comprising:
randomly selecting target sample data in the noise sample data cluster; and
determining a noise category label associated with the noise sample data cluster according to the data category of the randomly selected target sample data.
7. The method of any one of claims 2 to 6,
the target sample data comprises sample data with an evaluation index value not meeting a preset condition, wherein the evaluation index value is obtained based on at least one of the following operations:
sample data quality evaluation, target object alignment, target object detection and target object identification.
8. The method of claim 1, wherein,
determining whether the sample data is noise sample data according to the sample data characteristics and preset noise data characteristics, wherein the step of determining whether the sample data is noise sample data comprises the following steps:
under the condition that the sample data characteristics and the noise data characteristics meet a preset similarity condition, determining the sample data as noise sample data;
the method further comprises the following steps:
in response to determining that the sample data is noise sample data, updating the noise-like data characteristics with the sample data characteristics.
9. A data processing apparatus comprising:
the first processing module is used for extracting the characteristics of the sample data to obtain the characteristics of the sample data;
the second processing module is used for determining whether the sample data is noise sample data or not according to the sample data characteristics and preset noise data characteristics; and
the third processing module is used for performing deep network model training by using the sample data in response to determining that the sample data is non-noise sample data.
10. The apparatus of claim 9, wherein the apparatus further comprises a fourth processing module for determining the noise-like data characteristic, the fourth processing module comprising:
the first processing submodule is used for extracting the characteristics of at least one piece of target sample data to obtain target data characteristics associated with each piece of target sample data in the at least one piece of target sample data;
the second processing submodule is used for determining the distribution state parameters of the target data characteristics associated with each piece of target sample data in the feature space;
a third processing sub-module, configured to determine at least one piece of noise sample data according to a distribution state parameter of a target data feature associated with each piece of target sample data; and
the fourth processing submodule is used for determining the noise-like data characteristics according to the target data characteristics associated with the at least one piece of noise sample data.
11. The apparatus of claim 10, wherein the second processing submodule comprises:
the first processing unit is used for clustering the at least one piece of target sample data according to the target data characteristics associated with each piece of target sample data to obtain at least one sample data cluster; and
the second processing unit is used for forming a distribution state parameter of the target data characteristic associated with the corresponding target sample data from the sample data cluster to which each piece of target sample data belongs.
12. The apparatus of claim 11, wherein the third processing sub-module comprises:
a third processing unit, configured to determine a number of target samples included in each sample data cluster in the at least one sample data cluster;
a fourth processing unit, configured to determine a sample number threshold according to a target sample number included in each sample data cluster, where the sample number threshold is used to screen a noise sample data cluster;
a fifth processing unit, configured to use the corresponding sample data cluster whose target sample number is greater than the sample number threshold as a noise sample data cluster, to obtain at least one noise sample data cluster; and
the sixth processing unit is used for taking at least one piece of target sample data in each noise sample data cluster as the noise sample data.
13. The apparatus of claim 12, wherein the fourth processing submodule comprises:
a seventh processing unit, configured to determine, according to a target data feature associated with at least one piece of target sample data in each noise sample data cluster, a central data feature of each noise sample data cluster; and
the eighth processing unit is used for forming the noise-like data characteristics by using the central data characteristics associated with each noise sample data cluster.
14. The apparatus of claim 12, further comprising a fifth processing module configured to:
randomly select target sample data from the noise sample data cluster; and
determine a noise category label associated with the noise sample data cluster according to the data category of the randomly selected target sample data.
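A sketch of the fifth processing module of claim 14; `categories` is a hypothetical per-sample category array supplied by the caller, not part of the claim.

```python
# Claim 14 sketch: label a noise sample data cluster with the data category of one
# randomly selected member. `categories` is a hypothetical per-sample label array.
import numpy as np

def noise_category_label(labels: np.ndarray, categories: np.ndarray, cluster_id: int, rng=None):
    rng = rng or np.random.default_rng()
    member_idx = np.flatnonzero(labels == cluster_id)   # pieces of target sample data in this cluster
    picked = rng.choice(member_idx)                     # random selection, as recited in the claim
    return categories[picked]                           # its data category becomes the cluster's noise label
```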
15. The apparatus of any one of claims 10 to 14, wherein the target sample data comprises sample data whose evaluation index value does not meet a preset condition, the evaluation index value being obtained based on at least one of the following operations:
sample data quality evaluation, target object alignment, target object detection, and target object recognition.
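A sketch of how target sample data might be selected per claim 15. The scoring callables (standing in for quality evaluation, alignment, detection and recognition scores), the use of the worst score as the evaluation index value, and the 0.5 cut-off are all assumptions.

```python
# Claim 15 sketch: keep, as target sample data, the samples whose evaluation index value
# fails the preset condition. score_fns and the 0.5 cut-off are hypothetical.
def select_target_samples(samples, score_fns, min_score: float = 0.5):
    """score_fns: callables mapping a sample to a score in [0, 1] (e.g. quality, detection confidence)."""
    target = []
    for sample in samples:
        evaluation_index = min(fn(sample) for fn in score_fns)   # worst score across the listed operations
        if evaluation_index < min_score:                         # does NOT meet the preset condition
            target.append(sample)
    return target
```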
16. The apparatus of claim 9, wherein the second processing module comprises:
a fifth processing submodule, configured to determine that the sample data is noise sample data in a case where the sample data features and the noise data features meet a preset similarity condition;
the apparatus further comprising:
a sixth processing module, configured to, in response to determining that the sample data is noise sample data, update the noise data features using the sample data features.
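A sketch of claim 16: a sample is flagged as noise when its feature is sufficiently similar to any preset noise data feature, and the matched noise data feature is then updated with the sample data feature. Cosine similarity, the 0.8 threshold, and the running-mean update rule are assumptions.

```python
# Claim 16 sketch: similarity test against the noise data features, followed by an
# update of the matched feature. Threshold and update rule are assumptions.
import numpy as np

def check_and_update(sample_feat: np.ndarray, noise_feats: np.ndarray, threshold: float = 0.8):
    if noise_feats.size == 0:
        return False, noise_feats
    sims = noise_feats @ sample_feat / (
        np.linalg.norm(noise_feats, axis=1) * np.linalg.norm(sample_feat) + 1e-12)
    best = int(np.argmax(sims))
    if sims[best] >= threshold:                                   # preset similarity condition met
        noise_feats[best] = 0.5 * noise_feats[best] + 0.5 * sample_feat   # update noise data features
        return True, noise_feats                                  # sample is noise sample data
    return False, noise_feats
```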
17. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-8.
18. A non-transitory computer readable storage medium having stored thereon computer instructions for causing a computer to perform the method of any of claims 1-8.
19. A computer program product comprising a computer program which, when executed by a processor, implements a method according to any one of claims 1 to 8.
CN202210097295.3A 2022-01-26 2022-01-26 Data processing method and device, equipment, medium and product Pending CN114429581A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210097295.3A CN114429581A (en) 2022-01-26 2022-01-26 Data processing method and device, equipment, medium and product


Publications (1)

Publication Number Publication Date
CN114429581A true CN114429581A (en) 2022-05-03

Family

ID=81313660

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210097295.3A Pending CN114429581A (en) 2022-01-26 2022-01-26 Data processing method and device, equipment, medium and product

Country Status (1)

Country Link
CN (1) CN114429581A (en)

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2021068563A1 (en) * 2019-10-11 2021-04-15 平安科技(深圳)有限公司 Sample date processing method, device and computer equipment, and storage medium
US20210133602A1 (en) * 2019-11-04 2021-05-06 International Business Machines Corporation Classifier training using noisy samples
CN111414946A (en) * 2020-03-12 2020-07-14 腾讯科技(深圳)有限公司 Artificial intelligence-based medical image noise data identification method and related device
CN111414952A (en) * 2020-03-17 2020-07-14 腾讯科技(深圳)有限公司 Noise sample identification method, device, equipment and storage medium for pedestrian re-identification
CN112732690A (en) * 2021-01-05 2021-04-30 山东福来克思智能科技有限公司 Stabilizing system and method for chronic disease detection and risk assessment
CN113705598A (en) * 2021-03-04 2021-11-26 腾讯科技(北京)有限公司 Data classification method and device and electronic equipment
KR20210124111A (en) * 2021-03-25 2021-10-14 베이징 바이두 넷컴 사이언스 테크놀로지 컴퍼니 리미티드 Method and apparatus for training model, device, medium and program product
CN113361603A (en) * 2021-06-04 2021-09-07 北京百度网讯科技有限公司 Training method, class recognition device, electronic device and storage medium
CN113627361A (en) * 2021-08-13 2021-11-09 北京百度网讯科技有限公司 Training method and device for face recognition model and computer program product

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
朱俊勇; 逯峰: "A cross-domain face transfer learning method based on sparse subspace clustering", Journal of Sun Yat-sen University (Natural Science Edition), no. 05, 15 September 2016 (2016-09-15) *
李湘东; 巴志超; 黄莉: "A noise processing method based on category data distribution characteristics for text classification", New Technology of Library and Information Service, no. 11, 25 November 2014 (2014-11-25) *
王溢琴: "Processing noise data with an improved fuzzy clustering PCA algorithm", Computer Development & Applications, no. 03, 5 March 2009 (2009-03-05) *

Similar Documents

Publication Publication Date Title
CN113705425B (en) Training method of living body detection model, and method, device and equipment for living body detection
CN112559800B (en) Method, apparatus, electronic device, medium and product for processing video
CN112633276B (en) Training method, recognition method, device, equipment and medium
CN112949710A (en) Image clustering method and device
CN112949767A (en) Sample image increment, image detection model training and image detection method
CN113221768A (en) Recognition model training method, recognition method, device, equipment and storage medium
CN112800919A (en) Method, device and equipment for detecting target type video and storage medium
CN114882315B (en) Sample generation method, model training method, device, equipment and medium
CN113191261A (en) Image category identification method and device and electronic equipment
CN113963197A (en) Image recognition method and device, electronic equipment and readable storage medium
CN113963011A (en) Image recognition method and device, electronic equipment and storage medium
CN113989300A (en) Lane line segmentation method and device, electronic equipment and storage medium
CN113887630A (en) Image classification method and device, electronic equipment and storage medium
CN115457329B (en) Training method of image classification model, image classification method and device
CN114973333B (en) Character interaction detection method, device, equipment and storage medium
CN113361455B (en) Training method of face counterfeit identification model, related device and computer program product
CN114429581A (en) Data processing method and device, equipment, medium and product
CN116363444A (en) Fuzzy classification model training method, fuzzy image recognition method and device
CN115631370A (en) Identification method and device of MRI (magnetic resonance imaging) sequence category based on convolutional neural network
CN114882334A (en) Method for generating pre-training model, model training method and device
CN113139483A (en) Human behavior recognition method, apparatus, device, storage medium, and program product
CN113947195A (en) Model determination method and device, electronic equipment and memory
CN115809687A (en) Training method and device for image processing network
CN115146725B (en) Method for determining object classification mode, object classification method, device and equipment
CN114155589B (en) Image processing method, device, equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination