CN110472082A - Data processing method, device, storage medium and electronic equipment - Google Patents

Publication number
CN110472082A
Authority
CN
China
Prior art keywords
data
cluster
model
clusters
processing
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910713732.8A
Other languages
Chinese (zh)
Other versions
CN110472082B (en)
Inventor
罗彤
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Jinsheng Communication Technology Co Ltd
Guangdong Oppo Mobile Telecommunications Corp Ltd
Original Assignee
Shanghai Jinsheng Communication Technology Co Ltd
Guangdong Oppo Mobile Telecommunications Corp Ltd
Application filed by Shanghai Jinsheng Communication Technology Co Ltd and Guangdong Oppo Mobile Telecommunications Corp Ltd
Priority to CN201910713732.8A
Publication of CN110472082A
Application granted
Publication of CN110472082B
Current legal status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/50 Information retrieval; Database structures therefor; File system structures therefor of still image data
    • G06F16/55 Clustering; Classification
    • G06F16/58 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/583 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • G06F16/5866 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using information manually generated, e.g. tags, keywords, comments, manually generated location and time information

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Library & Information Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Image Analysis (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

This application discloses a data processing method, apparatus, storage medium, and electronic device. The data processing method includes: obtaining multiple pieces of data that carry the same category label; extracting a feature from each piece of data to obtain multiple data features; clustering the data features to obtain a clustering result; determining, according to the clustering result, a first data cluster and a second data cluster, where the first data cluster is the cluster containing data that does not need cleaning and the second data cluster is the cluster containing data that needs cleaning; performing data cleaning on the data in the second data cluster to obtain processed data; and obtaining target data from the data in the first data cluster and the processed data. The application can improve the efficiency of data cleaning.

Description

Data processing method, device, storage medium and electronic equipment
Technical field
The application belongs to the field of data technology, and in particular relates to a data processing method, apparatus, storage medium, and electronic device.
Background
Data cleaning is the process of re-examining and verifying data in order to delete erroneous information from a dataset. Taking the cleaning of classified pictures as an example, the main task is to check whether each picture's classification label is correct and to delete the pictures whose labels are wrong. In the related art, however, data cleaning is inefficient.
Summary of the invention
The embodiments of the application provide a data processing method, apparatus, storage medium, and electronic device that can improve the efficiency of data cleaning.
An embodiment of the application provides a data processing method, comprising:
obtaining multiple pieces of data, the multiple pieces of data carrying the same category label;
extracting a feature from each piece of data to obtain multiple data features;
clustering the multiple data features to obtain a clustering result;
determining, according to the clustering result, a first data cluster and a second data cluster, where the first data cluster is the cluster containing data that does not need cleaning and the second data cluster is the cluster containing data that needs cleaning;
performing data cleaning on the data in the second data cluster to obtain processed data; and
obtaining target data from the data in the first data cluster and the processed data.
An embodiment of the application provides a data processing apparatus, comprising:
an obtaining module, configured to obtain multiple pieces of data, the multiple pieces of data carrying the same category label;
an extraction module, configured to extract a feature from each piece of data to obtain multiple data features;
a clustering module, configured to cluster the multiple data features to obtain a clustering result;
a determining module, configured to determine, according to the clustering result, a first data cluster and a second data cluster, where the first data cluster is the cluster containing data that does not need cleaning and the second data cluster is the cluster containing data that needs cleaning;
a first processing module, configured to perform data cleaning on the data in the second data cluster to obtain processed data; and
a second processing module, configured to obtain target data from the data in the first data cluster and the processed data.
An embodiment of the application provides a storage medium storing a computer program which, when executed on a computer, causes the computer to execute the flow of the data processing method provided by the embodiments of the application.
An embodiment of the application further provides an electronic device comprising a memory and a processor, where the processor executes the flow of the data processing method provided by the embodiments of the application by calling a computer program stored in the memory.
In the present embodiment, the electronic device can use clustering to carry out data cleaning. Clustering quickly identifies the data whose category labels may be wrong, and the electronic device then performs data cleaning only on that part of the data, so clean data can be obtained quickly. Compared with the related-art approach of manually browsing the data one by one to check whether each label is wrong, the present embodiment can improve the efficiency of data cleaning.
Brief description of the drawings
The technical solution of the application and its beneficial effects will become apparent from the following detailed description of specific embodiments, taken together with the accompanying drawings.
Fig. 1 is a schematic flowchart of the data processing method provided by an embodiment of the application.
Fig. 2 is another schematic flowchart of the data processing method provided by an embodiment of the application.
Fig. 3 is a schematic diagram of the hierarchical clustering diagram provided by an embodiment of the application.
Fig. 4 is a schematic structural diagram of a feature extraction model formed by fusing multiple models, provided by an embodiment of the application.
Fig. 5 to Fig. 10 are schematic diagrams of scenarios of the data processing method provided by an embodiment of the application.
Fig. 11 is a schematic structural diagram of the data processing apparatus provided by an embodiment of the application.
Fig. 12 is a schematic structural diagram of the electronic device provided by an embodiment of the application.
Fig. 13 is another schematic structural diagram of the electronic device provided by an embodiment of the application.
Detailed description of embodiments
Please refer to the drawings, in which identical reference symbols denote identical components. The principle of the application is illustrated as implemented in a suitable computing environment. The following description is based on the illustrated specific embodiments of the application and should not be regarded as limiting other specific embodiments not detailed herein.
It can be understood that the embodiments of the application may be executed by an electronic device such as a smartphone or a tablet computer.
Referring to Fig. 1, Fig. 1 is a schematic flowchart of the data processing method provided by an embodiment of the application. The flow may include:
101. Obtain multiple pieces of data, the multiple pieces of data carrying the same category label.
Data cleaning is the process of re-examining and verifying data in order to delete erroneous information from a dataset. Taking the cleaning of classified pictures as an example, the related art mainly relies on manual inspection: a person checks whether each picture's classification label is correct and deletes the pictures whose labels are wrong. As a result, data cleaning in the related art is inefficient.
In 101 of the embodiment of the application, the electronic device may first obtain multiple pieces of data that carry the same category label. It can be understood that these are the data that need to be cleaned. For example, the electronic device may obtain a dataset that needs data cleaning.
For example, the data that need cleaning form a picture set, and the pictures in the set all have the same category label, such as a "flowers and plants" category.
102. Extract a feature from each piece of data to obtain multiple data features.
For example, after obtaining the data that need cleaning, the electronic device may extract a feature from each piece of data to obtain multiple data features.
For example, the electronic device obtains a picture set P that needs cleaning, where each picture in P is denoted Pi and i is an integer greater than or equal to 1. The electronic device may then extract the feature of each picture Pi in P to obtain a feature Fi corresponding to that picture, that is, Fi is the feature of picture Pi.
103. Cluster the multiple data features to obtain a clustering result.
104. Determine, according to the clustering result, a first data cluster and a second data cluster, where the first data cluster is the cluster containing data that does not need cleaning and the second data cluster is the cluster containing data that needs cleaning.
For example, after obtaining the data features corresponding to the multiple pieces of data, the electronic device may cluster those features to obtain a clustering result and then determine the first data cluster and the second data cluster from that result. In other words, the electronic device clusters the data (samples) on the basis of their features. The second data cluster may comprise at least one cluster.
It should be noted that clustering is the process of dividing a set of physical or abstract objects into multiple classes composed of similar objects. A cluster produced by clustering is a set of data objects that are similar to one another within the same cluster and different from the objects in other clusters.
For example, the picture set P contains 1000 pictures, P1, P2, P3, ..., P1000, whose data features are F1, F2, F3, ..., F1000 respectively. The electronic device may then cluster the features F1, F2, F3, ..., F1000 to obtain the corresponding clustering result.
For example, after obtaining the clustering result, the electronic device may determine the first data cluster and the second data cluster from it. The first data cluster is the cluster containing pictures that do not need cleaning, and the second data cluster is the cluster containing pictures that need cleaning. That is, the category labels of the pictures in the first data cluster are judged to be correct, while the category labels of the pictures in the second data cluster are judged to be possibly wrong.
It should be noted that the category labels of the data in the second data cluster are only suspected of being wrong: a given label in the second data cluster may actually be wrong, or it may actually be correct.
105. Perform data cleaning on the data in the second data cluster to obtain processed data.
106. Obtain target data from the data in the first data cluster and the processed data.
For example, after determining the first data cluster and the second data cluster, the electronic device may perform data cleaning on the data in the second data cluster to obtain processed data. The electronic device may then obtain target data from the data in the first data cluster and the processed data. It can be understood that the target data is the clean data obtained after data cleaning.
For example, the second data cluster contains 5 pictures: P7, P21, P81, P200, and P751. The electronic device may perform data cleaning on these 5 pictures. Suppose it determines that the category labels of P7, P81, and P200 are correct and those of P21 and P751 are wrong. The electronic device may then delete P21 and P751, leaving the processed data, namely pictures P7, P81, and P200.
The electronic device may then merge the pictures in the first data cluster with P7, P81, and P200 to obtain the cleaned picture set, that is, a picture set whose category labels are correct.
It can be understood that, in the present embodiment, the electronic device can use clustering to carry out data cleaning. Clustering quickly identifies the data whose category labels may be wrong, and the electronic device then performs data cleaning only on that part of the data, so clean data can be obtained quickly. Compared with the related-art approach of manually browsing the data one by one to check whether each label is wrong, the present embodiment can improve the efficiency of data cleaning.
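The flow 101 to 106 can be sketched as a small pipeline. This is only an illustrative sketch, not the implementation described in the application: the function name `clean_dataset` and its callable parameters (`extract_feature`, `cluster_features`, `pick_clusters`, `is_mislabeled`) are hypothetical placeholders for the feature extraction, clustering, cluster selection, and label checking steps.

```python
def clean_dataset(items, extract_feature, cluster_features, pick_clusters, is_mislabeled):
    """Hypothetical end-to-end sketch of steps 101 to 106.

    cluster_features returns clusters as lists of indices into `items`;
    pick_clusters returns (first_cluster, list_of_second_clusters).
    """
    # 102: one feature per item (101, obtaining the items, is the caller's job)
    features = [extract_feature(item) for item in items]
    # 103: cluster the features
    clusters = cluster_features(features)
    # 104: the cluster kept as-is and the clusters that need cleaning
    first, second_clusters = pick_clusters(clusters)
    # 105: drop the items whose label is judged wrong
    processed = [items[i] for cluster in second_clusters for i in cluster
                 if not is_mislabeled(items[i])]
    # 106: merge the untouched cluster with the processed data
    return [items[i] for i in first] + processed
```

Only the items in the second clusters are passed to `is_mislabeled`, which is where the efficiency gain over checking every item comes from.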
Referring to Fig. 2, Fig. 2 is another schematic flowchart of the data processing method provided by an embodiment of the application. The flow may include:
In 201, the electronic device obtains multiple pieces of data, the multiple pieces of data carrying the same category label.
For example, the electronic device may obtain a picture set P that needs data cleaning, where P contains 1000 pictures that carry the same category label. For example, the 1000 pictures all have the same manually assigned "flowers and plants" category label. The 1000 pictures in P are P1, P2, P3, ..., P1000, that is, P = {P1, P2, P3, ..., P1000}.
In 202, the electronic device performs feature extraction on each piece of data using a preset feature extraction model to obtain multiple data features.
For example, after obtaining the picture set P, the electronic device may use the preset feature extraction model to extract a feature from each picture in P, obtaining multiple picture features.
For example, the electronic device may extract the feature of each picture Pi in P to obtain the corresponding feature Fi, where Fi is the feature of picture Pi and i is an integer greater than or equal to 1. The picture features Fi form a feature set F = {F1, F2, F3, ..., F1000}, where F1 is the feature of picture P1, F2 is the feature of picture P2, F3 is the feature of picture P3, and so on.
In one embodiment, the electronic device may obtain the preset feature extraction model as follows:
when the multiple pieces of data are pictures, the electronic device obtains a first model, the first model being a ResNet model trained on ImageNet;
the electronic device trains the ResNet model further using the multiple pieces of data to obtain a second model;
the electronic device removes the fully connected layer that forms the last layer of the second model to obtain a third model, and takes the third model as the preset feature extraction model.
For example, when the data that need cleaning are pictures, the electronic device may first obtain the first model, which is a ResNet model trained on ImageNet.
It should be noted that the ImageNet project is a large visual database for research on visual object recognition software. More than 14 million image URLs have been manually annotated by ImageNet to indicate the objects in the pictures. Since 2010, the ImageNet project has held an annual software contest, the ImageNet Large Scale Visual Recognition Challenge (ILSVRC), in which programs compete to correctly classify and detect objects and scenes.
ResNet (Residual Neural Network) successfully trained a 152-layer neural network by using residual units and won the ILSVRC 2015 competition. The ResNet structure greatly speeds up the training of neural networks and also considerably improves model accuracy.
That is, ImageNet is an open, free, large-scale picture database containing roughly 22,000 categories of pictures, and ResNet is a picture classification model trained on ImageNet data.
For example, after obtaining the ResNet model, the electronic device may first train it further on the pictures that need cleaning to obtain the second model. The electronic device may then remove the fully connected layer that forms the last layer of the second model to obtain the third model, and take the third model as the preset feature extraction model.
It should be noted that the last layer of the ResNet model is a fully connected layer whose role is to classify pictures, while the role of the other layers is to extract features. The network that remains after removing the last fully connected layer of the second model can therefore serve as a feature extraction model.
The ResNet model is trained again on the pictures that need cleaning because ResNet is a fairly general classification model. The additional training makes the second model more targeted at the categories of the pictures that need cleaning, so that the third model extracts features from those pictures more accurately.
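The idea of turning a classifier into a feature extractor by dropping its final classification layer can be illustrated with a toy stand-in. This sketch assumes a model is simply a list of layer functions applied in order. `make_feature_extractor` and the toy layers are hypothetical, and a real implementation would use a deep-learning framework and a fine-tuned ResNet rather than these lambdas.

```python
def make_feature_extractor(layers):
    """Return a function that applies every layer except the last one.

    `layers` is a list of callables; the last one stands in for the
    fully connected classification layer that is removed.
    """
    if len(layers) < 2:
        raise ValueError("need at least one feature layer and one classifier layer")
    backbone = layers[:-1]  # drop the final classification layer

    def extract(x):
        for layer in backbone:
            x = layer(x)
        return x

    return extract


# Toy "second model": two feature layers followed by a classifier layer.
toy_model = [
    lambda v: [2.0 * a for a in v],       # stand-in for a learned transform
    lambda v: [max(0.0, a) for a in v],   # ReLU-like non-linearity
    lambda v: [sum(v)],                   # stand-in for the final FC layer
]
extract = make_feature_extractor(toy_model)
```

Calling `extract` on an input returns the activations that would have been fed to the classifier, which is what the clustering step consumes as the data feature.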
In 203, the electronic device clusters the multiple data features using a hierarchical clustering algorithm to obtain a clustering result, where the Minkowski distance is used to measure the distance between samples, and the mean of the pairwise distances between samples belonging to two different classes is taken as the inter-class distance of the two classes.
For example, after extracting the feature of each picture to obtain the feature set F, the electronic device may cluster the features in F using a hierarchical clustering algorithm to obtain a clustering result. That is, the electronic device clusters the data (pictures) on the basis of their features. During hierarchical clustering, the electronic device measures the distance between samples with the Minkowski distance and takes the mean of the pairwise distances between samples belonging to different classes as the inter-class distance of the two classes.
It should be noted that the Minkowski distance is a metric on Euclidean space. For two points P = (x1, x2, ..., xn) and Q = (y1, y2, ..., yn), the Minkowski distance between them is

$$d(P, Q) = \left( \sum_{k=1}^{n} |x_k - y_k|^{p} \right)^{1/p}$$

where p is a positive integer. During hierarchical clustering, the mean of the pairwise distances between samples belonging to two different classes is taken as the inter-class distance of the two classes, that is,

$$d_{avg}(C_i, C_j) = \frac{1}{|C_i|\,|C_j|} \sum_{x \in C_i} \sum_{z \in C_j} \mathrm{dist}(x, z)$$

where d_avg denotes the inter-class distance, Ci denotes one class, Cj denotes another class, |Ci| is the number of samples in Ci, |Cj| is the number of samples in Cj, and dist(x, z) is the Minkowski distance. That is, when deciding during hierarchical clustering whether two classes can be merged into one, the electronic device may compute the Minkowski distance between every sample in Ci and every sample in Cj to obtain multiple distance values, and take their mean as the inter-class distance of the two classes. If the distance between two classes is smaller than the distance between any other two classes, the two classes are merged into one.
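The sample distance and the mean inter-class distance described above can be sketched as two small functions. The names `minkowski_distance` and `average_interclass_distance` and the default order `p=2` are illustrative choices, not taken from the application.

```python
from itertools import product


def minkowski_distance(x, z, p=2):
    """Minkowski distance of order p between two equal-length feature vectors."""
    if len(x) != len(z):
        raise ValueError("feature vectors must have the same length")
    return sum(abs(a - b) ** p for a, b in zip(x, z)) ** (1.0 / p)


def average_interclass_distance(class_i, class_j, p=2):
    """Mean of all pairwise distances between the samples of two classes."""
    total = sum(minkowski_distance(x, z, p) for x, z in product(class_i, class_j))
    return total / (len(class_i) * len(class_j))
```

With `p=1` this is the Manhattan distance and with `p=2` the Euclidean distance. The inter-class distance averages over all |Ci| x |Cj| sample pairs, so its cost grows with the product of the class sizes.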
It should be noted that hierarchical clustering is a clustering algorithm that partitions a dataset at different levels to form a tree-shaped cluster structure. The dataset can be partitioned with a bottom-up aggregation strategy or with a top-down splitting strategy. One advantage of hierarchical clustering is that a dendrogram can be drawn to explain the clustering result visually. Another advantage is that the number of clusters does not need to be specified in advance.
In 204, the electronic device obtains a segmentation threshold and determines the first data cluster and the second data cluster according to the clustering result and the segmentation threshold. The clustering result is a hierarchical clustering diagram, and the segmentation threshold is used to select the required clusters from that diagram. The number of features in the first data cluster is denoted the first quantity, the number of features in the second data cluster is denoted the second quantity, and the difference between the first quantity and the second quantity is greater than a first threshold. The first data cluster is the cluster containing data that does not need cleaning, and the second data cluster is the cluster containing data that needs cleaning.
For example, after clustering the feature set F with the hierarchical clustering algorithm, the electronic device may determine the first data cluster and the second data cluster from the clustering result. Because the present embodiment uses a hierarchical clustering algorithm, the clustering result is a hierarchical clustering diagram (a dendrogram), which may be as shown in Fig. 3.
After obtaining the hierarchical clustering diagram, the electronic device may obtain a segmentation threshold, which is a numerical value used to select the required clusters from the diagram. For example, as shown in Fig. 3, take a sample set containing 7 samples R0, R1, R2, R3, R4, R5, and R6. When the segmentation threshold is 1.8 (the dashed line at 1.8 on the vertical axis of Fig. 3), 5 clusters can be selected from the dendrogram: {R0}, {R1}, {R2}, {R3}, {R4}, {R5 and R6}. When the segmentation threshold is 3.5 (the dashed line at 3.5 on the vertical axis of Fig. 3), 3 clusters can be selected: {R0, R1, R2}, {R3}, and {R4, R5, R6}. When the segmentation threshold is 4.5 (the dashed line at 4.5 on the vertical axis of Fig. 3), 2 clusters can be selected: {R0, R1, R2} and {R3, R4, R5, R6}. As can be seen, the larger the segmentation threshold, the fewer clusters are finally obtained.
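The effect of the segmentation threshold can be sketched with a small bottom-up clustering routine that repeatedly merges the two closest clusters and stops once the smallest inter-class distance exceeds the threshold. For mean-distance (average) linkage this should give the same clusters as cutting the dendrogram at that height. The function names and the sample values are hypothetical and do not correspond to Fig. 3.

```python
from itertools import product


def minkowski(x, z, p=2):
    """Minkowski distance of order p between two feature vectors."""
    return sum(abs(a - b) ** p for a, b in zip(x, z)) ** (1.0 / p)


def average_linkage(features, ci, cj, p=2):
    """Mean pairwise distance between two clusters given as index lists."""
    total = sum(minkowski(features[a], features[b], p) for a, b in product(ci, cj))
    return total / (len(ci) * len(cj))


def cut_by_threshold(features, threshold, p=2):
    """Merge the closest clusters until the closest pair is farther than threshold."""
    clusters = [[i] for i in range(len(features))]  # start with one cluster per sample
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = average_linkage(features, clusters[i], clusters[j], p)
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        if d > threshold:
            break  # every remaining merge would cross the cut line
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


# Illustrative one-dimensional features (made-up values).
features = [[0.0], [0.1], [0.2], [5.0], [5.1], [20.0]]
```

On these features, raising the threshold reduces the number of clusters, matching the observation that a larger segmentation threshold yields fewer clusters. The routine is cubic or worse in the number of samples and is meant only to make the procedure concrete.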
In one embodiment, the segmentation threshold may be determined manually and input into the electronic device. The threshold may be chosen according to the following principles: first, no more than 10 clusters are obtained; second, at least one cluster contains significantly more features than the other clusters.
In the present embodiment, the first data cluster and the second data cluster determined from the clustering result and the segmentation threshold may satisfy the following condition: the number of samples in the first data cluster is denoted the first quantity, the number of samples in the second data cluster is denoted the second quantity, and the difference between the first quantity and the second quantity is greater than the first threshold. That is, the first data cluster contains significantly more samples than the second data cluster.
The first data cluster is the cluster containing pictures that do not need cleaning, and the second data cluster is the cluster containing pictures that need cleaning. That is, the category labels of the pictures in the first data cluster are judged to be correct, while the category labels of the pictures in the second data cluster are judged to be possibly wrong.
For example, after obtaining the clustering result for the features F1, F2, F3, ..., F1000, the electronic device may determine the first data cluster and the second data cluster from it. It should be noted that the category labels of the data in the second data cluster are only suspected of being wrong: a given label may actually be wrong, or it may actually be correct.
It should be noted that, in one embodiment, the first data cluster may comprise a single cluster, while the second data cluster may comprise multiple clusters, that is, there may be multiple second data clusters.
For example, the electronic device determines one first data cluster and two second data clusters from the clustering result, where the first data cluster contains 800 samples (for example, 800 pictures) and the two second data clusters together contain 200 samples (for example, 200 pictures).
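One way to pick the first and second data clusters by size is sketched below. The text does not spell out how the difference is evaluated when there are several second data clusters, so this sketch assumes the largest cluster must exceed every other cluster by more than the first threshold. `split_clusters` and that rule are assumptions made for illustration.

```python
def split_clusters(clusters, first_threshold):
    """Return (first_cluster, second_clusters) based on cluster sizes.

    Assumption: the largest cluster is the first data cluster only if it
    is larger than each remaining cluster by more than first_threshold.
    """
    if not clusters:
        raise ValueError("no clusters to split")
    ordered = sorted(clusters, key=len, reverse=True)
    largest, rest = ordered[0], ordered[1:]
    for cluster in rest:
        if len(largest) - len(cluster) <= first_threshold:
            # No clearly dominant cluster; the segmentation threshold
            # probably needs to be adjusted before cleaning.
            raise ValueError("no cluster is dominant enough")
    return largest, rest
```

With one cluster of 800 samples and two further clusters totalling 200 samples, as in the example above, the 800-sample cluster is returned as the first data cluster and the other two as second data clusters, provided the first threshold is below the size gap.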
205. The electronic device determines the data with wrong class labels in the second data cluster and deletes them, obtaining processed data.
206. The electronic device obtains target data according to the data in the first data cluster and the processed data.
For example, after determining the first data cluster and the second data cluster, the electronic device can perform data cleaning on the pictures in the second data cluster to obtain processed data. The electronic device can then obtain target data according to the pictures in the first data cluster and the processed data. It can be understood that the target data are the clean pictures obtained after data cleaning.
For example, the second data cluster contains 5 pictures: P7, P21, P81, P200 and P751. The electronic device can then perform data cleaning on pictures P7, P21, P81, P200 and P751. For example, the electronic device determines that, among these 5 pictures, the class labels of P7, P81 and P200 are correct while those of P21 and P751 are wrong. The electronic device can then delete pictures P21 and P751, obtaining the processed data, namely pictures P7, P81 and P200.
Afterwards, the electronic device can merge the pictures in the first data cluster with P7, P81 and P200 to obtain the cleaned picture set, i.e. a picture set whose class labels are correct.
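Steps 205 and 206 can be illustrated with the five-picture example above. The following is a minimal sketch, assuming pictures are referred to by string identifiers; the function name `clean_and_merge` and the three stand-in pictures of the first data cluster are hypothetical.

```python
from typing import Iterable, List, Set, Tuple


def clean_and_merge(
    first_cluster: Iterable[str],
    second_cluster: Iterable[str],
    wrong_label_ids: Set[str],
) -> Tuple[List[str], List[str]]:
    """Return (processed data, target data)."""
    # Step 205: delete the pictures whose class labels were found to be wrong.
    processed = [p for p in second_cluster if p not in wrong_label_ids]
    # Step 206: merge the first data cluster with the processed data.
    target = list(first_cluster) + processed
    return processed, target


first_cluster = ["P1", "P2", "P3"]  # stand-in for the first data cluster
second_cluster = ["P7", "P21", "P81", "P200", "P751"]
processed, target = clean_and_merge(first_cluster, second_cluster, {"P21", "P751"})
print(processed)  # ['P7', 'P81', 'P200']
```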
In one embodiment, after the second data cluster is determined, whether the class labels of the data in the second data cluster are wrong can be judged by manual inspection. For example, the second data cluster contains 5 pictures: P7, P21, P81, P200 and P751. Through manual inspection, the inspector determines that the class labels of pictures P7, P81 and P200 are correct and that the class labels of P21 and P751 are wrong. The inspector can then input the inspection result into the electronic device, so that the electronic device obtains the information that, among these 5 pictures, the class labels of P7, P81 and P200 are correct and those of P21 and P751 are wrong.
In one embodiment, the process in 202 in which the electronic device performs feature extraction on each data item using the preset feature extraction model to obtain multiple data features may include:
When the computing capability of the electronic device is lower than a second threshold, performing feature extraction on each data item using the preset feature extraction model to obtain multiple data features.
For example, when the data set that needs data cleaning is a picture set, the electronic device can, when its computing capability is lower than the second threshold (that is, when its current computing capability is average or weak), perform feature extraction on each picture using the preset feature extraction model to obtain multiple features.
In another embodiment, this embodiment can further include the following process:
When the computing capability of the electronic device is not lower than the second threshold, obtaining a fourth model and performing feature extraction on each data item using the fourth model to obtain multiple data features, where the feature extraction accuracy of the fourth model is higher than that of the preset feature extraction model.
For example, when the data set that needs data cleaning is a picture set, the electronic device can, when its computing capability is not lower than the second threshold (that is, when its current computing capability is strong), obtain the fourth model and perform feature extraction on each picture using the fourth model to obtain multiple features. The feature extraction accuracy of the fourth model is higher than that of the preset feature extraction model.
For example, compared with the ResNet model used in this embodiment, the fourth model can be a single model with a more complex structure, such as Inception-ResNet-v2. Alternatively, the fourth model can be a fusion (stacking) of multiple models. For example, the structure of the fourth model can be as shown in Fig. 4: the picture data are input into multiple first-level models (Level 1) simultaneously, the features extracted by the first-level models are then used as the input of the second-level model, and the output of the second-level model is finally used as the output feature for subsequent clustering. Model 1, Model 2 and Model 3 can be common deep learning models such as ResNet, Inception and MobileNet, while Model 4 can be a simpler conventional machine learning model such as linear regression. Multi-model fusion combines the advantages of several models and extracts features more effectively, so the subsequent cleaning works better, but it also consumes more resources and is therefore suited to cases where the electronic device has sufficient computing capability.
In one embodiment, the computing capability of the electronic device can be characterized by, for example, the CPU usage, and/or the remaining running-memory capacity, and/or the ratio of the remaining running-memory capacity to the total running-memory capacity.
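The choice between the preset feature extraction model and the fourth model can be sketched as a simple threshold test. The sketch below is purely illustrative: the text does not say how CPU usage and memory ratio are combined, so averaging the idle CPU fraction and the free-memory ratio into one score is an assumption, as are the function names and the threshold value 0.5.

```python
def computing_capability(
    cpu_usage: float, free_memory: int, total_memory: int
) -> float:
    """Hypothetical capability score in [0, 1]; higher means more headroom."""
    idle_cpu = 1.0 - cpu_usage                 # cpu_usage is a fraction in [0, 1]
    free_ratio = free_memory / total_memory    # remaining / total running memory
    # Assumption: the two indicators are combined by a plain average.
    return (idle_cpu + free_ratio) / 2.0


def choose_extractor(capability: float, second_threshold: float) -> str:
    """Pick the model name according to the second threshold."""
    # "Not lower than the second threshold" selects the more accurate fourth model.
    if capability >= second_threshold:
        return "fourth_model"
    return "preset_feature_extraction_model"


score = computing_capability(cpu_usage=0.2, free_memory=6, total_memory=8)
print(choose_extractor(score, second_threshold=0.5))  # fourth_model
```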
In this embodiment, the hierarchical clustering algorithm that the electronic device uses in 203 to cluster the data features can be the AGNES hierarchical clustering algorithm or the like. AGNES is a "bottom-up" clustering method. It does not specify the number of clusters in advance; instead, the segmentation threshold and the required number of clusters are determined from the dendrogram.
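A bottom-up procedure of this kind can be sketched in pure Python as follows. This is a minimal sketch rather than the implementation of the embodiments: it measures sample distance with the Minkowski distance and inter-cluster distance with the mean of all pairwise distances (the choices the embodiments mention for clustering), and it assumes that cutting the dendrogram at a segmentation threshold can be emulated by stopping as soon as the closest pair of clusters is farther apart than that threshold. The names `agnes` and `cut_threshold`, the order `p = 2` and the toy feature vectors are hypothetical.

```python
from itertools import combinations
from typing import List, Sequence


def minkowski(x: Sequence[float], y: Sequence[float], p: float = 2.0) -> float:
    """Minkowski distance of order p between two feature vectors."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)


def average_linkage(features, a: List[int], b: List[int], p: float) -> float:
    """Mean of all pairwise distances between samples of clusters a and b."""
    total = sum(minkowski(features[i], features[j], p) for i in a for j in b)
    return total / (len(a) * len(b))


def agnes(features, cut_threshold: float, p: float = 2.0) -> List[List[int]]:
    """Merge the closest clusters until they are farther apart than the threshold."""
    clusters = [[i] for i in range(len(features))]  # every sample starts alone
    while len(clusters) > 1:
        (ia, ib), dist = min(
            (((a, b), average_linkage(features, clusters[a], clusters[b], p))
             for a, b in combinations(range(len(clusters)), 2)),
            key=lambda item: item[1],
        )
        if dist > cut_threshold:
            break  # equivalent to cutting the dendrogram at this height
        clusters[ia] = clusters[ia] + clusters[ib]
        del clusters[ib]  # ia < ib, so index ia is unaffected
    return sorted(clusters, key=len, reverse=True)


features = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10), (10, 11)]
result = agnes(features, cut_threshold=3.0)
print([sorted(c) for c in result])  # [[0, 1, 2, 3], [4, 5]]
```

The quadratic search over all cluster pairs keeps the sketch short; for a picture set of realistic size a library implementation would normally be preferred.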
Of course, other clustering algorithms can also be used, such as the DIANA algorithm and the K-means algorithm. DIANA is suitable for most cases. It is also a hierarchical clustering algorithm: it first places all objects in a single cluster and then splits that cluster according to some principle until the number of clusters specified by the user is reached or the distance between two clusters exceeds some threshold. K-means requires the number of clusters to be specified in advance, so it is suitable when the number of classes contained in the dirty data is known beforehand. For example, when only pictures of cats are mixed into pictures of dogs, the K-means method can be used for clustering with the final number of clusters set to 2.
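The two-class case (cat pictures mixed into dog pictures) can be sketched with a small pure-Python K-means. This is a simplified sketch under stated assumptions: the 2-D points stand in for picture features, the function name `kmeans` is hypothetical, and the centroids are seeded with the first k points so that the run is deterministic, whereas practical implementations usually use random or k-means++ initialization.

```python
from typing import List, Sequence, Tuple


def kmeans(
    points: Sequence[Sequence[float]], k: int, max_iter: int = 100
) -> Tuple[List[int], List[Tuple[float, ...]]]:
    """Return (cluster label per point, final centroids)."""
    # Assumption: the first k points seed the centroids.
    centroids = [tuple(p) for p in points[:k]]
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centroid.
        new_labels = [
            min(range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments no longer change
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = tuple(sum(v) / len(members) for v in zip(*members))
    return labels, centroids


# Toy features: three points near the origin and two points near (5, 5).
points = [(0.0, 0.0), (5.0, 5.0), (0.2, 0.1), (5.1, 4.9), (0.1, 0.3)]
labels, centroids = kmeans(points, k=2)
print(labels)  # [0, 1, 0, 1, 0]
```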
Please refer to Fig. 5 to Fig. 10, which are schematic diagrams of scenarios of the data processing method provided by the embodiments of the present application.
For example, as shown in Fig. 5, the user currently needs to perform data cleaning on a picture set in which all pictures are annotated with the same class label. The electronic device can first obtain the picture set and perform feature extraction on each picture using the preset feature extraction model to obtain a feature set. For example, the picture set is P = {P1, P2, P3, ..., P1000} and the feature set is F = {F1, F2, F3, ..., F1000}, where feature Fi is the feature of picture Pi and i is an integer greater than or equal to 1.
After extracting the feature of each picture, the electronic device can cluster the feature set F using the AGNES hierarchical clustering algorithm to obtain a hierarchical clustering diagram. After obtaining the hierarchical clustering diagram, the electronic device can display it on its display screen for the user to view, as shown in Fig. 6.
For example, after viewing the hierarchical clustering diagram, the user can determine a segmentation threshold based on experience and input it into the electronic device, as shown in Fig. 7.
After obtaining the segmentation threshold, the electronic device can determine a first picture cluster and a second picture cluster according to the segmentation threshold and the hierarchical clustering diagram. The number of pictures in the first picture cluster is significantly greater than the number of pictures in the second picture cluster.
Afterwards, for example, as shown in Fig. 8, the electronic device places the first picture cluster and the second picture cluster into two separate folders.
Afterwards, the user can use the electronic device to manually review the pictures in the second picture cluster, delete the pictures whose class labels are indeed wrong, pick out the pictures whose class labels are correct, and save those correctly labeled pictures into the folder of the first picture cluster. For example, as shown in Fig. 9, if manual review finds that the class labels of pictures P21 and P751 are indeed wrong, the user can delete these two pictures.
It can be understood that the pictures now contained in the folder corresponding to the first picture cluster are the clean data obtained after data cleaning.
Referring also to Fig. 10, Fig. 10 is the processing flowchart provided by this embodiment.
In this embodiment, the electronic device can use clustering to carry out data cleaning. Clustering quickly locates the data with wrong class labels, and the electronic device then performs data cleaning on this part of the data. The embodiment can therefore obtain clean data quickly. Compared with the related-art approach of manually browsing the data one by one to check whether the label information is wrong, this embodiment greatly reduces the manual workload, improves the efficiency of data cleaning and reduces its cost.
In addition, cleaning data by clustering can achieve an accuracy close to that of manual cleaning. Moreover, the data cleaning process of the approach provided by this embodiment is traceable, so other personnel can check the data cleaning quality through the cleaning process.
Please refer to Fig. 11, which is a structural schematic diagram of the data processing apparatus provided by the embodiments of the present application. The data processing apparatus 300 may include: an obtaining module 301, an extraction module 302, a clustering module 303, a determining module 304, a first processing module 305 and a second processing module 306.
The obtaining module 301 is used to obtain multiple data items, the multiple data items carrying the same class label.
The extraction module 302 is used to extract the feature of each data item to obtain multiple data features.
The clustering module 303 is used to cluster the multiple data features to obtain a clustering result.
The determining module 304 is used to determine a first data cluster and a second data cluster according to the clustering result, where the first data cluster is the cluster containing the data that do not need cleaning and the second data cluster is the cluster containing the data that need cleaning.
The first processing module 305 is used to perform data cleaning on the data in the second data cluster to obtain processed data.
The second processing module 306 is used to obtain target data according to the data in the first data cluster and the processed data.
In one embodiment, the determining module 304 can be used to:
Determine the first data cluster and the second data cluster according to the clustering result, where the number of samples in the first data cluster is denoted as a first quantity, the number of samples in the second data cluster is denoted as a second quantity, the difference between the first quantity and the second quantity is greater than a first threshold, and the second data cluster includes at least one cluster.
In one embodiment, the clustering module 303 can be used to: cluster the multiple data features using a hierarchical clustering algorithm to obtain a clustering result.
Then, the determining module 304 can be used to: obtain a segmentation threshold, and determine the first data cluster and the second data cluster according to the segmentation threshold and the clustering result, where the clustering result is a hierarchical clustering diagram and the segmentation threshold is used to select the required clusters from the hierarchical clustering diagram.
In one embodiment, the obtaining module 301 can be used to:
When the multiple data items are pictures, obtain a first model, the first model being a ResNet model trained on ImageNet;
Train the ResNet model on the multiple data items to obtain a second model;
Remove the fully connected layer that forms the last layer of the second model to obtain a third model, and determine the third model as the preset feature extraction model;
Then, the extraction module 302 can be used to: perform feature extraction on each data item using the preset feature extraction model to obtain multiple data features.
In one embodiment, the extraction module 302 can be used to:
When the computing capability of the electronic device is lower than a second threshold, perform feature extraction on each data item using the preset feature extraction model to obtain multiple data features.
In one embodiment, the extraction module 302 can also be used to:
When the computing capability of the electronic device is not lower than the second threshold, obtain a fourth model and perform feature extraction on each data item using the fourth model to obtain multiple data features, where the feature extraction accuracy of the fourth model is higher than that of the preset feature extraction model.
In one embodiment, the clustering module 303 can be used to:
Cluster the multiple data features using a hierarchical clustering algorithm to obtain a clustering result, where, during clustering, the distance between samples is measured using the Minkowski distance, and the mean of the pairwise distances between samples belonging to different classes is taken as the inter-class distance of the two classes.
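Written out, the two distance measures used during clustering take the following standard form. The notation is an assumption introduced for illustration: x and y are n-dimensional feature vectors, p is the order of the distance (the text does not state its value), and A and B are two classes containing |A| and |B| samples.

```latex
d_p(x, y) = \left( \sum_{k=1}^{n} \lvert x_k - y_k \rvert^{p} \right)^{1/p},
\qquad
D(A, B) = \frac{1}{\lvert A \rvert \, \lvert B \rvert}
          \sum_{x \in A} \sum_{y \in B} d_p(x, y)
```

With p = 2 the sample distance reduces to the Euclidean distance and with p = 1 to the Manhattan distance; D(A, B) is the inter-class distance commonly called average linkage.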
In one embodiment, the first processing module 305 can be used to:
Determine the data with wrong class labels in the second data cluster and delete them, obtaining processed data.
The embodiments of the present application provide a computer-readable storage medium on which a computer program is stored. When the computer program is executed on a computer, the computer is caused to execute the process in the data processing method provided by this embodiment.
The embodiments of the present application also provide an electronic device including a memory and a processor, where the processor, by calling the computer program stored in the memory, executes the process in the data processing method provided by this embodiment.
For example, the above electronic device can be a mobile terminal such as a tablet computer or a smartphone. Please refer to Fig. 12, which is a structural schematic diagram of the electronic device provided by the embodiments of the present application.
The electronic device 400 may include components such as a display screen 401, a memory 402 and a processor 403. Those skilled in the art will appreciate that the structure shown in Fig. 12 does not limit the electronic device, which may include more or fewer components than illustrated, combine certain components, or use a different arrangement of components.
The display screen 401 can be used to display information such as pictures and text.
The memory 402 can be used to store application programs and data. The application programs stored in the memory 402 contain executable code and can form various functional modules. The processor 403 executes various functional applications and data processing by running the application programs stored in the memory 402.
The processor 403 is the control center of the electronic device. It connects all parts of the electronic device through various interfaces and lines, and performs the various functions of the electronic device and processes data by running or executing the application programs stored in the memory 402 and calling the data stored in the memory 402, thereby monitoring the electronic device as a whole.
In this embodiment, the processor 403 in the electronic device loads the executable code corresponding to the processes of one or more application programs into the memory 402 according to the following instructions, and runs the application programs stored in the memory 402, thereby executing:
Obtaining multiple data items, the multiple data items carrying the same class label;
Extracting the feature of each data item to obtain multiple data features;
Clustering the multiple data features to obtain a clustering result;
Determining a first data cluster and a second data cluster according to the clustering result, where the first data cluster is the cluster containing the data that do not need cleaning and the second data cluster is the cluster containing the data that need cleaning;
Performing data cleaning on the data in the second data cluster to obtain processed data;
Obtaining target data according to the data in the first data cluster and the processed data.
Please refer to Fig. 13. The electronic device 400 may include components such as a display screen 401, a memory 402, a processor 403, an input unit 404 and a power supply 405.
The display screen 401 can be used to display information such as pictures and text.
The memory 402 can be used to store application programs and data. The application programs stored in the memory 402 contain executable code and can form various functional modules. The processor 403 executes various functional applications and data processing by running the application programs stored in the memory 402.
The processor 403 is the control center of the electronic device. It connects all parts of the electronic device through various interfaces and lines, and performs the various functions of the electronic device and processes data by running or executing the application programs stored in the memory 402 and calling the data stored in the memory 402, thereby monitoring the electronic device as a whole.
The input unit 404 can be used to receive input numbers, character information or user characteristic information (such as fingerprints), and to generate keyboard, mouse, joystick, optical or trackball signal inputs related to user settings and function control.
The power supply 405 can be used to supply power to the components.
In this embodiment, the processor 403 in the electronic device loads the executable code corresponding to the processes of one or more application programs into the memory 402 according to the following instructions, and runs the application programs stored in the memory 402, thereby executing:
Obtaining multiple data items, the multiple data items carrying the same class label;
Extracting the feature of each data item to obtain multiple data features;
Clustering the multiple data features to obtain a clustering result;
Determining a first data cluster and a second data cluster according to the clustering result, where the first data cluster is the cluster containing the data that do not need cleaning and the second data cluster is the cluster containing the data that need cleaning;
Performing data cleaning on the data in the second data cluster to obtain processed data;
Obtaining target data according to the data in the first data cluster and the processed data.
In one embodiment, when determining the first data cluster and the second data cluster according to the clustering result, the processor 403 can execute: determining the first data cluster and the second data cluster according to the clustering result, where the number of samples in the first data cluster is denoted as a first quantity, the number of samples in the second data cluster is denoted as a second quantity, the difference between the first quantity and the second quantity is greater than a first threshold, and the second data cluster includes at least one cluster.
In one embodiment, when clustering the multiple data features to obtain a clustering result, the processor 403 can execute: clustering the multiple data features using a hierarchical clustering algorithm to obtain a clustering result.
Then, when determining the first data cluster and the second data cluster according to the clustering result, the processor 403 can execute: obtaining a segmentation threshold, and determining the first data cluster and the second data cluster according to the segmentation threshold and the clustering result, where the clustering result is a hierarchical clustering diagram and the segmentation threshold is used to select the required clusters from the hierarchical clustering diagram.
In one embodiment, the processor 403 can also execute: when the multiple data items are pictures, obtaining a first model, the first model being a ResNet model trained on ImageNet; training the ResNet model on the multiple data items to obtain a second model; and removing the fully connected layer that forms the last layer of the second model to obtain a third model, and determining the third model as the preset feature extraction model.
Then, when extracting the feature of each data item to obtain multiple data features, the processor 403 can execute: performing feature extraction on each data item using the preset feature extraction model to obtain multiple data features.
In one embodiment, when performing feature extraction on each data item using the preset feature extraction model to obtain multiple data features, the processor 403 can execute: when the computing capability of the electronic device is lower than a second threshold, performing feature extraction on each data item using the preset feature extraction model to obtain multiple data features.
In one embodiment, the processor 403 can also execute: when the computing capability of the electronic device is not lower than the second threshold, obtaining a fourth model and performing feature extraction on each data item using the fourth model to obtain multiple data features, where the feature extraction accuracy of the fourth model is higher than that of the preset feature extraction model.
In one embodiment, when clustering the multiple data features using a hierarchical clustering algorithm to obtain a clustering result, the processor 403 can execute: clustering the multiple data features using a hierarchical clustering algorithm to obtain a clustering result, where, during clustering, the distance between samples is measured using the Minkowski distance, and the mean of the pairwise distances between samples belonging to different classes is taken as the inter-class distance of the two classes.
In one embodiment, when performing data cleaning on the data in the second data cluster to obtain processed data, the processor 403 can execute: determining the data with wrong class labels in the second data cluster and deleting them, obtaining processed data.
In the above embodiments, the description of each embodiment has its own emphasis. For parts not described in detail in a given embodiment, refer to the detailed description of the data processing method above, which is not repeated here.
The data processing apparatus provided by the embodiments of the present application and the data processing method in the foregoing embodiments belong to the same concept. Any method provided in the data processing method embodiments can be run on the data processing apparatus. Its specific implementation is detailed in the data processing method embodiments and is not repeated here.
It should be noted that, for the data processing method described in the embodiments of the present application, those of ordinary skill in the art can understand that all or part of the process of implementing the method can be completed by a computer program controlling the relevant hardware. The computer program can be stored in a computer-readable storage medium, for example in a memory, and executed by at least one processor, and its execution may include the processes of the embodiments of the data processing method. The storage medium can be a magnetic disk, an optical disc, a read-only memory (ROM), a random access memory (RAM), or the like.
For the data processing apparatus of the embodiments of the present application, its functional modules can be integrated in one processing chip, each module can exist physically on its own, or two or more modules can be integrated in one module. The integrated module can be implemented in the form of hardware or in the form of a software functional module. If the integrated module is implemented in the form of a software functional module and sold or used as an independent product, it can also be stored in a computer-readable storage medium, such as a read-only memory, a magnetic disk or an optical disc.
The data processing method, apparatus, storage medium and electronic device provided by the embodiments of the present application have been described in detail above. Specific examples are used herein to explain the principle and implementation of the present application, and the description of the above embodiments is only intended to help understand the method of the present application and its core idea. Meanwhile, for those skilled in the art, there will be changes in the specific implementation and scope of application according to the idea of the present application. In summary, the content of this specification should not be construed as limiting the present application.

Claims (11)

1. A data processing method, characterized by comprising:
Obtaining multiple data items, the multiple data items carrying the same class label;
Extracting the feature of each data item to obtain multiple data features;
Clustering the multiple data features to obtain a clustering result;
Determining a first data cluster and a second data cluster according to the clustering result, wherein the first data cluster is the cluster containing the data that do not need cleaning and the second data cluster is the cluster containing the data that need cleaning;
Performing data cleaning on the data in the second data cluster to obtain processed data;
Obtaining target data according to the data in the first data cluster and the processed data.
2. The data processing method according to claim 1, characterized in that determining the first data cluster and the second data cluster according to the clustering result comprises:
Determining the first data cluster and the second data cluster according to the clustering result, wherein the number of samples in the first data cluster is denoted as a first quantity, the number of samples in the second data cluster is denoted as a second quantity, the difference between the first quantity and the second quantity is greater than a first threshold, and the second data cluster includes at least one cluster.
3. The data processing method according to claim 2, characterized in that clustering the multiple data features to obtain a clustering result comprises: clustering the multiple data features using a hierarchical clustering algorithm to obtain a clustering result;
Determining the first data cluster and the second data cluster according to the clustering result comprises: obtaining a segmentation threshold, and determining the first data cluster and the second data cluster according to the segmentation threshold and the clustering result, wherein the clustering result is a hierarchical clustering diagram and the segmentation threshold is used to select the required clusters from the hierarchical clustering diagram.
4. The data processing method according to claim 1, characterized in that the method further comprises:
When the multiple data items are pictures, obtaining a first model, the first model being a ResNet model trained on ImageNet;
Training the ResNet model on the multiple data items to obtain a second model;
Removing the fully connected layer that forms the last layer of the second model to obtain a third model, and determining the third model as the preset feature extraction model;
Extracting the feature of each data item to obtain multiple data features comprises: performing feature extraction on each data item using the preset feature extraction model to obtain multiple data features.
5. The data processing method according to claim 4, characterized in that performing feature extraction on each data item using the preset feature extraction model to obtain multiple data features comprises:
When the computing capability of the electronic device is lower than a second threshold, performing feature extraction on each data item using the preset feature extraction model to obtain multiple data features.
6. The data processing method according to claim 5, characterized in that the method further comprises:
When the computing capability of the electronic device is not lower than the second threshold, obtaining a fourth model and performing feature extraction on each data item using the fourth model to obtain multiple data features, wherein the feature extraction accuracy of the fourth model is higher than that of the preset feature extraction model.
7. The data processing method according to claim 3, characterized in that clustering the multiple data features using a hierarchical clustering algorithm to obtain a clustering result comprises:
Clustering the multiple data features using a hierarchical clustering algorithm to obtain a clustering result, wherein, during clustering, the distance between samples is measured using the Minkowski distance, and the mean of the pairwise distances between samples belonging to different classes is taken as the inter-class distance of the two classes.
8. The data processing method according to claim 1, characterized in that performing data cleaning on the data in the second data cluster to obtain processed data comprises:
Determining the data with wrong class labels in the second data cluster and deleting them, obtaining processed data.
9. A data processing apparatus, characterized by comprising:
An obtaining module, used to obtain multiple data items, the multiple data items carrying the same class label;
An extraction module, used to extract the feature of each data item to obtain multiple data features;
A clustering module, used to cluster the multiple data features to obtain a clustering result;
A determining module, used to determine a first data cluster and a second data cluster according to the clustering result, wherein the first data cluster is the cluster containing the data that do not need cleaning and the second data cluster is the cluster containing the data that need cleaning;
A first processing module, used to perform data cleaning on the data in the second data cluster to obtain processed data;
A second processing module, used to obtain target data according to the data in the first data cluster and the processed data.
10. A storage medium on which a computer program is stored, characterized in that, when the computer program is executed on a computer, the computer is caused to execute the method according to any one of claims 1 to 8.
11. An electronic device, comprising a memory and a processor, characterized in that the processor, by calling the computer program stored in the memory, executes the method according to any one of claims 1 to 8.
CN201910713732.8A 2019-08-02 2019-08-02 Data processing method, data processing device, storage medium and electronic equipment Active CN110472082B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910713732.8A CN110472082B (en) 2019-08-02 2019-08-02 Data processing method, data processing device, storage medium and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910713732.8A CN110472082B (en) 2019-08-02 2019-08-02 Data processing method, data processing device, storage medium and electronic equipment

Publications (2)

Publication Number Publication Date
CN110472082A (en) 2019-11-19
CN110472082B (en) 2022-04-01

Family

ID=68509390

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910713732.8A Active CN110472082B (en) 2019-08-02 2019-08-02 Data processing method, data processing device, storage medium and electronic equipment

Country Status (1)

Country Link
CN (1) CN110472082B (en)

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111340084A (en) * 2020-02-20 2020-06-26 北京市商汤科技开发有限公司 Data processing method and device, processor, electronic equipment and storage medium
CN111460195A (en) * 2020-03-26 2020-07-28 Oppo广东移动通信有限公司 Picture processing method and device, storage medium and electronic equipment
CN112256766A (en) * 2020-11-02 2021-01-22 浙江八达电子仪表有限公司 Power consumption behavior analysis method for energy collection terminal
CN112348107A (en) * 2020-11-17 2021-02-09 百度(中国)有限公司 Image data cleaning method and apparatus, electronic device, and medium
CN112418169A (en) * 2020-12-10 2021-02-26 上海芯翌智能科技有限公司 Method and equipment for processing human body attribute data
CN112465020A (en) * 2020-11-25 2021-03-09 创新奇智(合肥)科技有限公司 Training data set generation method and device, electronic equipment and storage medium
CN113518058A (en) * 2020-04-09 2021-10-19 中国移动通信集团海南有限公司 Abnormal login behavior detection method and device, storage medium and computer equipment
CN114638322A (en) * 2022-05-20 2022-06-17 南京大学 Full-automatic target detection system and method based on given description in open scene
CN117235448A (en) * 2023-11-14 2023-12-15 北京阿丘科技有限公司 Data cleaning method, terminal equipment and storage medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101430708A (en) * 2008-11-21 2009-05-13 哈尔滨工业大学深圳研究生院 Blog hierarchy classification tree construction method based on label clustering
US20130002810A1 (en) * 2011-06-30 2013-01-03 Stauder Juergen Outlier detection for colour mapping
CN105678232A (en) * 2015-12-30 2016-06-15 中通服公众信息产业股份有限公司 Face image feature extraction and comparison method based on deep learning
CN106547893A (en) * 2016-11-03 2017-03-29 福建中金在线信息科技有限公司 A kind of photo sort management system and photo sort management method

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
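曹科研 (Cao Keyan): "Clustering Analysis and Outlier Detection Algorithms for Uncertain Data" (不确定数据的聚类分析与异常点检测算法), China Doctoral Dissertations Full-text Database, Information Science and Technology series *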

Cited By (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111340084A (en) * 2020-02-20 2020-06-26 北京市商汤科技开发有限公司 Data processing method and device, processor, electronic equipment and storage medium
CN111340084B (en) * 2020-02-20 2024-05-17 北京市商汤科技开发有限公司 Data processing method and device, processor, electronic equipment and storage medium
CN111460195A (en) * 2020-03-26 2020-07-28 Oppo广东移动通信有限公司 Picture processing method and device, storage medium and electronic equipment
CN113518058B (en) * 2020-04-09 2022-12-13 中国移动通信集团海南有限公司 Abnormal login behavior detection method and device, storage medium and computer equipment
CN113518058A (en) * 2020-04-09 2021-10-19 中国移动通信集团海南有限公司 Abnormal login behavior detection method and device, storage medium and computer equipment
CN112256766A (en) * 2020-11-02 2021-01-22 浙江八达电子仪表有限公司 Power consumption behavior analysis method for energy collection terminal
CN112348107A (en) * 2020-11-17 2021-02-09 百度(中国)有限公司 Image data cleaning method and apparatus, electronic device, and medium
CN112465020A (en) * 2020-11-25 2021-03-09 创新奇智(合肥)科技有限公司 Training data set generation method and device, electronic equipment and storage medium
CN112418169A (en) * 2020-12-10 2021-02-26 上海芯翌智能科技有限公司 Method and equipment for processing human body attribute data
CN114638322A (en) * 2022-05-20 2022-06-17 南京大学 Full-automatic target detection system and method based on given description in open scene
CN114638322B (en) * 2022-05-20 2022-09-13 南京大学 Full-automatic target detection system and method based on given description in open scene
CN117235448A (en) * 2023-11-14 2023-12-15 北京阿丘科技有限公司 Data cleaning method, terminal equipment and storage medium
CN117235448B (en) * 2023-11-14 2024-02-06 北京阿丘科技有限公司 Data cleaning method, terminal equipment and storage medium

Also Published As

Publication number Publication date
CN110472082B (en) 2022-04-01

Similar Documents

Publication Publication Date Title
CN110472082A (en) Data processing method, device, storage medium and electronic equipment
Xie et al. Unseen object instance segmentation for robotic environments
Patel Hands-on unsupervised learning using Python: how to build applied machine learning solutions from unlabeled data
Sharma et al. An analysis of convolutional neural networks for image classification
CN112632385A (en) Course recommendation method and device, computer equipment and medium
CN110321477A (en) Information recommendation method, device, terminal and storage medium
CN107209860A (en) Optimize multiclass image classification using blocking characteristic
CN104573130B (en) The entity resolution method and device calculated based on colony
Zhang et al. Sequential optimization for efficient high-quality object proposal generation
CN109471944A (en) Training method, device and the readable storage medium storing program for executing of textual classification model
CN110490237A (en) Data processing method, device, storage medium and electronic equipment
CN106874292A (en) Topic processing method and processing device
CN102201062A (en) Information processing apparatus, method and program
CN110276406A (en) Expression classification method, apparatus, computer equipment and storage medium
CN109284675A (en) A kind of recognition methods of user, device and equipment
CN110580489B (en) Data object classification system, method and equipment
CN108228844A (en) A kind of picture screening technique and device, storage medium, computer equipment
CN107545276A (en) The various visual angles learning method of joint low-rank representation and sparse regression
CN110110113A (en) Image search method, system and electronic device
CN111737479B (en) Data acquisition method and device, electronic equipment and storage medium
CN108536784A (en) Comment information sentiment analysis method, apparatus, computer storage media and server
Somnugpong et al. Content-based image retrieval using a combination of color correlograms and edge direction histogram
Yang et al. Multi-scale bidirectional fcn for object skeleton extraction
CN110458600A (en) Portrait model training method, device, computer equipment and storage medium
CN109598285A (en) A kind of processing method of model, device and equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant