CN108764319A - Sample classification method and apparatus - Google Patents

Publication number
CN108764319A
Authority
CN
China
Prior art keywords
sub-cluster
similarity
cluster centre
sample
training sample
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201810487963.7A
Other languages
Chinese (zh)
Inventor
张明阳
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Jingdong Century Trading Co Ltd
Beijing Jingdong Shangke Information Technology Co Ltd
Original Assignee
Beijing Jingdong Century Trading Co Ltd
Beijing Jingdong Shangke Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Jingdong Century Trading Co Ltd, Beijing Jingdong Shangke Information Technology Co Ltd filed Critical Beijing Jingdong Century Trading Co Ltd
Priority to CN201810487963.7A priority Critical patent/CN108764319A/en
Publication of CN108764319A publication Critical patent/CN108764319A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2413Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on distances to training or reference patterns
    • G06F18/24147Distances to closest patterns, e.g. nearest neighbour classification
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/23Clustering techniques
    • G06F18/232Non-hierarchical techniques
    • G06F18/2321Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Probability & Statistics with Applications (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention discloses a sample classification method and apparatus, relating to the field of computer technology. One embodiment of the method includes: calculating the similarity between a test sample and the cluster center of each of multiple sub-clusters, and determining a selection interval from the similarity and a preset threshold, where the sub-clusters are obtained by clustering a training sample set; selecting, from the sub-cluster whose cluster center has the highest similarity to the test sample, the training samples whose similarity to that cluster center falls within the selection interval; and using the selected training samples as a new training sample set to classify the test sample. For each test sample, the method selects training samples from the sub-cluster whose cluster center is most similar to it, according to the determined selection interval, and classifies the test sample using only the selected samples. This reduces the number of training samples used in the subsequent classification and improves sample classification efficiency in big-data environments.

Description

Sample classification method and apparatus
Technical field
The present invention relates to the field of computers, and in particular to a sample classification method and apparatus.
Background
The K-nearest-neighbor (KNN) algorithm is widely used in many fields, such as face recognition, gene classification, and decision support, because it is simple and easy to implement. Its basic idea is: for a given test sample x, find its K nearest samples in the training sample set, and determine the class of x from the classes of those K nearest samples.
In the course of making the present invention, the inventor found at least the following problem in the prior art: when searching for the nearest samples, the KNN algorithm must compute the distance (or similarity) between the test sample and every training sample in the training sample set, one by one. When the training sample set is big data, this computation incurs very high overhead, making the algorithm very inefficient or even infeasible.
Summary of the invention
In view of this, embodiments of the present invention provide a sample classification method and apparatus that, for each test sample, select training samples from the sub-cluster whose cluster center is most similar to the test sample, according to a determined selection interval, and classify the test sample using the selected training samples, thereby improving sample classification efficiency in big-data environments.
To achieve the above object, according to one aspect of the embodiments of the present invention, a sample classification method is provided.
A sample classification method according to an embodiment of the present invention includes: calculating the similarity between a test sample and the cluster centers of multiple sub-clusters, and determining a selection interval from the similarity and a preset threshold, where the sub-clusters are obtained by clustering a training sample set; selecting, from the sub-cluster corresponding to the cluster center with the highest similarity, the training samples whose similarity to that cluster center falls within the selection interval; and using the selected training samples as a new training sample set to classify the test sample.
Optionally, determining the selection interval from the similarity and the preset threshold includes: decreasing the highest similarity by the preset threshold and using the result as the minimum of the selection interval; and increasing the highest similarity by the threshold and using the result as the maximum of the selection interval.
Optionally, before the step of calculating the similarity between the test sample and the cluster centers of the multiple sub-clusters, the method further includes: clustering the training sample set to obtain the multiple sub-clusters; and determining the cluster center of each sub-cluster.
Optionally, before the step of determining the cluster center of each sub-cluster, the method further includes: compressing each sub-cluster. Determining the cluster center of each sub-cluster then includes: determining the cluster center of each compressed sub-cluster.
Optionally, determining the cluster center of each compressed sub-cluster includes: calculating the coordinate average of all training samples in each compressed sub-cluster, the coordinate average being the coordinates of the cluster center of that compressed sub-cluster.
To achieve the above object, according to another aspect of the embodiments of the present invention, a sample classification apparatus is provided.
A sample classification apparatus according to an embodiment of the present invention includes: a determining module, configured to calculate the similarity between a test sample and the cluster centers of multiple sub-clusters, and to determine a selection interval from the similarity and a preset threshold, where the sub-clusters are obtained by clustering a training sample set; a selection module, configured to select, from the sub-cluster corresponding to the cluster center with the highest similarity, the training samples whose similarity to that cluster center falls within the selection interval; and a classification module, configured to use the selected training samples as a new training sample set to classify the test sample.
Optionally, the determining module is further configured to: decrease the highest similarity by the preset threshold and use the result as the minimum of the selection interval; and increase the highest similarity by the threshold and use the result as the maximum of the selection interval.
Optionally, the apparatus further includes: a clustering determining module, configured to cluster the training sample set to obtain the multiple sub-clusters, and to determine the cluster center of each sub-cluster.
Optionally, the apparatus further includes: a compression module, configured to compress each sub-cluster. The clustering determining module is further configured to determine the cluster center of each compressed sub-cluster.
Optionally, the clustering determining module is further configured to: calculate the coordinate average of all training samples in each compressed sub-cluster, the coordinate average being the coordinates of the cluster center of that compressed sub-cluster.
To achieve the above object, according to yet another aspect of the embodiments of the present invention, an electronic device is provided.
An electronic device according to an embodiment of the present invention includes: one or more processors; and a storage device for storing one or more programs which, when executed by the one or more processors, cause the one or more processors to implement the sample classification method of the embodiments of the present invention.
To achieve the above object, according to yet another aspect of the embodiments of the present invention, a computer-readable medium is provided.
A computer-readable medium according to an embodiment of the present invention has a computer program stored thereon which, when executed by a processor, implements the sample classification method of the embodiments of the present invention.
An embodiment of the above invention has the following advantages or beneficial effects. For each test sample, training samples are selected, according to the determined selection interval, from the sub-cluster whose cluster center is most similar to the test sample, and the test sample is classified using the selected training samples; this reduces the number of training samples used in the subsequent classification and improves sample classification efficiency in big-data environments. Determining the selection interval from the similarity and a preset threshold makes it easy to adjust which training samples are used for classification, giving good extensibility. Clustering the training sample set and determining the cluster center of each sub-cluster both preserves classification accuracy and reduces the number of training samples, improving classification efficiency. Compressing each sub-cluster before calculating its cluster center further reduces the number of training samples and further improves classification efficiency.
Further effects of the above optional implementations are described below in conjunction with the detailed description.
Brief description of the drawings
The drawings are provided for a better understanding of the present invention and do not constitute an undue limitation on it. In the drawings:
Fig. 1 is a schematic diagram of the main steps of a sample classification method according to an embodiment of the present invention;
Fig. 2 is a schematic diagram of the main flow of a sample classification method according to an embodiment of the present invention;
Fig. 3 is a schematic diagram of the classification principle of the sample classification method of an embodiment of the present invention;
Fig. 4 is a schematic diagram of the classification result of the sample classification method of an embodiment of the present invention;
Fig. 5 is a schematic diagram of the main modules of a sample classification apparatus according to an embodiment of the present invention;
Fig. 6 is a diagram of an exemplary system architecture to which embodiments of the present invention can be applied;
Fig. 7 is a structural schematic diagram of a computer system suitable for implementing the electronic device of an embodiment of the present invention.
Detailed description
Exemplary embodiments of the present invention are described below with reference to the drawings. The description includes various details of the embodiments to aid understanding, and these should be regarded as merely exemplary. Those of ordinary skill in the art will therefore recognize that various changes and modifications can be made to the embodiments described herein without departing from the scope and spirit of the present invention. Likewise, descriptions of well-known functions and structures are omitted below for clarity and conciseness.
Fig. 1 is a schematic diagram of the main steps of a sample classification method according to an embodiment of the present invention. As shown in Fig. 1, the sample classification method of the embodiment mainly includes the following steps:
Step S101: Calculate the similarity between the test sample and the cluster centers of multiple sub-clusters, and determine a selection interval from the similarity and a preset threshold, where the sub-clusters are obtained by clustering the training sample set. Before any test sample is classified, a clustering algorithm is used to cluster the training sample set into multiple sub-clusters, and the cluster center of each sub-cluster is determined. The similarity can be calculated using Euclidean distance, cosine distance, Chebyshev distance, and so on; the preset threshold can be a value between 0 and the maximum similarity. The selection interval can be determined from the similarity and the preset threshold as follows: decrease the highest similarity by the preset threshold and use the result as the minimum of the selection interval; increase the highest similarity by the threshold and use the result as the maximum of the selection interval.
Step S102: From the sub-cluster corresponding to the cluster center with the highest similarity, select the training samples whose similarity to that cluster center falls within the selection interval. Taking similarity calculated with the Euclidean distance formula as an example, the training samples are selected as follows: from the sub-cluster corresponding to the cluster center with the smallest Euclidean distance to the test sample, select the training samples whose Euclidean distance to that cluster center falls within the selection interval.
Step S103: Use the selected training samples as a new training sample set to classify the test sample. Using the KNN algorithm, find the K training samples nearest to the test sample in the new training sample set; the majority class among these K training samples is the class of the test sample. K can be user-defined and is generally set to an odd number; in the present invention it may be set to 3, 5, 7, and so on.
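Steps S101 to S103 can be sketched end to end as follows. This is a minimal illustration rather than the patented implementation: it assumes Euclidean distance as the similarity measure (so the highest similarity is the smallest distance), represents each sub-cluster as a list of (point, label) pairs, and returns None when no sample falls in the selection interval, a case the text does not address. The names `euclidean` and `classify` are hypothetical.

```python
import math
from collections import Counter

def euclidean(a, b):
    """Euclidean distance between two equal-length coordinate tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def classify(test, clusters, centers, m, k=3):
    """clusters[i] is a list of (point, label); centers[i] is its cluster center."""
    # S101: distance to every cluster center; the nearest center is the most similar.
    dists = [euclidean(test, c) for c in centers]
    best = min(range(len(centers)), key=dists.__getitem__)
    d = dists[best]
    low, high = d - m, d + m  # selection interval [d - m, d + m]

    # S102: keep samples whose distance to the chosen center lies in the interval.
    selected = [(p, lab) for p, lab in clusters[best]
                if low <= euclidean(p, centers[best]) <= high]
    if not selected:
        return None  # assumption: the source does not say what to do here

    # S103: KNN majority vote over the reduced training set.
    nearest = sorted(selected, key=lambda s: euclidean(test, s[0]))[:k]
    return Counter(lab for _, lab in nearest).most_common(1)[0][0]
```

Only the centers and the samples of one sub-cluster are compared against the test sample, which is where the saving over brute-force KNN comes from.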
Fig. 2 is a schematic diagram of the main flow of a sample classification method according to an embodiment of the present invention. As shown in Fig. 2, the sample classification method of the embodiment mainly includes the following steps:
Step S201: Cluster the training sample set using a clustering algorithm to obtain multiple sub-clusters. Common clustering algorithms can all be used in the present invention, for example the K-means algorithm among partition-based methods, Density-Based Spatial Clustering of Applications with Noise (DBSCAN) among density-based methods, and the Gaussian mixture model (GMM) among model-based methods, so as to maintain high classification accuracy. The sample set is divided into a training sample set and a test sample set according to a preset ratio, for example 7:3; the training sample set contains multiple training samples and the test sample set contains multiple test samples. Because the KNN algorithm has no training phase, this step clusters the training sample set with a clustering algorithm, which in effect introduces a training phase.
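As one illustration of step S201, the sketch below implements plain K-means (Lloyd's algorithm) with the standard library only. It is not prescribed by the source: taking the first k points as initial centers is an assumption made here for determinism, and the function name `kmeans` is hypothetical. DBSCAN or a GMM could be substituted as the text notes.

```python
import math

def kmeans(points, k, iters=100):
    """Partition points (coordinate tuples) into k sub-clusters.

    Returns (clusters, centers). Initial centers are the first k points,
    which is an assumption of this sketch, not part of the source.
    """
    centers = [tuple(p) for p in points[:k]]
    assign = None
    for _ in range(iters):
        # Assignment step: each point goes to its nearest center.
        new_assign = [min(range(k), key=lambda j: math.dist(p, centers[j]))
                      for p in points]
        if new_assign == assign:
            break  # converged: assignments no longer change
        assign = new_assign
        # Update step: move each center to the mean of its members.
        for j in range(k):
            members = [p for p, a in zip(points, assign) if a == j]
            if members:  # keep the old center if a cluster becomes empty
                centers[j] = tuple(sum(axis) / len(members)
                                   for axis in zip(*members))
    clusters = [[p for p, a in zip(points, assign) if a == j]
                for j in range(k)]
    return clusters, centers
```

Labels are not needed for clustering; in practice each point would carry its class label along so that the sub-clusters can later be used for KNN voting.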
Step S202: Compress each sub-cluster to obtain a compressed cluster, and determine the cluster center of each compressed cluster. Each sub-cluster is compressed using the condensed nearest neighbor method or the edited nearest neighbor method, so that each sub-cluster retains as few training samples as possible while the KNN algorithm can still correctly classify all training samples in the sub-cluster. The cluster center of each compressed cluster is determined as follows: calculate the coordinate average of all training samples in each compressed sub-cluster; this coordinate average is the coordinates of the cluster center of that compressed sub-cluster.
The condensed nearest neighbor method can greatly reduce the number of samples in a sample set. The algorithm proceeds as follows:
(1) Divide the training set R into two sample sets, A and B, where A is initially empty.
(2) Randomly select one sample from the training set R and put it into A; put the other samples into B, and use the samples in A to classify each sample in B. If sample i is classified correctly (the predicted class is the same as the sample's own class), put it back into B; otherwise add it to A.
(3) Repeat the above process until all samples in B are classified correctly.
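The three steps above can be sketched as follows. The sketch assumes 1-nearest-neighbor with Euclidean distance as the classification rule, which the text does not fix, and the names `nearest_label` and `condense` are hypothetical. Samples are (point, label) pairs.

```python
import math
import random

def nearest_label(x, store):
    """Label of the sample in store closest to x (1-NN; an assumption here)."""
    return min(store, key=lambda s: math.dist(x, s[0]))[1]

def condense(samples, seed=0):
    """Return the condensed set A for a list of (point, label) samples."""
    rng = random.Random(seed)
    pool = list(samples)
    # Steps (1)-(2): one random sample seeds A, everything else starts in B.
    A = [pool.pop(rng.randrange(len(pool)))]
    B = pool
    changed = True
    while changed:  # step (3): repeat until a full pass moves nothing
        changed = False
        remaining = []
        for s in B:
            if nearest_label(s[0], A) == s[1]:
                remaining.append(s)  # classified correctly: stays in B
            else:
                A.append(s)          # misclassified: moves to A
                changed = True
        B = remaining
    return A
```

On two well-separated classes the result typically contains one sample per class, since a single stored sample is enough to classify the rest of its class correctly.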
The principle of the edited nearest neighbor method is as follows: given a training set R and a classification rule C, let S be the set of samples misclassified by rule C; delete these samples from the training set R to obtain R = R - S. Repeat this process until a stopping criterion is met. When the process ends, every sample in the training set R is correctly classified by rule C. The algorithm proceeds as follows:
(1) Randomly divide the training set R into N groups.
(2) For i = 1, 2, ..., N, use the union of the remaining (N-1) groups as the training set and apply K-nearest-neighbor classification to each sample in the i-th group. If a sample is misclassified, add it to the set S.
(3) Delete the samples in S from the training set R to form the new data set R = R - S. Repeat the above process until no misclassified sample has appeared in the most recent I iterations.
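A minimal sketch of this editing procedure is given below, assuming Euclidean distance and majority-vote KNN as the classification rule C. The parameter `patience` plays the role of I in step (3); `max_rounds` is a safety cap added here and is not part of the source. The names `knn_label` and `edit` are hypothetical, and `n_groups` must be at least 2.

```python
import math
import random
from collections import Counter

def knn_label(x, train, k):
    """Majority label among the k samples of train nearest to x."""
    nearest = sorted(train, key=lambda s: math.dist(x, s[0]))[:k]
    return Counter(lab for _, lab in nearest).most_common(1)[0][0]

def edit(samples, n_groups=3, k=3, patience=2, seed=0, max_rounds=100):
    """Remove misclassified (point, label) samples until none appear
    for `patience` consecutive rounds."""
    rng = random.Random(seed)
    R = list(samples)
    clean_rounds = 0
    for _ in range(max_rounds):
        if clean_rounds >= patience or len(R) < n_groups:
            break
        rng.shuffle(R)                                   # step (1): random groups
        groups = [R[i::n_groups] for i in range(n_groups)]
        S = []
        for i, group in enumerate(groups):               # step (2)
            rest = [s for j, g in enumerate(groups) if j != i for s in g]
            S.extend(s for s in group if knn_label(s[0], rest, k) != s[1])
        if S:                                            # step (3): R = R - S
            R = [s for s in R if s not in S]
            clean_rounds = 0
        else:
            clean_rounds += 1
    return R
```

Unlike condensing, which keeps a small subset, editing mainly removes noisy or mislabeled samples, so the resulting set is usually much closer in size to the original.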
The compressed cluster obtained with the condensed nearest neighbor method is the sample set A above; the compressed cluster obtained with the edited nearest neighbor method is the training set R after the misclassified samples have been deleted.
In a preferred embodiment, suppose a compressed cluster contains three training samples with coordinates (x1, y1), (x2, y2), and (x3, y3). The coordinates of the cluster center of these three training samples are then ((x1+x2+x3)/3, (y1+y2+y3)/3).
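The same coordinate average extends to any number of samples and dimensions. A short sketch, with the hypothetical helper name `centroid`:

```python
def centroid(points):
    """Coordinate-wise mean of a non-empty list of equal-length tuples."""
    n = len(points)
    return tuple(sum(axis) / n for axis in zip(*points))
```

For three samples (1, 2), (3, 4), and (5, 9), this gives ((1+3+5)/3, (2+4+9)/3) = (3.0, 5.0).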
Step S203: In each compressed cluster, calculate the similarity between each training sample and the corresponding cluster center, and sort the samples in a preset order. The order is user-defined. In a preferred embodiment, similarity is calculated using Euclidean distance: in each compressed cluster, calculate the Euclidean distance from each training sample to the corresponding cluster center, and sort in ascending order.
Step S204: Calculate the similarity between the test sample and the cluster center of each compressed cluster, to find the compressed cluster corresponding to the cluster center with the highest similarity. In a preferred embodiment, similarity is calculated using Euclidean distance: calculate the Euclidean distance between the test sample and the cluster center of each compressed cluster, and find the compressed cluster corresponding to the cluster center with the smallest Euclidean distance.
Step S205: Select at least one training sample from the compressed cluster corresponding to the cluster center with the highest similarity. This step first determines the selection interval T from the similarity d between the test sample and the most similar cluster center and the preset threshold m; it then selects, from the compressed cluster corresponding to that cluster center, the training samples whose similarity to the cluster center falls within the selection interval T.
In a preferred embodiment, suppose the minimum Euclidean distance is d and the preset threshold is m; the selection interval is then T = [d-m, d+m], with 0 < m < d. From the compressed cluster corresponding to this minimum Euclidean distance, select all training samples whose Euclidean distance to the cluster center of that compressed cluster lies in [d-m, d+m].
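One plausible reason for the ascending sort in step S203 is that it lets the interval query in step S205 be answered by binary search instead of a full scan. The sketch below illustrates this under the Euclidean-distance embodiment; the use of `bisect` and the names `presort` and `select` are assumptions of this sketch, not something the source specifies.

```python
import bisect
import math

def presort(cluster, center):
    """S203: order (point, label) samples by ascending distance to center.

    Returns (keys, ordered), where keys[i] is the distance of ordered[i].
    """
    ordered = sorted(cluster, key=lambda s: math.dist(s[0], center))
    keys = [math.dist(p, center) for p, _ in ordered]
    return keys, ordered

def select(keys, ordered, d, m):
    """S205: samples whose distance to the center lies in [d - m, d + m]."""
    lo = bisect.bisect_left(keys, d - m)    # first key >= d - m
    hi = bisect.bisect_right(keys, d + m)   # one past the last key <= d + m
    return ordered[lo:hi]
```

The sort is done once per compressed cluster, before any test sample arrives; each query then costs two binary searches plus the size of the returned slice.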
Step S206: Use the selected training samples as a new training sample set and classify the test sample with the KNN algorithm. The new training sample set is ranked by distance to the test sample, and the predicted class of the test sample is determined and output from the configured value of K and the majority class among the K nearest training samples. Steps S204 to S206 are repeated to classify each test sample in the test sample set.
Fig. 3 is a schematic diagram of the classification principle of the sample classification method of an embodiment of the present invention. The diagram corresponds to steps S201 and S202. As shown in Fig. 3, the training sample set is divided into 5 sub-clusters, C1 to C5; each sub-cluster is compressed separately, giving 5 compressed clusters, C'1 to C'5; and the cluster center of each compressed cluster is calculated, giving O1 to O5. Suppose the training sample set contains 1000 training samples and is clustered with the K-means algorithm (with K=5); assuming the samples are divided evenly, each sub-cluster then contains 200 training samples. If each sub-cluster is compressed with the condensed nearest neighbor method (with an assumed compression ratio of 10%), each compressed cluster contains only 20 training samples.
Fig. 4 is a schematic diagram of the classification result of the sample classification method of an embodiment of the present invention. As shown in Fig. 4, the black circle in the middle is the cluster center O of the compressed cluster with the smallest Euclidean distance to the test sample, the hollow circle is the test sample D, and the Euclidean distance between the cluster center O and the test sample D is d (the length of OD). On the line through O and D (including its extension), find the two points at distance m from the test sample D; call them E and F. Then draw two circles centered at O with radii OE and OF, that is, with radii d-m and d+m, and select the training samples in the compressed cluster whose distance to O lies between d-m and d+m (between the two circles) as the new training sample set. In a preferred embodiment, the training samples inside the circle of radius m centered at the test sample D are selected as the new training sample set.
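The two selection regions in Fig. 4 are related by the triangle inequality. Writing x for a training sample in the compressed cluster (x is notation introduced here, not in the source), any sample within distance m of D necessarily lies between the two circles:

```latex
\[
\bigl|\,\lVert x - O\rVert - \lVert D - O\rVert\,\bigr| \;\le\; \lVert x - D\rVert \;\le\; m
\quad\Longrightarrow\quad
d - m \;\le\; \lVert x - O\rVert \;\le\; d + m ,
\qquad d = \lVert D - O\rVert .
\]
```

The disc of radius m around D is therefore contained in the ring between the two circles. The converse does not hold: the ring also contains samples on the far side of O that can be as far as 2d + m from D, so the ring is a cheaper but coarser filter than the disc.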
In another preferred embodiment, the order of steps S201 and S202 can be swapped: first compress the training sample set with the condensed nearest neighbor method or the edited nearest neighbor method to obtain one compressed cluster, then cluster the compressed cluster with a clustering algorithm to obtain multiple sub-clusters. The order of steps S203 and S204 can also be swapped: first calculate the similarity between the test sample and the cluster center of each compressed cluster to find the compressed cluster corresponding to the cluster center with the highest similarity; then, in that compressed cluster, calculate the similarity of each training sample to the cluster center and sort in the preset order.
To verify the effectiveness of the proposed sample classification method, we ran comparative experiments on four UCI (University of California, Irvine) datasets, comparing running time and test accuracy against the traditional KNN algorithm. The four UCI datasets are Forest CoverType (forest cover type dataset), Skin Segmentation (skin segmentation dataset), Statlog (German credit card information dataset), and Cmc (the global snow depth raster dataset generated by the Canadian Meteorological Centre). The UCI datasets are machine learning datasets published by the University of California, Irvine. Forest CoverType and Skin Segmentation are large datasets, Statlog is a medium-sized dataset, and Cmc is a small dataset. Table 1 gives basic information on the four UCI datasets used in the experiments.
Table 1
Table 2 gives the running times of the present invention and the traditional KNN algorithm on Cmc; Table 3, on Statlog; Table 4, on Forest CoverType; and Table 5, on Skin Segmentation. Running times in each table (both the running time on each test set and the average) are in seconds (s). Table 6 compares the average test accuracy (%) of the present invention and the traditional KNN algorithm.
Table 2
Table 3
Table 4
Table 5
Table 6
The above experimental results show that, while maintaining classification capability, the embodiment of the present invention has a much lower average running time than the traditional KNN algorithm, improving on its classification efficiency.
As the sample classification method of the embodiment shows, for each test sample, training samples are selected, according to the determined selection interval, from the sub-cluster whose cluster center is most similar to the test sample, and the test sample is classified using the selected training samples; this reduces the number of training samples used in the subsequent classification and improves sample classification efficiency in big-data environments. Determining the selection interval from the similarity and a preset threshold makes it easy to adjust which training samples are used for classification, giving good extensibility. Clustering the training sample set and determining the cluster center of each sub-cluster both preserves classification accuracy and reduces the number of training samples, improving classification efficiency. Compressing each sub-cluster before calculating its cluster center further reduces the number of training samples and further improves classification efficiency.
Fig. 5 is a schematic diagram of the main modules of a sample classification apparatus according to an embodiment of the present invention. As shown in Fig. 5, the sample classification apparatus 500 of the embodiment mainly includes:
A determining module 501, configured to calculate the similarity between the test sample and the cluster centers of multiple sub-clusters, and to determine a selection interval from the similarity and a preset threshold, where the sub-clusters are obtained by clustering the training sample set. Before any test sample is classified, a clustering algorithm is used to cluster the training sample set into multiple sub-clusters, and the cluster center of each sub-cluster is determined. The similarity can be calculated using Euclidean distance, cosine distance, Chebyshev distance, and so on; the preset threshold can be a value between 0 and the maximum similarity. The selection interval can be determined from the similarity and the preset threshold as follows: decrease the highest similarity by the preset threshold and use the result as the minimum of the selection interval; increase the highest similarity by the threshold and use the result as the maximum of the selection interval.
A selection module 502, configured to select, from the sub-cluster corresponding to the cluster center with the highest similarity, the training samples whose similarity to that cluster center falls within the selection interval. Taking similarity calculated with the Euclidean distance formula as an example, the training samples are selected as follows: from the sub-cluster corresponding to the cluster center with the smallest Euclidean distance to the test sample, select the training samples whose Euclidean distance to that cluster center falls within the selection interval.
A classification module 503, configured to use the selected training samples as a new training sample set to classify the test sample. Using the KNN algorithm, the K training samples nearest to the test sample are found in the new training sample set; the majority class among these K training samples is the class of the test sample. K can be user-defined and is generally set to an odd number; in the present invention it may be set to 3, 5, 7, and so on.
In addition, the sample classification apparatus 500 of the embodiment may further include a clustering determining module and a compression module (not shown in Fig. 5). The clustering determining module is configured to cluster the training sample set to obtain the multiple sub-clusters, and to determine the cluster center of each sub-cluster. The compression module is configured to compress each sub-cluster.
As can be seen from the above description, for each test sample, training samples are selected, according to the determined selection interval, from the sub-cluster whose cluster center is most similar to the test sample, and the test sample is classified using the selected training samples; this reduces the number of training samples used in the subsequent classification and improves sample classification efficiency in big-data environments. Determining the selection interval from the similarity and a preset threshold makes it easy to adjust which training samples are used for classification, giving good extensibility. Clustering the training sample set and determining the cluster center of each sub-cluster both preserves classification accuracy and reduces the number of training samples, improving classification efficiency. Compressing each sub-cluster before calculating its cluster center further reduces the number of training samples and further improves classification efficiency.
Fig. 6 shows an exemplary system architecture 600 to which the sample classification method or sample classification device of the embodiment of the present invention can be applied.
As shown in Fig. 6, system architecture 600 may include terminal devices 601, 602, 603, a network 604, and a server 605. Network 604 is the medium that provides communication links between terminal devices 601, 602, 603 and server 605. Network 604 may include various connection types, such as wired or wireless communication links or fiber optic cables.
A user can use terminal devices 601, 602, 603 to interact with server 605 through network 604 to receive or send messages and the like. Various communication client applications may be installed on terminal devices 601, 602, 603, such as shopping applications, web browser applications, search applications, instant messaging tools, email clients, and social platform software.
Terminal devices 601, 602, 603 may be various electronic devices that have a display screen and support web browsing, including but not limited to smartphones, tablet computers, laptop computers, and desktop computers.
Server 605 may be a server providing various services, for example a back-end management server that supports shopping websites browsed by users with terminal devices 601, 602, 603. The back-end management server may analyze and otherwise process received data such as product information query requests, and feed the processing result (for example, target push information or product information) back to the terminal device.
It should be noted that the sample classification method provided by the embodiment of the present application is generally executed by server 605; accordingly, the sample classification device is generally arranged in server 605.
It should be understood that the numbers of terminal devices, networks, and servers in Fig. 6 are merely illustrative. There may be any number of terminal devices, networks, and servers, as required by the implementation.
According to an embodiment of the present invention, the present invention also provides an electronic device and a computer-readable medium.
The electronic device of the present invention includes: one or more processors; and a storage device for storing one or more programs which, when executed by the one or more processors, cause the one or more processors to implement the sample classification method of the embodiment of the present invention.
The computer-readable medium of the present invention has a computer program stored thereon which, when executed by a processor, implements the sample classification method of the embodiment of the present invention.
Referring now to Fig. 7, a schematic structural diagram of a computer system 700 suitable for implementing the electronic device of the embodiment of the present invention is shown. The electronic device shown in Fig. 7 is only an example and should not impose any limitation on the function or scope of use of the embodiment of the present invention.
As shown in Fig. 7, computer system 700 includes a central processing unit (CPU) 701, which can perform various appropriate actions and processing according to a program stored in read-only memory (ROM) 702 or a program loaded from storage section 708 into random access memory (RAM) 703. RAM 703 also stores various programs and data required for the operation of computer system 700. CPU 701, ROM 702, and RAM 703 are connected to one another through bus 704. An input/output (I/O) interface 705 is also connected to bus 704.
The following components are connected to I/O interface 705: an input section 706 including a keyboard, a mouse, and the like; an output section 707 including a cathode ray tube (CRT) or liquid crystal display (LCD), a speaker, and the like; a storage section 708 including a hard disk and the like; and a communication section 709 including a network interface card such as a LAN card or a modem. Communication section 709 performs communication processing via a network such as the Internet. A drive 710 is also connected to I/O interface 705 as needed. A removable medium 711, such as a magnetic disk, optical disc, magneto-optical disk, or semiconductor memory, is mounted on drive 710 as needed, so that a computer program read from it can be installed into storage section 708 as needed.
In particular, according to the disclosed embodiments of the present invention, the process described above with reference to the main step diagram may be implemented as a computer software program. For example, the disclosed embodiments include a computer program product comprising a computer program carried on a computer-readable medium, the computer program containing program code for executing the method shown in the main step diagram. In such an embodiment, the computer program may be downloaded and installed from a network through communication section 709, and/or installed from removable medium 711. When the computer program is executed by central processing unit (CPU) 701, the above-described functions defined in the system of the present invention are performed.
It should be noted that the computer-readable medium in the present invention may be a computer-readable signal medium, a computer-readable storage medium, or any combination of the two. A computer-readable storage medium may be, for example but not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the above. More specific examples of computer-readable storage media include, but are not limited to: an electrical connection with one or more wires, a portable computer diskette, a hard disk, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fiber, portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above. In the present invention, a computer-readable storage medium may be any tangible medium that contains or stores a program that can be used by or in connection with an instruction execution system, apparatus, or device. A computer-readable signal medium, in the present invention, may include a data signal propagated in baseband or as part of a carrier wave, carrying computer-readable program code. Such a propagated data signal may take many forms, including but not limited to an electromagnetic signal, an optical signal, or any suitable combination of the above. A computer-readable signal medium may also be any computer-readable medium other than a computer-readable storage medium that can send, propagate, or transmit a program for use by or in connection with an instruction execution system, apparatus, or device. Program code contained on a computer-readable medium may be transmitted over any suitable medium, including but not limited to wireless, wire, optical cable, RF, or any suitable combination of the above.
The flowcharts and block diagrams in the drawings illustrate the architecture, functions, and operations of possible implementations of systems, methods, and computer program products according to various embodiments of the present invention. In this regard, each block in a flowchart or block diagram may represent a module, program segment, or portion of code, which contains one or more executable instructions for implementing the specified logical function. It should also be noted that, in some alternative implementations, the functions noted in the blocks may occur in an order different from that noted in the drawings. For example, two blocks shown in succession may in fact be executed substantially in parallel, or sometimes in the reverse order, depending on the functions involved. It should also be noted that each block in a block diagram or flowchart, and combinations of such blocks, may be implemented by a dedicated hardware-based system that performs the specified functions or operations, or by a combination of dedicated hardware and computer instructions.
The modules described in the embodiments of the present invention may be implemented in software or in hardware. The described modules may also be provided in a processor; for example, it may be described as: a processor includes a determining module, a selection module, and a classification module. The names of these modules do not, under certain circumstances, limit the modules themselves; for example, the determining module may also be described as "a module that calculates the similarity between a test sample and the cluster centers of multiple sub-clusters and determines a selection interval according to the similarity and a preset threshold".
In another aspect, the present invention also provides a computer-readable medium, which may be included in the device described in the above embodiments, or may exist separately without being assembled into the device. The computer-readable medium carries one or more programs which, when executed by the device, cause the device to: calculate the similarity between a test sample and the cluster centers of multiple sub-clusters, and determine a selection interval according to the similarity and a preset threshold, wherein the sub-clusters are obtained by clustering a training sample set; select, from the sub-cluster corresponding to the cluster center with the highest similarity, the training samples whose similarity to that cluster center falls within the selection interval; and use the selected training samples as a new training sample set to classify the test sample.
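The first two of these steps (determining the selection interval and selecting training samples) can be sketched as below. This is an illustrative reading under stated assumptions, not the applicant's code: cosine similarity is assumed because this passage does not name the similarity measure, the interval is taken as the highest similarity minus and plus the preset threshold (as in claim 2), and the names `cosine_similarity`, `select_training_samples`, and the `clusters` layout are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two coordinate tuples (an assumed similarity measure)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def select_training_samples(test_point, clusters, threshold,
                            similarity=cosine_similarity):
    """Pick training samples for one test sample.

    clusters maps a sub-cluster id to
    {"center": coords, "samples": [(coords, label), ...]} (hypothetical layout).
    Returns (best sub-cluster id, (low, high) interval, selected samples).
    """
    # Similarity between the test sample and every cluster center.
    sims = {cid: similarity(test_point, c["center"]) for cid, c in clusters.items()}
    best = max(sims, key=sims.get)
    # Selection interval: highest similarity minus / plus the preset threshold.
    low, high = sims[best] - threshold, sims[best] + threshold
    # Keep samples of the best sub-cluster whose similarity to its center
    # falls inside the interval.
    center = clusters[best]["center"]
    selected = [(p, label) for p, label in clusters[best]["samples"]
                if low <= similarity(p, center) <= high]
    return best, (low, high), selected


clusters = {
    "x": {"center": (1.0, 0.0),
          "samples": [((1.0, 0.0), "A"), ((3.0, 1.0), "A"), ((1.0, 1.0), "B")]},
    "y": {"center": (0.0, 1.0),
          "samples": [((0.0, 2.0), "B")]},
}
best, interval, new_training_set = select_training_samples((2.0, 0.0), clusters, 0.2)
print(best, interval, new_training_set)
```

The returned list plays the role of the new training sample set that is then handed to the classifier. Widening or narrowing `threshold` directly changes how many samples survive, which is the adjustability the description refers to.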
The above product can perform the method provided by the embodiment of the present invention and has the functional modules and beneficial effects corresponding to the method. For technical details not described in detail in this embodiment, refer to the method provided by the embodiment of the present invention.
The specific implementations above do not limit the scope of protection of the present invention. Those skilled in the art should understand that, depending on design requirements and other factors, various modifications, combinations, sub-combinations, and substitutions may occur. Any modification, equivalent substitution, or improvement made within the spirit and principles of the present invention shall fall within the scope of protection of the present invention.

Claims (12)

1. A sample classification method, characterized by comprising:
calculating the similarity between a test sample and the cluster centers of multiple sub-clusters, and determining a selection interval according to the similarity and a preset threshold, wherein the sub-clusters are obtained by clustering a training sample set;
selecting, from the sub-cluster corresponding to the cluster center with the highest similarity, the training samples whose similarity to that cluster center falls within the selection interval; and
using the selected training samples as a new training sample set to classify the test sample.
2. The method according to claim 1, characterized in that determining the selection interval according to the similarity and the preset threshold comprises:
reducing the highest similarity by the preset threshold and using the reduced value as the minimum of the selection interval; and
increasing the highest similarity by the threshold and using the increased value as the maximum of the selection interval.
3. The method according to claim 1 or 2, characterized in that, before the step of calculating the similarity between the test sample and the cluster centers of the multiple sub-clusters, the method further comprises:
clustering the training sample set to obtain the multiple sub-clusters; and
determining the cluster center of each sub-cluster.
4. The method according to claim 3, characterized in that, before the step of determining the cluster center of each sub-cluster, the method further comprises: compressing each sub-cluster;
and determining the cluster center of each sub-cluster comprises: determining the cluster center of each compressed sub-cluster.
5. The method according to claim 4, characterized in that determining the cluster center of each compressed sub-cluster comprises: calculating the coordinate average of all training samples in each compressed sub-cluster, the coordinate average being the coordinates of the cluster center of that compressed sub-cluster.
6. A sample classification device, characterized by comprising:
a determining module, configured to calculate the similarity between a test sample and the cluster centers of multiple sub-clusters, and to determine a selection interval according to the similarity and a preset threshold, wherein the sub-clusters are obtained by clustering a training sample set;
a selection module, configured to select, from the sub-cluster corresponding to the cluster center with the highest similarity, the training samples whose similarity to that cluster center falls within the selection interval; and
a classification module, configured to use the selected training samples as a new training sample set to classify the test sample.
7. The device according to claim 6, characterized in that the determining module is further configured to:
reduce the highest similarity by the preset threshold and use the reduced value as the minimum of the selection interval; and
increase the highest similarity by the threshold and use the increased value as the maximum of the selection interval.
8. The device according to claim 6 or 7, characterized in that the device further comprises: a clustering determining module, configured to cluster the training sample set to obtain the multiple sub-clusters, and to determine the cluster center of each sub-cluster.
9. The device according to claim 8, characterized in that the device further comprises: a compression module, configured to compress each sub-cluster;
the clustering determining module being further configured to determine the cluster center of each compressed sub-cluster.
10. The device according to claim 9, characterized in that the clustering determining module is further configured to: calculate the coordinate average of all training samples in each compressed sub-cluster, the coordinate average being the coordinates of the cluster center of that compressed sub-cluster.
11. An electronic device, characterized by comprising:
one or more processors; and
a storage device for storing one or more programs,
wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method according to any one of claims 1-5.
12. A computer-readable medium having a computer program stored thereon, characterized in that the program, when executed by a processor, implements the method according to any one of claims 1-5.
CN201810487963.7A 2018-05-21 2018-05-21 A kind of sample classification method and apparatus Pending CN108764319A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810487963.7A CN108764319A (en) 2018-05-21 2018-05-21 A kind of sample classification method and apparatus

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810487963.7A CN108764319A (en) 2018-05-21 2018-05-21 A kind of sample classification method and apparatus

Publications (1)

Publication Number Publication Date
CN108764319A true CN108764319A (en) 2018-11-06

Family

ID=64007388

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810487963.7A Pending CN108764319A (en) 2018-05-21 2018-05-21 A kind of sample classification method and apparatus

Country Status (1)

Country Link
CN (1) CN108764319A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20181106