CN108764319A - Sample classification method and apparatus
- Publication number
- CN108764319A (application CN201810487963.7A)
- Authority
- CN
- China
- Prior art keywords
- sub-cluster
- similarity
- cluster centre
- sample
- training sample
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
- G06F18/2413—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on distances to training or reference patterns
- G06F18/24147—Distances to closest patterns, e.g. nearest neighbour classification
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/23—Clustering techniques
- G06F18/232—Non-hierarchical techniques
- G06F18/2321—Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
Landscapes
- Engineering & Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Bioinformatics & Computational Biology (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Evolutionary Biology (AREA)
- Evolutionary Computation (AREA)
- Artificial Intelligence (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Life Sciences & Earth Sciences (AREA)
- Probability & Statistics with Applications (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
- Management, Administration, Business Operations System, And Electronic Commerce (AREA)
Abstract
The invention discloses a sample classification method and apparatus, and relates to the field of computer technology. One specific embodiment of the method includes: calculating the similarity between a test sample and the cluster centres of multiple sub-clusters, and determining a selection interval according to the similarity and a preset threshold, where the sub-clusters are obtained by clustering a training sample set; selecting, from the sub-cluster corresponding to the cluster centre with the highest similarity, the training samples whose similarity to that cluster centre falls within the selection interval; and using the selected training samples as a new training sample set to classify the test sample. For each test sample, the method selects training samples, according to the determined selection interval, from the sub-cluster whose cluster centre is most similar to the test sample, and classifies the test sample using only the selected samples. This reduces the number of training samples involved in the subsequent classification and improves sample classification efficiency in big-data environments.
Description
Technical field
The present invention relates to the field of computers, and in particular to a sample classification method and apparatus.
Background
The K-nearest-neighbour (KNN) algorithm is widely used in many fields, such as face recognition, gene classification and decision support, because it is simple and easy to implement. Its basic idea is: for a given test sample x, find the K nearest samples to x in the training sample set, and determine the class of x from the classes of those K nearest samples.
In the course of making the present invention, the inventors found at least the following problem in the prior art: when searching for the nearest samples, the KNN algorithm must compute the distance (or similarity) between the test sample and every training sample in the training sample set, one by one. When the training sample set is big data, this computation incurs very high overhead, making the algorithm very inefficient or even infeasible.
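For reference, the conventional KNN procedure described above can be sketched as follows. This is a minimal illustration rather than code from the patent: the names `euclidean` and `knn_classify` and the toy data are assumptions, and Euclidean distance stands in for the unspecified distance measure. Note the one distance computation per training sample, which is the cost the invention aims to reduce.

```python
import math
from collections import Counter

def euclidean(p, q):
    """Euclidean distance between two equal-length coordinate tuples."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def knn_classify(train, labels, x, k=3):
    """Brute-force KNN: one distance computation per training sample."""
    # Rank all training indices by distance to the test sample x.
    order = sorted(range(len(train)), key=lambda i: euclidean(train[i], x))
    # Majority vote among the k nearest training samples.
    votes = Counter(labels[i] for i in order[:k])
    return votes.most_common(1)[0][0]

# Toy data (hypothetical): two well-separated classes.
train = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (5.0, 5.0), (5.1, 4.9), (4.9, 5.2)]
labels = ["a", "a", "a", "b", "b", "b"]
prediction = knn_classify(train, labels, (0.3, 0.3))
print(prediction)
```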
Summary of the invention
In view of this, embodiments of the present invention provide a sample classification method and apparatus which, for each test sample, select training samples according to a determined selection interval from the sub-cluster whose cluster centre is most similar to the test sample, and classify the test sample using the selected training samples, thereby improving sample classification efficiency in big-data environments.
To achieve the above object, according to one aspect of the embodiments of the present invention, a sample classification method is provided.
A sample classification method according to an embodiment of the present invention includes: calculating the similarity between a test sample and the cluster centres of multiple sub-clusters, and determining a selection interval according to the similarity and a preset threshold, where the sub-clusters are obtained by clustering a training sample set; selecting, from the sub-cluster corresponding to the cluster centre with the highest similarity, the training samples whose similarity to that cluster centre falls within the selection interval; and using the selected training samples as a new training sample set to classify the test sample.
Optionally, determining the selection interval according to the similarity and the preset threshold includes: decreasing the highest similarity by the preset threshold and using the result as the minimum of the selection interval; and increasing the highest similarity by the threshold and using the result as the maximum of the selection interval.
Optionally, before the step of calculating the similarity between the test sample and the cluster centres of the multiple sub-clusters, the method further includes: clustering the training sample set to obtain the multiple sub-clusters; and determining the cluster centre of each sub-cluster.
Optionally, before the step of determining the cluster centre of each sub-cluster, the method further includes: compressing each sub-cluster. Determining the cluster centre of each sub-cluster then includes: determining the cluster centre of each compressed sub-cluster.
Optionally, determining the cluster centre of each compressed sub-cluster includes: calculating the coordinate mean of all training samples in each compressed sub-cluster, the coordinate mean being the coordinates of the cluster centre of that compressed sub-cluster.
To achieve the above object, according to another aspect of the embodiments of the present invention, a sample classification apparatus is provided.
A sample classification apparatus according to an embodiment of the present invention includes: a determining module, configured to calculate the similarity between a test sample and the cluster centres of multiple sub-clusters, and to determine a selection interval according to the similarity and a preset threshold, where the sub-clusters are obtained by clustering a training sample set; a selection module, configured to select, from the sub-cluster corresponding to the cluster centre with the highest similarity, the training samples whose similarity to that cluster centre falls within the selection interval; and a classification module, configured to use the selected training samples as a new training sample set to classify the test sample.
Optionally, the determining module is further configured to: decrease the highest similarity by the preset threshold and use the result as the minimum of the selection interval; and increase the highest similarity by the threshold and use the result as the maximum of the selection interval.
Optionally, the apparatus further includes: a clustering determination module, configured to cluster the training sample set to obtain the multiple sub-clusters, and to determine the cluster centre of each sub-cluster.
Optionally, the apparatus further includes: a compression module, configured to compress each sub-cluster; the clustering determination module is further configured to determine the cluster centre of each compressed sub-cluster.
Optionally, the clustering determination module is further configured to: calculate the coordinate mean of all training samples in each compressed sub-cluster, the coordinate mean being the coordinates of the cluster centre of that compressed sub-cluster.
To achieve the above object, according to yet another aspect of the embodiments of the present invention, an electronic device is provided.
An electronic device according to an embodiment of the present invention includes: one or more processors; and a storage device for storing one or more programs which, when executed by the one or more processors, cause the one or more processors to implement the sample classification method of the embodiments of the present invention.
To achieve the above object, according to yet another aspect of the embodiments of the present invention, a computer-readable medium is provided.
A computer-readable medium according to an embodiment of the present invention stores a computer program which, when executed by a processor, implements the sample classification method of the embodiments of the present invention.
One embodiment of the above invention has the following advantages or beneficial effects: for each test sample, training samples are selected, according to the determined selection interval, from the sub-cluster whose cluster centre is most similar to the test sample, and the test sample is classified using the selected training samples, which reduces the number of training samples involved in the subsequent classification and improves sample classification efficiency in big-data environments; determining the selection interval from the similarity and a preset threshold makes it easy to adjust which training samples are used for classification, giving good scalability; clustering the training sample set and determining the cluster centre of each sub-cluster both preserves classification accuracy and reduces the number of training samples, improving classification efficiency; and compressing each sub-cluster before calculating its cluster centre further reduces the number of training samples and further improves classification efficiency.
Further effects of the above non-conventional optional implementations are described below in conjunction with the specific embodiments.
Brief description of the drawings
The accompanying drawings are provided for a better understanding of the present invention and do not constitute an undue limitation of it. In the drawings:
Fig. 1 is a schematic diagram of the main steps of the sample classification method according to an embodiment of the present invention;
Fig. 2 is a schematic diagram of the main flow of the sample classification method according to an embodiment of the present invention;
Fig. 3 is a schematic diagram of the classification principle of the sample classification method of an embodiment of the present invention;
Fig. 4 is a schematic diagram of the classification result of the sample classification method of an embodiment of the present invention;
Fig. 5 is a schematic diagram of the main modules of the sample classification apparatus according to an embodiment of the present invention;
Fig. 6 is a diagram of an exemplary system architecture to which embodiments of the present invention can be applied;
Fig. 7 is a schematic structural diagram of a computer apparatus suitable for implementing the electronic device of the embodiments of the present invention.
Detailed description
Exemplary embodiments of the present invention are described below with reference to the accompanying drawings. Various details of the embodiments are included to aid understanding and should be regarded as merely exemplary. Accordingly, those of ordinary skill in the art should recognise that various changes and modifications can be made to the embodiments described herein without departing from the scope and spirit of the present invention. Likewise, descriptions of well-known functions and structures are omitted from the following description for clarity and conciseness.
Fig. 1 is a schematic diagram of the main steps of the sample classification method according to an embodiment of the present invention. As shown in Fig. 1, the sample classification method of the embodiment mainly includes the following steps:
Step S101: Calculate the similarity between the test sample and the cluster centres of multiple sub-clusters, and determine a selection interval according to the similarity and a preset threshold, where the sub-clusters are obtained by clustering the training sample set. Before any test sample is classified, a clustering algorithm must be used to cluster the training sample set into multiple sub-clusters, and the cluster centre of each sub-cluster must be determined. Similarity may be computed using Euclidean distance, cosine distance, Chebyshev distance, and so on; the preset threshold may be a value between 0 and the maximum similarity. The selection interval may be determined from the similarity and the preset threshold as follows: decrease the highest similarity by the preset threshold and use the result as the minimum of the selection interval; increase the highest similarity by the threshold and use the result as the maximum of the selection interval.
Step S102: From the sub-cluster corresponding to the cluster centre with the highest similarity, select the training samples whose similarity to that cluster centre falls within the selection interval. Taking similarity computed with the Euclidean distance formula as an example, the training samples are selected as follows: from the sub-cluster corresponding to the cluster centre with the minimum Euclidean distance, select the training samples whose Euclidean distance to the cluster centre of that sub-cluster lies within the selection interval.
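Steps S101 and S102 can be sketched together as below, assuming Euclidean distance is the similarity measure, so that "highest similarity" means "smallest distance". The function names `nearest_centre` and `select_samples` and the toy coordinates are illustrative assumptions, not part of the source.

```python
import math

def euclidean(p, q):
    """Euclidean distance between two equal-length coordinate tuples."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def nearest_centre(x, centres):
    """Return (index, distance) of the cluster centre closest to x."""
    dists = [euclidean(x, c) for c in centres]
    best = min(range(len(centres)), key=dists.__getitem__)
    return best, dists[best]

def select_samples(x, clusters, centres, m):
    """S101: build the interval [d - m, d + m]; S102: filter the nearest sub-cluster."""
    best, d = nearest_centre(x, centres)
    low, high = d - m, d + m
    chosen = [s for s in clusters[best]
              if low <= euclidean(s, centres[best]) <= high]
    return best, (low, high), chosen

# Toy data (hypothetical): two sub-clusters with known centres.
centres = [(0.0, 0.0), (10.0, 0.0)]
clusters = [[(1.0, 0.0), (4.0, 0.0), (0.0, 6.0), (-8.0, 0.0)], [(10.0, 1.0)]]
best, interval, chosen = select_samples((3.0, 4.0), clusters, centres, m=2.0)
print(best, interval, chosen)
```

Here the test sample lies at distance 5 from the first centre, so with m = 2 only the samples at distance 4 and 6 from that centre are kept.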
Step S103: Use the selected training samples as a new training sample set to classify the test sample. Using the KNN algorithm, the K training samples nearest to the test sample are found in the new training sample set, and the majority class among these K training samples is taken as the class of the test sample. K is user-configurable and is generally set to an odd number; in the present invention it may be set to 3, 5, 7, and so on.
Fig. 2 is a schematic diagram of the main flow of the sample classification method according to an embodiment of the present invention. As shown in Fig. 2, the sample classification method of the embodiment mainly includes the following steps:
Step S201: Cluster the training sample set using a clustering algorithm to obtain multiple sub-clusters. Common clustering algorithms can be used in the present invention, such as the K-means algorithm among partition-based clustering methods, Density-Based Spatial Clustering of Applications with Noise (DBSCAN) among density-based clustering methods, and the Gaussian mixture model (GMM) among model-based clustering methods, so as to maintain high classification accuracy. The sample set is divided into a training sample set and a test sample set according to a preset ratio, for example 7:3; the training sample set contains multiple training samples and the test sample set contains multiple test samples. Since the KNN algorithm itself has no training process, clustering the training sample set in this step effectively introduces one.
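As one possible illustration of the sub-clustering in step S201, the following is a minimal K-means (Lloyd's algorithm) sketch in plain Python. Seeding the centres with the first k points is an illustrative simplification chosen here for determinism, not something the source specifies, and the names `kmeans` and `mean` are hypothetical.

```python
import math

def euclidean(p, q):
    """Euclidean distance between two equal-length coordinate tuples."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def mean(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(c) / n for c in zip(*points))

def kmeans(points, k, iters=100):
    """Lloyd's algorithm; the first k points seed the centres (assumption)."""
    centres = [tuple(p) for p in points[:k]]
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        # Assignment step: each point joins its nearest centre.
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k), key=lambda c: euclidean(p, centres[c]))
            clusters[j].append(p)
        # Update step: recompute centres; an empty cluster keeps its old centre.
        new = [mean(c) if c else centres[i] for i, c in enumerate(clusters)]
        if new == centres:
            break  # assignments are stable
        centres = new
    return clusters, centres

# Toy data (hypothetical): two obvious groups.
points = [(0.0, 0.0), (10.0, 10.0), (0.0, 1.0), (1.0, 0.0), (10.0, 11.0), (11.0, 10.0)]
clusters, centres = kmeans(points, k=2)
print(clusters)
```

In practice a library implementation with better initialisation would normally be preferred; the sketch only shows the shape of the training step that the method adds in front of KNN.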
Step S202: Compress each sub-cluster to obtain a compressed cluster, and determine the cluster centre of each compressed cluster. Each sub-cluster is compressed using the condensed nearest neighbour method or the edited nearest neighbour method, so that the sub-cluster, while retaining as few training samples as possible, can still correctly classify all training samples in the sub-cluster with the KNN algorithm. The cluster centre of each compressed cluster is determined as follows: calculate the coordinate mean of all training samples in each compressed sub-cluster; this coordinate mean is the coordinates of the cluster centre of that compressed sub-cluster.
The condensed nearest neighbour method can greatly reduce the number of samples in a sample set. The algorithm proceeds as follows:
(1) The training set R is divided into two sample sets, A and B; initially, set A is empty.
(2) One sample is selected at random from the training set R and put into A, the remaining samples are put into B, and A is used to classify each sample in B. If sample i is classified correctly (the predicted class is the same as the sample's own class), it is put back into B; otherwise it is added to A.
(3) The above process is repeated until all samples in B are classified correctly.
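The three steps above can be sketched as follows. Two details are assumptions made for illustration: samples in B are classified by a 1-nearest-neighbour vote against A (the source says only that A is used to classify them), and a seeded random generator picks the initial sample so the run is repeatable. The names `condense` and `nearest_label` are hypothetical.

```python
import math
import random

def euclidean(p, q):
    """Euclidean distance between two equal-length coordinate tuples."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def nearest_label(store, x):
    """Label of the sample in `store` closest to x (1-NN, an assumption)."""
    return min(store, key=lambda s: euclidean(s[0], x))[1]

def condense(samples, seed=0):
    """samples: list of (point, label). Returns the condensed set A."""
    rng = random.Random(seed)
    rest = list(samples)                                # set B
    store = [rest.pop(rng.randrange(len(rest)))]        # set A: one random sample
    changed = True
    while changed:                                      # step (3): repeat until stable
        changed = False
        for s in list(rest):
            if nearest_label(store, s[0]) != s[1]:
                rest.remove(s)                          # misclassified by A:
                store.append(s)                         # move it from B to A
                changed = True
    return store

# Toy data (hypothetical): two well-separated classes.
samples = [((0.0, 0.0), "a"), ((0.0, 1.0), "a"), ((1.0, 0.0), "a"),
           ((10.0, 10.0), "b"), ((10.0, 11.0), "b"), ((11.0, 10.0), "b")]
condensed = condense(samples)
print(len(condensed))
```

For well-separated classes such as these, one representative per class is enough to classify every original sample correctly.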
The principle of the edited nearest neighbour method is as follows: given a training set R and a classification rule C, let S be the set of samples misclassified by rule C; these samples are deleted from the training set R, giving R = R - S. This process is repeated until a stopping criterion is met. When the process ends, all samples in the training set R are correctly classified by the classification rule C. The algorithm proceeds as follows:
(1) The training set R is randomly divided into N groups.
(2) For i = 1, 2, ..., N, the union of the other (N-1) groups is used as the training set to perform KNN classification on each sample in the i-th group. If a sample is misclassified, it is added to the set S.
(3) The samples in set S are deleted from the training set R, forming the new data set R = R - S. The above process is repeated until no misclassified sample has appeared in the most recent I iterations.
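A minimal sketch of the editing procedure above is given below. The parameter names `n_groups` (N), `patience` (I), `k` and `seed`, and the helper `knn_label`, are illustrative assumptions; the sketch also assumes N is at least 2 and that there are enough samples for the other groups to be non-empty.

```python
import math
import random
from collections import Counter

def euclidean(p, q):
    """Euclidean distance between two equal-length coordinate tuples."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def knn_label(train, x, k):
    """Majority label among the k samples of `train` nearest to x."""
    nearest = sorted(train, key=lambda s: euclidean(s[0], x))[:k]
    return Counter(lbl for _, lbl in nearest).most_common(1)[0][0]

def edit(samples, n_groups=5, k=1, patience=1, seed=0):
    """samples: list of (point, label). Returns the edited training set R."""
    rng = random.Random(seed)
    kept = list(samples)
    clean_rounds = 0
    while clean_rounds < patience:
        rng.shuffle(kept)                                   # step (1): random groups
        groups = [kept[i::n_groups] for i in range(n_groups)]
        wrong = []                                          # the set S
        for i, group in enumerate(groups):                  # step (2)
            others = [s for j, g in enumerate(groups) if j != i for s in g]
            wrong += [s for s in group if knn_label(others, s[0], k) != s[1]]
        kept = [s for s in kept if s not in wrong]          # step (3): R = R - S
        clean_rounds = 0 if wrong else clean_rounds + 1
    return kept

# Toy data (hypothetical): one mislabelled point near the "a" group.
noisy = ((6.0, 0.0), "b")
samples = ([((float(x), 0.0), "a") for x in range(4)] + [noisy]
           + [((float(x), 0.0), "b") for x in range(20, 24)])
edited = edit(samples)
print(len(edited))
```

With this data the mislabelled point is removed in the first round and the next round finds nothing to delete, so the loop stops.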
The compressed cluster obtained with the condensed nearest neighbour method is the sample set A described above; the compressed cluster obtained with the edited nearest neighbour method is the training set R after the misclassified samples have been deleted.
In a preferred embodiment, suppose a compressed cluster contains three training samples with coordinates (x1, y1), (x2, y2) and (x3, y3). The coordinates of the cluster centre of these three training samples are then ((x1+x2+x3)/3, (y1+y2+y3)/3).
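The coordinate mean generalises directly to any number of samples and dimensions. A short sketch, with the hypothetical helper name `centroid` and made-up coordinates:

```python
def centroid(points):
    """Coordinate-wise mean of a non-empty list of equal-length tuples."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

# Three samples (x1, y1), (x2, y2), (x3, y3) with illustrative values.
centre = centroid([(1.0, 2.0), (3.0, 4.0), (5.0, 9.0)])
print(centre)  # (3.0, 5.0)
```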
Step S203: In each compressed cluster, calculate the similarity between each training sample and the corresponding cluster centre, and sort the samples in a preset order. The order is user-defined. In a preferred embodiment, similarity is computed using Euclidean distance: for each compressed cluster, calculate the Euclidean distance from each training sample to the corresponding cluster centre and sort in ascending order.
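The source does not state why the distances are pre-sorted. One plausible benefit, sketched here as an assumption, is that the samples inside a distance interval can later be located by binary search instead of a full scan. The names `presort` and `samples_in_interval` are hypothetical.

```python
import bisect
import math

def euclidean(p, q):
    """Euclidean distance between two equal-length coordinate tuples."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def presort(cluster, centre):
    """Return (ascending distances to centre, samples in the same order)."""
    pairs = sorted((euclidean(s, centre), s) for s in cluster)
    return [d for d, _ in pairs], [s for _, s in pairs]

def samples_in_interval(dists, samples, low, high):
    """Samples whose precomputed distance lies in [low, high], via binary search."""
    lo = bisect.bisect_left(dists, low)     # first index with distance >= low
    hi = bisect.bisect_right(dists, high)   # first index with distance > high
    return samples[lo:hi]

# Toy compressed cluster (hypothetical) centred on the origin.
cluster = [(9.0, 0.0), (1.0, 0.0), (0.0, 6.0), (4.0, 0.0)]
dists, ordered = presort(cluster, (0.0, 0.0))
picked = samples_in_interval(dists, ordered, 3.0, 7.0)
print(dists, picked)
```

The sorting is done once per compressed cluster, before any test sample arrives, so each later query costs two binary searches plus the size of the result.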
Step S204: Calculate the similarity between the test sample and the cluster centre of each compressed cluster, to find the compressed cluster corresponding to the cluster centre with the highest similarity. In a preferred embodiment, similarity is computed using Euclidean distance: calculate the Euclidean distance between the test sample and the cluster centre of each compressed cluster, and find the compressed cluster corresponding to the cluster centre with the minimum Euclidean distance.
Step S205: Select at least one training sample from the compressed cluster corresponding to the cluster centre with the highest similarity. This step first determines the selection interval T from the similarity d between the test sample and the most similar cluster centre and the preset threshold m; it then selects, from the compressed cluster corresponding to that cluster centre, the training samples whose similarity to the cluster centre falls within the selection interval T.
In a preferred embodiment, suppose the minimum Euclidean distance is d and the preset threshold is m; the selection interval is then T = [d-m, d+m], with 0 < m < d. From the compressed cluster corresponding to the minimum Euclidean distance, all training samples whose Euclidean distance to the cluster centre of that compressed cluster lies in [d-m, d+m] are selected.
Step S206: Use the selected training samples as a new training sample set and classify the test sample with the KNN algorithm. The training samples in the new training sample set are ranked by their distance to the test sample, and the predicted class of the test sample is determined and output according to the configured K value and the majority class among the K nearest training samples. Steps S204 to S206 are repeated to classify every test sample in the test sample set.
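Steps S204 to S206 for a single test sample can be combined into one function, sketched below with Euclidean distance as the similarity measure. The function name `classify` and the toy data are assumptions. The fallback to the whole compressed cluster when the interval selects nothing is also an assumption added here; the source does not say how an empty selection is handled.

```python
import math
from collections import Counter

def euclidean(p, q):
    """Euclidean distance between two equal-length coordinate tuples."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def classify(x, clusters, centres, m, k=3):
    """clusters[i]: list of (point, label); centres[i]: centre of clusters[i]."""
    # S204: find the compressed cluster whose centre is nearest to x.
    dists = [euclidean(x, c) for c in centres]
    best = min(range(len(centres)), key=dists.__getitem__)
    d = dists[best]
    # S205: keep samples whose distance to that centre lies in [d - m, d + m].
    subset = [(p, y) for p, y in clusters[best]
              if d - m <= euclidean(p, centres[best]) <= d + m]
    if not subset:
        subset = clusters[best]  # assumption, not in the source
    # S206: KNN majority vote over the reduced training set.
    nearest = sorted(subset, key=lambda s: euclidean(s[0], x))[:k]
    return Counter(y for _, y in nearest).most_common(1)[0][0]

# Toy compressed clusters and their centres (hypothetical).
clusters = [
    [((0.0, 0.0), "a"), ((1.0, 0.0), "a"), ((0.0, 1.0), "b"), ((1.0, 1.0), "a")],
    [((10.0, 10.0), "b"), ((11.0, 10.0), "b"), ((10.0, 11.0), "b")],
]
centres = [(0.5, 0.5), (31 / 3, 31 / 3)]
predictions = [classify(x, clusters, centres, m)
               for x, m in [((0.9, 0.1), 0.3), ((11.5, 11.5), 1.0)]]
print(predictions)
```

Classifying a whole test set is then a loop over this function, as in the last line of the example.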
Fig. 3 is a schematic diagram of the classification principle of the sample classification method of an embodiment of the present invention. The diagram corresponds to steps S201 and S202. As shown in Fig. 3, the training sample set is divided into 5 sub-clusters, C1 to C5; each sub-cluster is compressed separately to obtain 5 compressed clusters, C'1 to C'5; and the cluster centre of each compressed cluster is calculated, giving O1 to O5. Suppose the training sample set contains 1000 training samples and is clustered with the K-means algorithm (with K = 5); if the samples are divided evenly, each sub-cluster then contains 200 training samples. If each sub-cluster is compressed with the condensed nearest neighbour algorithm (assuming a compression ratio of 10%), each compressed cluster contains only 20 training samples.
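The arithmetic of this example, and what it implies for the number of distance computations per test sample, can be checked with a few lines. The bound for the proposed method assumes, as in step S203, that distances from training samples to their cluster centre are computed in advance; the variable names are illustrative.

```python
n_train = 1000        # training samples in the example
n_clusters = 5        # K for K-means
compression = 0.10    # assumed compression ratio

per_cluster = n_train // n_clusters                  # samples per sub-cluster
per_compressed = round(per_cluster * compression)    # samples per compressed cluster

# Distance computations needed to classify one test sample:
brute_force = n_train                                # plain KNN: every training sample
proposed_max = n_clusters + per_compressed           # centres + at most one compressed cluster

print(per_cluster, per_compressed, brute_force, proposed_max)
```

Under these assumptions a test sample needs at most 25 distance computations instead of 1000, and fewer still when the selection interval excludes part of the compressed cluster.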
Fig. 4 is a schematic diagram of the classification result of the sample classification method of an embodiment of the present invention. As shown in Fig. 4, the black circle in the middle is the cluster centre O of the compressed cluster with the minimum Euclidean distance to the test sample, and the open circle is the test sample D; the Euclidean distance between the cluster centre O and the test sample D is d (the length of OD). On the line through the cluster centre O and the test sample D (and its extension), find the two points at distance m from the test sample D; call them E and F. Then draw two circles centred on O with radii equal to the lengths of OE and OF, so that the radii of the two circles are d-m and d+m respectively. The training samples in the compressed cluster whose distance to O lies between d-m and d+m (that is, between the two circles) are chosen as the new training sample set. In a preferred embodiment, the training samples inside the circle of radius m centred on the test sample D are chosen as the new training sample set.
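The two selection regions are related by the triangle inequality: a sample S with |DS| <= m satisfies |OD| - |DS| <= |OS| <= |OD| + |DS|, so it also lies between the circles of radius d-m and d+m around O. The circle around D therefore selects a subset of the ring around O. A small check of this relationship, with hypothetical helper names `in_ring` and `in_ball` and made-up coordinates:

```python
import math

def euclidean(p, q):
    """Euclidean distance between two equal-length coordinate tuples."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def in_ring(s, centre, d, m):
    """True if s lies between the circles of radius d - m and d + m around centre."""
    return d - m <= euclidean(s, centre) <= d + m

def in_ball(s, x, m):
    """True if s lies within distance m of the test sample x."""
    return euclidean(s, x) <= m

O, D, m = (0.0, 0.0), (5.0, 0.0), 2.0   # cluster centre, test sample, threshold
d = euclidean(O, D)
samples = [(4.0, 1.0), (0.0, 5.0), (6.5, 0.0), (1.0, 0.0)]

ring = [s for s in samples if in_ring(s, O, d, m)]
ball = [s for s in samples if in_ball(s, D, m)]
print(ring, ball)
```

The ring test can reuse distances to O that were computed ahead of time, whereas the ball test requires a fresh distance from each candidate to D.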
In another preferred embodiment, the order of steps S201 and S202 can be swapped: the training sample set is first compressed with the condensed nearest neighbour method or the edited nearest neighbour method to obtain a single compressed cluster, and a clustering algorithm is then applied to the compressed cluster to obtain multiple sub-clusters. The order of steps S203 and S204 can also be swapped: the similarity between the test sample and the cluster centre of each compressed cluster is calculated first, to find the compressed cluster corresponding to the cluster centre with the highest similarity; then, within that compressed cluster, the similarity of each training sample to the cluster centre is calculated and the samples are sorted in the preset order.
To verify the effectiveness of the sample classification method proposed by the present invention, we compared it experimentally with the traditional KNN algorithm, in terms of running time and test accuracy, on four UCI (University of California, Irvine) data sets. The four UCI data sets are Forest CoverType (a forest cover type data set), Skin Segmentation (a skin segmentation data set), Statlog (a German credit data set) and Cmc (a global snow-depth raster data set generated by the Canadian Meteorological Centre). The UCI data sets are machine learning data sets provided by the University of California, Irvine. Forest CoverType and Skin Segmentation are large data sets, Statlog is a medium-sized data set, and Cmc is a small data set. Table 1 gives the basic information of the four UCI data sets used in the experiments.
Table 1
Table 2 gives the running time of the present invention and the traditional KNN algorithm on Cmc; Table 3 gives the running time of the present invention and the traditional algorithm on Statlog; Table 4 gives the running time of the present invention and the traditional KNN algorithm on Forest CoverType; and Table 5 gives the running time of the present invention and the traditional KNN algorithm on Skin Segmentation. Running times in each table (including the running time on each test set and the average time) are in seconds (s). Table 6 compares the average test accuracy (%) of the present invention and the traditional KNN algorithm.
Table 2
Table 3
Table 4
Table 5
Table 6
The above experimental results show that, while maintaining classification ability, the embodiment of the present invention has an average running time far below that of the traditional KNN algorithm, improving on the classification efficiency of the traditional KNN algorithm.
As can be seen from the sample classification method of the embodiments of the present invention, for each test sample, training samples are selected, according to the determined selection interval, from the sub-cluster whose cluster centre is most similar to the test sample, and the test sample is classified using the selected training samples, which reduces the number of training samples involved in the subsequent classification and improves sample classification efficiency in big-data environments; determining the selection interval from the similarity and a preset threshold makes it easy to adjust which training samples are used for classification, giving good scalability; clustering the training sample set and determining the cluster centre of each sub-cluster both preserves classification accuracy and reduces the number of training samples, improving classification efficiency; and compressing each sub-cluster before calculating its cluster centre further reduces the number of training samples and further improves classification efficiency.
Fig. 5 is a schematic diagram of the main modules of the sample classification apparatus according to an embodiment of the present invention. As shown in Fig. 5, the sample classification apparatus 500 of the embodiment mainly includes:
A determining module 501, configured to calculate the similarity between the test sample and the cluster centres of multiple sub-clusters, and to determine a selection interval according to the similarity and a preset threshold, where the sub-clusters are obtained by clustering the training sample set. Before any test sample is classified, a clustering algorithm must be used to cluster the training sample set into multiple sub-clusters, and the cluster centre of each sub-cluster must be determined. Similarity may be computed using Euclidean distance, cosine distance, Chebyshev distance, and so on; the preset threshold may be a value between 0 and the maximum similarity. The selection interval may be determined from the similarity and the preset threshold as follows: decrease the highest similarity by the preset threshold and use the result as the minimum of the selection interval; increase the highest similarity by the threshold and use the result as the maximum of the selection interval.
A selection module 502, configured to select, from the sub-cluster corresponding to the cluster centre with the highest similarity, the training samples whose similarity to that cluster centre falls within the selection interval. Taking similarity computed with the Euclidean distance formula as an example, the training samples are selected as follows: from the sub-cluster corresponding to the cluster centre with the minimum Euclidean distance, select the training samples whose Euclidean distance to the cluster centre of that sub-cluster lies within the selection interval.
A classification module 503, configured to use the selected training samples as a new training sample set to classify the test sample. Using the KNN algorithm, the K training samples nearest to the test sample are found in the new training sample set, and the majority class among these K training samples is taken as the class of the test sample. K is user-configurable and is generally set to an odd number; in the present invention it may be set to 3, 5, 7, and so on.
In addition, the sample classification apparatus 500 of the embodiment of the present invention may further include a clustering determination module and a compression module (not shown in Fig. 5). The clustering determination module is configured to cluster the training sample set to obtain the multiple sub-clusters, and to determine the cluster centre of each sub-cluster. The compression module is configured to compress each sub-cluster.
As can be seen from the above description, for each test sample, training samples are selected, according to the determined selection interval, from the sub-cluster whose cluster centre is most similar to the test sample, and the test sample is classified using the selected training samples, which reduces the number of training samples involved in the subsequent classification and improves sample classification efficiency in big-data environments; determining the selection interval from the similarity and a preset threshold makes it easy to adjust which training samples are used for classification, giving good scalability; clustering the training sample set and determining the cluster centre of each sub-cluster both preserves classification accuracy and reduces the number of training samples, improving classification efficiency; and compressing each sub-cluster before calculating its cluster centre further reduces the number of training samples and further improves classification efficiency.
Fig. 6 shows an exemplary system architecture 600 to which the sample classification method or sample classification device of the embodiment of the present invention can be applied.
As shown in Fig. 6, system architecture 600 may include terminal devices 601, 602, 603, a network 604 and a server 605. Network 604 is the medium that provides communication links between terminal devices 601, 602, 603 and server 605. Network 604 may include various connection types, such as wired or wireless communication links, or fiber optic cables.
A user may use terminal devices 601, 602, 603 to interact with server 605 through network 604, in order to receive or send messages and the like. Various communication client applications may be installed on terminal devices 601, 602, 603, such as shopping applications, web browser applications, search applications, instant messaging tools, email clients and social platform software.
Terminal devices 601, 602, 603 may be various electronic devices that have a display screen and support web browsing, including but not limited to smartphones, tablet computers, laptop computers and desktop computers.
Server 605 may be a server that provides various services, for example a back-end management server that supports a shopping website browsed by a user with terminal devices 601, 602, 603. The back-end management server may analyze and otherwise process received data such as product information query requests, and feed the processing result (for example target push information or product information) back to the terminal device.
It should be noted that the sample classification method provided by the embodiments of the present application is generally executed by server 605; accordingly, the sample classification device is generally disposed in server 605.
It should be understood that the numbers of terminal devices, networks and servers in Fig. 6 are merely illustrative. There may be any number of terminal devices, networks and servers, according to implementation needs.
According to an embodiment of the present invention, the present invention also provides an electronic device and a computer-readable medium.
The electronic device of the present invention includes: one or more processors; and a storage device for storing one or more programs which, when executed by the one or more processors, cause the one or more processors to implement the sample classification method of the embodiment of the present invention.
The computer-readable medium of the present invention has a computer program stored thereon which, when executed by a processor, implements the sample classification method of the embodiment of the present invention.
Referring now to Fig. 7, it shows a schematic structural diagram of a computer system 700 suitable for implementing the electronic device of the embodiment of the present invention. The electronic device shown in Fig. 7 is only an example and should not impose any limitation on the function or scope of use of the embodiments of the present invention.
As shown in Fig. 7, computer system 700 includes a central processing unit (CPU) 701, which can perform various appropriate actions and processing according to a program stored in read-only memory (ROM) 702 or a program loaded from storage section 708 into random access memory (RAM) 703. RAM 703 also stores the various programs and data required for the operation of computer system 700. CPU 701, ROM 702 and RAM 703 are connected to one another by bus 704. An input/output (I/O) interface 705 is also connected to bus 704.
The following components are connected to I/O interface 705: an input section 706 including a keyboard, a mouse and the like; an output section 707 including a cathode ray tube (CRT) or liquid crystal display (LCD) and a speaker; a storage section 708 including a hard disk and the like; and a communication section 709 including a network interface card such as a LAN card or a modem. Communication section 709 performs communication processing via a network such as the Internet. A drive 710 is also connected to I/O interface 705 as needed. A removable medium 711, such as a magnetic disk, optical disc, magneto-optical disk or semiconductor memory, is mounted on drive 710 as needed, so that a computer program read from it can be installed into storage section 708 as needed.
In particular, according to the disclosed embodiments of the present invention, the process described above with reference to the main step diagram may be implemented as a computer software program. For example, the disclosed embodiments include a computer program product comprising a computer program carried on a computer-readable medium, the computer program containing program code for executing the method shown in the main step diagram. In such an embodiment, the computer program may be downloaded and installed from a network through communication section 709, and/or installed from removable medium 711. When the computer program is executed by central processing unit (CPU) 701, the above functions defined in the system of the present invention are performed.
It should be noted that the computer-readable medium described in the present invention may be a computer-readable signal medium, a computer-readable storage medium, or any combination of the two. A computer-readable storage medium may be, for example but not limited to, an electrical, magnetic, optical, electromagnetic, infrared or semiconductor system, apparatus or device, or any combination of the above. More specific examples of computer-readable storage media include, but are not limited to: an electrical connection with one or more wires, a portable computer disk, a hard disk, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fiber, portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above. In the present invention, a computer-readable storage medium may be any tangible medium that contains or stores a program, where the program may be used by or in connection with an instruction execution system, apparatus or device. Also in the present invention, a computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, carrying computer-readable program code. Such a propagated data signal may take many forms, including but not limited to an electromagnetic signal, an optical signal, or any suitable combination of the above. A computer-readable signal medium may also be any computer-readable medium other than a computer-readable storage medium that can send, propagate or transmit a program for use by or in connection with an instruction execution system, apparatus or device. Program code contained on a computer-readable medium may be transmitted over any suitable medium, including but not limited to wireless, wire, optical cable, RF, or any suitable combination of the above.
The flowcharts and block diagrams in the drawings illustrate the architecture, functions and operations of possible implementations of systems, methods and computer program products according to various embodiments of the present invention. In this regard, each box in a flowchart or block diagram may represent a module, a program segment, or a portion of code, which contains one or more executable instructions for implementing the specified logical function. It should also be noted that, in some alternative implementations, the functions noted in the boxes may occur in an order different from that noted in the drawings. For example, two boxes shown in succession may in fact be executed substantially in parallel, or sometimes in the reverse order, depending on the functions involved. It should also be noted that each box in a block diagram or flowchart, and combinations of boxes in a block diagram or flowchart, may be implemented by a dedicated hardware-based system that performs the specified functions or operations, or by a combination of dedicated hardware and computer instructions.
The modules described in the embodiments of the present invention may be implemented in software or in hardware. The described modules may also be provided in a processor; for example, this may be described as: a processor including a determining module, a selection module and a classification module. The names of these modules do not, in certain cases, limit the modules themselves; for example, the determining module may also be described as "a module that calculates the similarity between a test sample and the cluster centers of a plurality of sub-clusters and determines a selection interval according to the similarity and a preset threshold".
In another aspect, the present invention also provides a computer-readable medium, which may be included in the device described in the above embodiments, or may exist separately without being assembled into the device. The computer-readable medium carries one or more programs which, when executed by the device, cause the device to: calculate the similarity between a test sample and the cluster centers of a plurality of sub-clusters, and determine a selection interval according to the similarity and a preset threshold, where the sub-clusters are obtained by clustering a training sample set; select, from the sub-cluster corresponding to the cluster center with the highest similarity, the training samples whose similarity to that cluster center falls within the selection interval; and take the selected training samples as a new training sample set with which to classify the test sample.
The above product can execute the method provided by the embodiments of the present invention, and has the functional modules and beneficial effects corresponding to the method. For technical details not described in detail in this embodiment, refer to the method provided by the embodiments of the present invention.
The above specific embodiments do not limit the scope of protection of the present invention. Those skilled in the art should understand that various modifications, combinations, sub-combinations and substitutions may occur depending on design requirements and other factors. Any modification, equivalent substitution or improvement made within the spirit and principles of the present invention shall fall within the scope of protection of the present invention.
Claims (12)
1. A sample classification method, characterized by comprising:
calculating the similarity between a test sample and the cluster centers of a plurality of sub-clusters, and determining a selection interval according to the similarity and a preset threshold, wherein the sub-clusters are obtained by clustering a training sample set;
selecting, from the sub-cluster corresponding to the cluster center with the highest similarity, the training samples whose similarity to that cluster center falls within the selection interval; and
taking the selected training samples as a new training sample set, so as to classify the test sample.
2. The method according to claim 1, characterized in that determining the selection interval according to the similarity and the preset threshold comprises:
reducing the highest similarity by the preset threshold, and taking the reduced value as the minimum value of the selection interval; and
increasing the highest similarity by the threshold, and taking the increased value as the maximum value of the selection interval.
3. The method according to claim 1 or 2, characterized in that, before the step of calculating the similarity between the test sample and the cluster centers of the plurality of sub-clusters, the method further comprises:
clustering the training sample set to obtain the plurality of sub-clusters; and
determining the cluster center of each sub-cluster.
4. The method according to claim 3, characterized in that, before the step of determining the cluster center of each sub-cluster, the method further comprises: compressing each sub-cluster;
and determining the cluster center of each sub-cluster comprises: determining the cluster center of each compressed sub-cluster.
5. The method according to claim 4, characterized in that determining the cluster center of each compressed sub-cluster comprises: calculating the coordinate average of all training samples in each compressed sub-cluster, the coordinate average being the coordinates of the cluster center of that compressed sub-cluster.
6. A sample classification device, characterized by comprising:
a determining module, configured to calculate the similarity between a test sample and the cluster centers of a plurality of sub-clusters, and to determine a selection interval according to the similarity and a preset threshold, wherein the sub-clusters are obtained by clustering a training sample set;
a selection module, configured to select, from the sub-cluster corresponding to the cluster center with the highest similarity, the training samples whose similarity to that cluster center falls within the selection interval; and
a classification module, configured to take the selected training samples as a new training sample set, so as to classify the test sample.
7. The device according to claim 6, characterized in that the determining module is further configured to:
reduce the highest similarity by the preset threshold, and take the reduced value as the minimum value of the selection interval; and
increase the highest similarity by the threshold, and take the increased value as the maximum value of the selection interval.
8. The device according to claim 6 or 7, characterized in that the device further comprises: a clustering determining module, configured to cluster the training sample set to obtain the plurality of sub-clusters, and to determine the cluster center of each sub-cluster.
9. The device according to claim 8, characterized in that the device further comprises: a compression module, configured to compress each sub-cluster;
the clustering determining module is further configured to determine the cluster center of each compressed sub-cluster.
10. The device according to claim 9, characterized in that the clustering determining module is further configured to: calculate the coordinate average of all training samples in each compressed sub-cluster, the coordinate average being the coordinates of the cluster center of that compressed sub-cluster.
11. An electronic device, characterized by comprising:
one or more processors; and
a storage device for storing one or more programs,
wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method according to any one of claims 1-5.
12. A computer-readable medium having a computer program stored thereon, characterized in that the program, when executed by a processor, implements the method according to any one of claims 1-5.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810487963.7A CN108764319A (en) | 2018-05-21 | 2018-05-21 | A kind of sample classification method and apparatus |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810487963.7A CN108764319A (en) | 2018-05-21 | 2018-05-21 | A kind of sample classification method and apparatus |
Publications (1)
Publication Number | Publication Date |
---|---|
CN108764319A true CN108764319A (en) | 2018-11-06 |
Family
ID=64007388
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201810487963.7A Pending CN108764319A (en) | 2018-05-21 | 2018-05-21 | A kind of sample classification method and apparatus |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN108764319A (en) |
- 2018-05-21 CN CN201810487963.7A (CN108764319A), status: Pending
Cited By (11)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109682620A (en) * | 2018-12-06 | 2019-04-26 | 郭思 | A kind of appraisal procedure of domestic air conditioner refrigerating efficiency |
CN109682620B (en) * | 2018-12-06 | 2020-10-27 | 郭思 | Method for evaluating refrigeration efficiency of household air conditioner |
CN111767735A (en) * | 2019-03-26 | 2020-10-13 | 北京京东尚科信息技术有限公司 | Method, apparatus and computer readable storage medium for executing task |
CN110909824A (en) * | 2019-12-09 | 2020-03-24 | 天津开心生活科技有限公司 | Test data checking method and device, storage medium and electronic equipment |
CN110909824B (en) * | 2019-12-09 | 2022-10-28 | 天津开心生活科技有限公司 | Test data checking method and device, storage medium and electronic equipment |
WO2022121801A1 (en) * | 2020-12-07 | 2022-06-16 | 北京有竹居网络技术有限公司 | Information processing method and apparatus, and electronic device |
CN112508134A (en) * | 2021-02-02 | 2021-03-16 | 贝壳找房(北京)科技有限公司 | Method, device, medium and electronic equipment for measuring similarity between sets |
CN112508134B (en) * | 2021-02-02 | 2021-06-04 | 贝壳找房(北京)科技有限公司 | Method, device, medium and electronic equipment for measuring similarity between sets |
CN113590677A (en) * | 2021-07-14 | 2021-11-02 | 上海淇玥信息技术有限公司 | Data processing method and device and electronic equipment |
CN114418752A (en) * | 2022-03-28 | 2022-04-29 | 北京芯盾时代科技有限公司 | Method and device for processing user data without type label, electronic equipment and medium |
CN114662607A (en) * | 2022-03-31 | 2022-06-24 | 北京百度网讯科技有限公司 | Data annotation method, device and equipment based on artificial intelligence and storage medium |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN108764319A (en) | A kind of sample classification method and apparatus | |
WO2022126971A1 (en) | Density-based text clustering method and apparatus, device, and storage medium | |
US10547618B2 (en) | Method and apparatus for setting access privilege, server and storage medium | |
CN108090162A (en) | Information-pushing method and device based on artificial intelligence | |
CN108629823A (en) | The generation method and device of multi-view image | |
CN110827924B (en) | Clustering method and device for gene expression data, computer equipment and storage medium | |
CN108171191A (en) | For detecting the method and apparatus of face | |
CN108537291A (en) | A kind of sample classification method and apparatus | |
CN112463859B (en) | User data processing method and server based on big data and business analysis | |
CN112365202A (en) | Method for screening evaluation factors of multi-target object and related equipment thereof | |
CN111695840A (en) | Method and device for realizing flow control | |
CN110443264A (en) | A kind of method and apparatus of cluster | |
CN108615006A (en) | Method and apparatus for output information | |
CN111415196A (en) | Advertisement recall method, device, server and storage medium | |
CN108595211A (en) | Method and apparatus for output data | |
CN107968743A (en) | The method and apparatus of pushed information | |
CN110503117A (en) | The method and apparatus of data clusters | |
CN110263791A (en) | A kind of method and apparatus in identification function area | |
CN110298371A (en) | The method and apparatus of data clusters | |
CN111400663B (en) | Model training method, device, equipment and computer readable storage medium | |
CN113472860A (en) | Service resource allocation method and server under big data and digital environment | |
CN110532448B (en) | Document classification method, device, equipment and storage medium based on neural network | |
CN108062576B (en) | Method and apparatus for output data | |
CN109754273A (en) | The method and apparatus for promoting any active ues quantity | |
CN110019531A (en) | A kind of method and apparatus obtaining analogical object set |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | Application publication date: 20181106 |