CN115204323B - Seed multi-feature based clustering and synthesis method, system, device and medium - Google Patents

Publication number
CN115204323B
Authority
CN
China
Prior art keywords
classification number
sub
aggregation classification
calculating
aggregation
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202211125597.3A
Other languages
Chinese (zh)
Other versions
CN115204323A (en)
Inventor
邵德意
王俊华
尹合兴
刘祥杰
王永卡
祝明新
田冰川
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Huazhi Biotechnology Co ltd
Original Assignee
Huazhi Biotechnology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huazhi Biotechnology Co ltd
Priority to CN202211125597.3A
Publication of CN115204323A
Application granted
Publication of CN115204323B
Active legal status
Anticipated expiration

Abstract

The invention discloses a clustering and synthesis method, system, device and medium based on multiple seed features. A multi-feature vector set is first formed from the characteristic parameters of the seeds, so that multiple features serve as the classification standard. The maximum aggregation classification number is then obtained by a threshold method, and classification is carried out by a fuzzy clustering algorithm, which classifies the multi-feature seeds through cluster analysis without manual division. After the central vectors and sub-vector sets of all sub-clusters are obtained, the Bayesian information value of each aggregation classification number is calculated and the aggregation classification number with the maximum Bayesian information value is selected, which avoids unreasonable aggregation classification numbers. Finally, the central vector set of the optimal aggregation classification number is calculated, the spatial average Euclidean distance and the multi-dimensional spatial included angle are calculated from that central vector set, and the dispersion is calculated from them as the comprehensive output of seed classification, which improves accuracy.

Description

Seed multi-feature based clustering and synthesis method, system, device and medium
Technical Field
The invention relates to the technical field of feature classification, in particular to a seed multi-feature based clustering and synthesis method, system, equipment and medium.
Background
In the production, processing and circulation of crop seeds, the crop seed inspection regulations require that representative test samples from different seed batches be sampled and inspected to determine seed quality, seed health and whether the seeds are mixed.
Traditionally, sampling results are classified by difference analysis based on statistical significance. This approach can distinguish seed batches that differ strongly, but it generally analyzes a single index trait. The characteristic values of seeds of the same variety and batch are distributed continuously over a range, the characteristic values of seeds of different varieties may partially overlap, and the characteristic values also fluctuate with the year and with cultivation management. As a result, traditional classification of sampling results struggles to distinguish differences in characteristic data between and within seed varieties, and it mostly amounts to aggregation classification on a single characteristic value. In addition, the commonly used K-means clustering method is a hard classification method: points near the boundaries between clusters are easily misjudged, which reduces identifiability.
Disclosure of Invention
The present invention is directed to solving at least one of the problems of the prior art. Therefore, the invention provides a method, a system, equipment and a medium for clustering and integrating based on multiple characteristics of seeds, which can effectively improve the accuracy of seed classification.
In a first aspect, an embodiment of the present invention provides a method for clustering and synthesizing based on seed multiple features, including:
obtaining N seeds, extracting M characteristic parameters of each seed, and forming a data set from the K-th characteristic parameter of all the seeds; obtaining a vector set with a dimension of M and a length of N from the data sets; wherein N and M are integers, and K is an integer less than or equal to M;
determining the maximum aggregation classification number from the preset aggregation classification number according to a threshold method;
sequentially selecting integers from 1 to the maximum aggregation classification number as aggregation classification numbers, and carrying out an aggregation classification operation on the vector set according to each aggregation classification number and a fuzzy clustering algorithm, to obtain the central vectors and sub-vector sets of all sub-clusters for that aggregation classification number;
calculating the Bayesian information value of each aggregation classification number through the central vector and the sub-vector set of the sub-clusters;
selecting the aggregation classification number corresponding to the maximum Bayesian information value as the optimal aggregation classification number;
calculating to obtain a central vector set of the optimal aggregation classification number through the optimal aggregation classification number;
calculating the space average Euclidean distance and the multi-dimensional space included angle through the central vector set of the optimal aggregation classification number; calculating to obtain dispersion according to the space average Euclidean distance and the multi-dimensional space included angle;
and outputting the central vector set of the optimal aggregation classification number and the dispersion degree as the result of the seed clustering and synthesis.
The method provided by the embodiment of the invention has at least the following beneficial effects:
A multi-feature vector set is first formed from the characteristic parameters of the seeds, and multiple features are used as the classification standard, which improves classification accuracy. The maximum aggregation classification number is then obtained by a threshold method and classification is carried out by a fuzzy clustering algorithm, which classifies the multi-feature seeds through cluster analysis without manual division. After the central vectors and sub-vector sets of all sub-clusters are obtained, the Bayesian information value of each aggregation classification number is calculated and the aggregation classification number with the maximum Bayesian information value is selected, which avoids misclassification caused by an unreasonable aggregation classification number. Finally, the central vector set of the optimal aggregation classification number is calculated, the spatial average Euclidean distance and the multi-dimensional spatial included angle are calculated from that central vector set, and the dispersion is calculated from them. The central vector set of the optimal aggregation classification number and the dispersion are taken as the output. These two outputs make it easier to distinguish differences in characteristic data between and within seed varieties, which improves the accuracy of seed classification and addresses the difficulty of accounting for differences in seed characteristics among batches, among varieties and across storage times.
According to some embodiments of the present invention, after the K-th characteristic parameters of the N seeds are configured into a data set, the method further includes the steps of:
and carrying out median filtering on the data set to remove the outlier data.
According to some embodiments of the invention, the maximum aggregation classification number is determined from the preset aggregation classification number by a threshold formula (the formula appears only as an image in the source and is not reproduced here), wherein the terms of the formula are the preset aggregation classification number and a rounding function.
According to some embodiments of the invention, the fuzzy clustering algorithm stopping condition comprises: the iteration times exceed a threshold value or the variance of all data in an FIFO buffer area is less than 0.001, and the FIFO buffer area is used for storing the objective function value obtained by each iteration calculation of the fuzzy clustering algorithm.
According to some embodiments of the invention, the Bayesian information value is calculated by a formula (the formula appears only as an image in the source and is not reproduced here), wherein the terms of the formula are: the Bayesian information value of the aggregation classification number; the number of vector points of each indexed sub-cluster; the covariance of each indexed sub-cluster; the dimension; and a penalty factor.
According to some embodiments of the invention, the calculating a spatial average euclidean distance and a multidimensional spatial angle by the set of center vectors of the optimal aggregate classification number comprises:
calculating the spatially averaged euclidean distance by:
calculating Euclidean distance between the central vector of each sub-cluster and the central vectors of other sub-clusters;
averaging all Euclidean distances except the maximum Euclidean distance to obtain the average Euclidean distance of each sub-cluster;
averaging the average Euclidean distances of all the sub-clusters to obtain the spatial average Euclidean distance;
calculating the multi-dimensional spatial angle by:
calculating the average central point of the central vectors of all the sub-clusters, and calculating the included angle between the central point of the central vector of each sub-cluster and the average central point (the angle formula appears only as an image in the source and is not reproduced here; its terms are the included angle, the average central point, and the central point of the central vector of the sub-cluster);
and averaging the included angles between the central points of the central vectors of all the sub-clusters and the average central point to obtain the multi-dimensional spatial included angle.
According to some embodiments of the invention, the dispersion is calculated by a formula (the formula appears only as an image in the source and is not reproduced here), wherein the terms of the formula are the dispersion, the spatial average Euclidean distance, and the multi-dimensional spatial included angle.
In a second aspect, an embodiment of the present invention provides a system for clustering and synthesizing based on seed multi-features, including:
the data acquisition module is used for acquiring N seeds, extracting M characteristic parameters of each seed and respectively forming a data set by the Kth characteristic parameter of each seed; obtaining a vector set with a dimension of M and a length of N according to the data set; wherein N and M are integers, and K is an integer less than or equal to M;
the maximum aggregation classification number selection module is used for determining the maximum aggregation classification number from the preset aggregation classification number according to a threshold method;
the aggregation classification module is used for sequentially selecting integers from 1 to the maximum aggregation classification number as aggregation classification numbers, and carrying out an aggregation classification operation on the vector set according to each aggregation classification number and a fuzzy clustering algorithm, to obtain the central vectors and sub-vector sets of all sub-clusters for that aggregation classification number;
the Bayesian information acquisition module is used for calculating a Bayesian information value of each aggregation classification number through the central vector and the sub-vector set of the sub-clusters;
the optimal aggregation classification number selection module is used for selecting the aggregation classification number corresponding to the maximum Bayesian information value as the optimal aggregation classification number;
the clustering center vector assembly module is used for calculating the optimal aggregation classification number to obtain a center vector assembly of the optimal aggregation classification number;
the dispersion calculation module is used for calculating the space average Euclidean distance and the multi-dimensional space included angle through the central vector set of the optimal aggregation classification number; calculating to obtain dispersion according to the space average Euclidean distance and the multi-dimensional space included angle;
and the output module is used for outputting the central vector set of the optimal aggregation classification number and the dispersion as the result of the seed clustering and synthesis.
In a third aspect, an embodiment of the present invention provides an electronic device, including at least one control processor and a memory communicatively coupled to the at least one control processor; the memory stores instructions executable by the at least one control processor to enable the at least one control processor to perform the method of seed multi-feature based clustering and synthesis according to the first aspect.
In a fourth aspect, embodiments of the present invention provide a computer storage medium having stored thereon computer-executable instructions for causing a computer to perform the method for seed multi-feature based clustering and synthesis as described in the first aspect.
It should be noted that the beneficial effects between the second to fourth aspects of the present invention and the prior art are the same as the beneficial effects of the method for seed multi-feature based clustering and synthesis of the first aspect, and will not be described in detail herein.
Additional features and advantages of the invention will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention.
Drawings
The above and/or additional aspects and advantages of the present invention will become apparent and readily appreciated from the following description of the embodiments, taken in conjunction with the accompanying drawings of which:
FIG. 1 is a flow chart of a method for seed multi-feature based clustering and synthesis according to an embodiment of the present invention;
FIG. 2 is a block diagram of a system for seed multi-feature based clustering and synthesis according to an embodiment of the present invention;
FIG. 3 is a block diagram of an electronic device according to an embodiment of the invention;
FIG. 4 is a flow chart of the calculation of the spatial average Euclidean distance according to an embodiment of the present invention;
fig. 5 is a flowchart of a multi-dimensional spatial angle calculation according to an embodiment of the present invention.
Detailed Description
Reference will now be made in detail to embodiments of the present invention, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to the same or similar elements or elements having the same or similar function throughout. The embodiments described below with reference to the accompanying drawings are illustrative only for the purpose of explaining the present invention, and are not to be construed as limiting the present invention.
In the description of the present invention, if there are first, second, etc. described, it is only for the purpose of distinguishing technical features, and it is not understood that relative importance is indicated or implied or the number of indicated technical features is implicitly indicated or the precedence of the indicated technical features is implicitly indicated.
In the description of the present invention, it should be understood that the orientation or positional relationship referred to, for example, the upper, lower, etc., is indicated based on the orientation or positional relationship shown in the drawings, and is only for convenience of description and simplification of description, but does not indicate or imply that the device or element referred to must have a specific orientation, be constructed in a specific orientation, and be operated, and thus should not be construed as limiting the present invention.
In the description of the present invention, it should be noted that unless otherwise explicitly defined, terms such as arrangement, installation, connection and the like should be broadly understood, and those skilled in the art can reasonably determine the specific meanings of the above terms in the present invention in combination with the specific contents of the technical solutions.
Referring to fig. 1, in some embodiments of the present invention, a method for seed multi-feature based clustering and synthesis is provided, including:
Step S100, acquiring N seeds, extracting M characteristic parameters of each seed, and forming a data set from the K-th characteristic parameter of all the seeds; obtaining a vector set with a dimension of M and a length of N from the data sets; wherein N and M are integers, and K is an integer less than or equal to M;
Step S200, determining the maximum aggregation classification number from the preset aggregation classification number according to a threshold method;
Step S300, sequentially selecting integers from 1 to the maximum aggregation classification number as aggregation classification numbers, and carrying out an aggregation classification operation on the vector set according to each aggregation classification number and a fuzzy clustering algorithm, to obtain the central vectors and sub-vector sets of all sub-clusters for that aggregation classification number;
Step S400, calculating the Bayesian information value of each aggregation classification number through the central vectors and sub-vector sets of the sub-clusters;
Step S500, selecting the aggregation classification number corresponding to the maximum Bayesian information value as the optimal aggregation classification number;
Step S600, calculating the central vector set of the optimal aggregation classification number from the optimal aggregation classification number;
Step S700, calculating the spatial average Euclidean distance and the multi-dimensional spatial included angle through the central vector set of the optimal aggregation classification number; calculating the dispersion from the spatial average Euclidean distance and the multi-dimensional spatial included angle;
Step S800, outputting the central vector set of the optimal aggregation classification number and the dispersion as the result of seed clustering and synthesis.
In step S100 of this method embodiment, a multi-feature vector set is first formed from the characteristic parameters of the seeds, and multiple features are used as the classification standard, which improves classification accuracy. In step S200, the maximum aggregation classification number is obtained by a threshold method, which avoids having to determine it through manual sampling and detection. In step S300, fuzzy clustering is carried out for each aggregation classification number from 1 to the maximum aggregation classification number in sequence; the fuzzy clustering algorithm classifies the multi-feature seeds through cluster analysis without manual division. After the central vectors and sub-vector sets of all sub-clusters are obtained, the Bayesian information value of each aggregation classification number is calculated in step S400, and the aggregation classification number with the maximum Bayesian information value is selected in step S500, which avoids misclassification caused by an unreasonable aggregation classification number. In step S600, the central vector set of the optimal aggregation classification number is calculated. In step S700, the spatial average Euclidean distance and the multi-dimensional spatial included angle are calculated from that central vector set, and the dispersion is calculated from them. The central vector set of the optimal aggregation classification number and the dispersion are taken as the output. These two outputs make it easier to distinguish differences in characteristic data between and within seed varieties, which improves the accuracy of seed classification and addresses the difficulty of accounting for differences in seed characteristics among batches, among varieties and across storage times.
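The vector set of step S100 can be pictured as an N x M array with one row per seed and one column per characteristic parameter. The following is a minimal sketch under that assumption; the function name `build_vector_set` and the sample measurements (length, width, weight) are hypothetical and are not taken from the patent.

```python
import numpy as np


def build_vector_set(feature_sets):
    """Stack M per-feature data sets (each of length N) into an N x M array."""
    lengths = {len(s) for s in feature_sets}
    if len(lengths) != 1:
        raise ValueError("every feature data set must hold one value per seed")
    # Each data set becomes one column, so row n is the vector point of seed n.
    return np.column_stack([np.asarray(s, dtype=float) for s in feature_sets])


# Hypothetical measurements for N = 4 seeds and M = 3 characteristic parameters.
length_mm = [8.1, 7.9, 8.4, 8.0]
width_mm = [3.0, 3.1, 2.9, 3.2]
weight_mg = [25.0, 24.5, 26.1, 25.3]
vectors = build_vector_set([length_mm, width_mm, weight_mg])
```

Keeping one row per seed means the later clustering steps can treat each row as a single M-dimensional point.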
In some embodiments of the present invention, after forming a data set by the kth characteristic parameter of the N seeds, the method further includes the steps of:
and carrying out median filtering on the data set to remove the outlier data.
Specifically, a certain number of seeds are obtained, the number being N, and M characteristic parameters are obtained for each seed, so that each K-th characteristic parameter has its own data set (the symbols for the parameters and the data set appear only as images in the source). To prevent individual values in a data set from causing large deviations due to incidental factors, a median filtering algorithm is applied to the data set of each characteristic parameter to remove outlier data.
Median filtering is applied to each data set by an expression that appears only as an image in the source and is not reproduced here, wherein the expression uses a median operator and the filter order, and the filter order is an odd number.
The median filter removes outliers caused by incidental factors from the data set without damaging the main trend of the data set as a whole, so that the data set still reflects the data level of the seeds.
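Because the filtering expression itself is not recoverable from the source, the following is only a sketch of a conventional odd-order sliding median filter. The function name `median_filter`, the default order of 3 and the edge-replication padding at both ends are assumptions rather than details from the patent.

```python
import numpy as np


def median_filter(data, order=3):
    """Sliding median of odd window length `order`; output has the input length."""
    if order < 1 or order % 2 == 0:
        raise ValueError("the filter order must be a positive odd number")
    x = np.asarray(data, dtype=float)
    half = order // 2
    # Assumption: repeat the edge values so the first and last points get full windows.
    padded = np.pad(x, half, mode="edge")
    windows = np.lib.stride_tricks.sliding_window_view(padded, order)
    return np.median(windows, axis=1)


# A single outlier (9.0) in an otherwise flat data set is replaced by the local median.
filtered = median_filter([1.0, 1.0, 9.0, 1.0, 1.0], order=3)
```

A larger order suppresses longer runs of outliers but also smooths genuine variation, so the order would need to be chosen against real seed data.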
In some embodiments of the invention, the maximum aggregation classification number is determined from the preset aggregation classification number by a threshold formula (the formula appears only as an image in the source and is not reproduced here), wherein the terms of the formula are the preset aggregation classification number and a rounding function.
The maximum aggregation classification number is determined by a threshold method without manual determination, and the algorithm can calculate the maximum aggregation classification number according to the input classification number only by inputting an approximate classification number, so that the labor cost and the time cost are saved.
In some embodiments of the invention, the fuzzy clustering algorithm stopping condition comprises: the iteration times exceed a threshold value or the variance of all data in the FIFO buffer area is less than 0.001, and the FIFO buffer area is used for storing the objective function value obtained by each iteration calculation of the fuzzy clustering algorithm.
For each clustering operation, the stopping condition consists of a first condition and a second condition in a logical OR relationship:
First condition: the operation stops when the number of iterations exceeds 100.
Second condition: a FIFO buffer with a length of 30 points is defined to store the objective function value calculated in each iteration of the FCM algorithm (the fuzzy clustering algorithm used in this embodiment); the operation stops when the variance of all data in the FIFO buffer is less than 0.001.
These stopping conditions ensure that each aggregation classification operation performs sufficient clustering iterations and yields a relatively accurate clustering result.
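The two stopping conditions can be sketched around a textbook fuzzy c-means loop. This is an illustrative sketch, not the patented implementation: the fuzzifier value of 2, the random initial memberships, the fixed random seed, and the choice to test the variance only once the 30-point buffer is full are all assumptions, and the function name `fcm` is hypothetical.

```python
from collections import deque

import numpy as np


def fcm(vectors, n_clusters, fuzzifier=2.0, max_iter=100,
        fifo_len=30, var_tol=1e-3, seed=0):
    """Textbook fuzzy c-means with the iteration-count and FIFO-variance stops."""
    x = np.asarray(vectors, dtype=float)
    rng = np.random.default_rng(seed)
    # Assumption: random initial memberships, each row normalised to sum to 1.
    u = rng.random((x.shape[0], n_clusters))
    u /= u.sum(axis=1, keepdims=True)
    history = deque(maxlen=fifo_len)  # FIFO of recent objective function values
    for iteration in range(1, max_iter + 1):  # first condition: at most 100 iterations
        w = u ** fuzzifier
        centers = (w.T @ x) / w.sum(axis=0)[:, None]
        d2 = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        d2 = np.maximum(d2, 1e-12)  # guard against division by zero
        history.append(float((w * d2).sum()))
        # Second condition; assumption: only evaluated once the buffer is full.
        if len(history) == fifo_len and np.var(list(history)) < var_tol:
            break
        inv = d2 ** (-1.0 / (fuzzifier - 1.0))
        u = inv / inv.sum(axis=1, keepdims=True)
    return centers, u, iteration


# Two well-separated groups of four points each.
points = [[0, 0], [0, 1], [1, 0], [1, 1], [10, 10], [10, 11], [11, 10], [11, 11]]
centers, memberships, n_iter = fcm(points, n_clusters=2)
```

Checking the variance of a window of objective values, rather than a single step-to-step change, avoids stopping early on a brief plateau.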
In some embodiments of the invention, the Bayesian information value is calculated by a formula (the formula appears only as an image in the source and is not reproduced here), wherein the terms of the formula are: the Bayesian information value of the aggregation classification number; the number of vector points of each indexed sub-cluster; the covariance of each indexed sub-cluster; the dimension; and a penalty factor.
The best aggregate classification number is obtained by Bayesian information criterion calculation between the aggregate classification number 1 and the maximum aggregate classification number, and more accurate classification results are selected through Bayesian information, so that the accuracy and robustness of seed classification are improved.
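The patent's own Bayesian information formula is not recoverable from the source, so the sketch below substitutes a commonly used Gaussian-likelihood form of the Bayesian information criterion, built from the same ingredients the text names (points per sub-cluster, sub-cluster covariance, dimension, penalty factor) and written so that larger values are better. The exact expression, the parameter count, the small ridge term and the function name `bic_score` are all assumptions.

```python
import numpy as np


def bic_score(vectors, labels, penalty=1.0, ridge=1e-6):
    """Gaussian-likelihood BIC for a hard partition; larger is better.

    Assumption: a stand-in for the patent's formula, which is not recoverable.
    """
    x = np.asarray(vectors, dtype=float)
    labels = np.asarray(labels)
    n, d = x.shape
    clusters = np.unique(labels)
    log_lik = 0.0
    for c in clusters:
        pts = x[labels == c]
        n_i = len(pts)
        # Maximum-likelihood covariance of the sub-cluster; ridge keeps it invertible.
        cov = np.atleast_2d(np.cov(pts, rowvar=False, bias=True)) + ridge * np.eye(d)
        _, logdet = np.linalg.slogdet(cov)
        # Mixing-weight term plus the Gaussian log-likelihood at the ML estimates.
        log_lik += n_i * np.log(n_i / n)
        log_lik -= 0.5 * n_i * (d * np.log(2.0 * np.pi) + logdet + d)
    k = len(clusters)
    n_params = (k - 1) + k * d + k * d * (d + 1) / 2
    return log_lik - penalty * 0.5 * n_params * np.log(n)


points = [[0, 0], [0, 1], [1, 0], [1, 1], [10, 10], [10, 11], [11, 10], [11, 11]]
scores = {
    1: bic_score(points, [0] * 8),
    2: bic_score(points, [0, 0, 0, 0, 1, 1, 1, 1]),
}
best_k = max(scores, key=scores.get)  # aggregation classification number with maximum value
```

Hard labels here could come from taking the largest membership of each point after fuzzy clustering; that mapping is likewise an assumption.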
Referring to fig. 4 and 5, in some embodiments of the present invention, calculating the spatial average euclidean distance and the multi-dimensional spatial angle by the set of central vectors of the optimal aggregate classification number includes:
The spatial average Euclidean distance is calculated by:
Step S701, calculating the Euclidean distance between the central vector of each sub-cluster and the central vectors of the other sub-clusters.
Step S702, averaging all Euclidean distances except the maximum Euclidean distance to obtain the average Euclidean distance of each sub-cluster.
Step S703, averaging the average Euclidean distances of all the sub-clusters to obtain the spatial average Euclidean distance.
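Steps S701 to S703 can be sketched directly. The function name `spatial_average_distance` is hypothetical, and the handling of the two-cluster case (where dropping the maximum would leave nothing to average, so the single distance is kept) is an assumption the source does not address.

```python
import numpy as np


def spatial_average_distance(centers):
    """Mean over sub-clusters of the mean center-to-center distance, maximum excluded."""
    c = np.asarray(centers, dtype=float)
    k = len(c)
    if k < 2:
        raise ValueError("at least two sub-cluster centers are required")
    # S701: pairwise Euclidean distances between all center vectors.
    dist = np.linalg.norm(c[:, None, :] - c[None, :, :], axis=2)
    per_cluster = []
    for i in range(k):
        others = np.delete(dist[i], i)  # distances to the other centers
        if len(others) > 1:
            # S702: drop the largest distance before averaging.
            others = np.delete(others, np.argmax(others))
        per_cluster.append(others.mean())
    # S703: average the per-cluster averages.
    return float(np.mean(per_cluster))


# Centers on a 3-4-5 right triangle: per-cluster averages are 3, 3 and 4.
d_avg = spatial_average_distance([[0, 0], [3, 0], [0, 4]])
```

Dropping the largest distance per sub-cluster makes the result less sensitive to one center that lies far from all the others.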
The multi-dimensional spatial included angle is calculated by:
Step S707, calculating the average central point of the central vectors of all the sub-clusters, and calculating the included angle between the central point of the central vector of each sub-cluster and the average central point (the angle formula appears only as an image in the source and is not reproduced here; its terms are the included angle, the average central point, and the central point of the central vector of the sub-cluster).
Step S708, averaging the included angles between the central points of the central vectors of all the sub-clusters and the average central point to obtain the multi-dimensional spatial included angle.
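Since the angle formula is not recoverable, the sketch below assumes the conventional definition: the angle between two points treated as vectors from the origin, obtained from the arccosine of their normalised dot product. That definition, the use of radians, and the function name `mean_spatial_angle` are assumptions; a center at the origin would make the angle undefined and is not handled.

```python
import numpy as np


def mean_spatial_angle(centers):
    """Average angle (radians) between each sub-cluster center and the mean center."""
    c = np.asarray(centers, dtype=float)
    mean_center = c.mean(axis=0)  # S707: average central point
    # Assumption: angle taken between position vectors measured from the origin.
    cos = (c @ mean_center) / (np.linalg.norm(c, axis=1) * np.linalg.norm(mean_center))
    angles = np.arccos(np.clip(cos, -1.0, 1.0))  # clip guards against rounding error
    return float(angles.mean())  # S708: average of the included angles


# Two unit centers on the axes each sit 45 degrees from their mean (0.5, 0.5).
theta = mean_spatial_angle([[1.0, 0.0], [0.0, 1.0]])
```

The angle complements the distance measure: centers can be far apart along one direction yet nearly collinear with the mean, and the angle captures that difference.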
By calculating the dispersion of the spatial distances between the central points of the sub-clusters, the method provides a comprehensive basis for judging multi-feature seed classification while keeping computational complexity low and implementation straightforward. It integrates the dispersion of the central data points of the various features, distinguishes seed data differences between and within varieties, and provides a way to compare and classify seeds across varieties for sampling inspection and identification.
In some embodiments of the invention, the dispersion is calculated by a formula (the formula appears only as an image in the source and is not reproduced here), wherein the terms of the formula are the dispersion, the spatial average Euclidean distance, and the multi-dimensional spatial included angle.
When seeds need to be compared and classified across varieties or within a variety, the dispersion serves as a parameter of the classification result that helps analyze the comparison, so that more accurate and more targeted classification can be performed.
To facilitate understanding by those skilled in the art, one embodiment of the present invention provides a method for clustering and synthesizing based on seed multi-features, comprising the steps of:
the first step, data filtering:
obtaining a certain number of seeds, N, and obtaining M characteristic parameters of each seed
Figure 640902DEST_PATH_IMAGE060
Then, for the Kth feature parameter
Figure 857382DEST_PATH_IMAGE062
All have a data set
Figure 961604DEST_PATH_IMAGE064
. To exclude data sets due to incidental factors
Figure 852199DEST_PATH_IMAGE066
The individual data in (2) causes large deviation, and a median filter algorithm is adopted to filter the data set of each characteristic parameter to remove 'outlier' data.
Median filtering of the data set
Figure 242730DEST_PATH_IMAGE066
is performed as follows:
Figure 585986DEST_PATH_IMAGE052
wherein,
Figure 962741DEST_PATH_IMAGE054
denotes the median operation,
Figure 301318DEST_PATH_IMAGE056
denotes the filter order, and
Figure 862750DEST_PATH_IMAGE056
is an odd number.
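The filtering step can be illustrated with a minimal Python sketch. The patent's exact expression is contained in the formula image and is not reproduced here; the sketch assumes an ordinary sliding-window median of odd order, and the edge handling (repeating the first and last values) is an assumption made only so that the output keeps length N. The function name `median_filter` and the sample values are hypothetical.

```python
from statistics import median


def median_filter(values, order=3):
    """Sliding-window median filter over a 1-D list; `order` must be odd."""
    if order < 1 or order % 2 == 0:
        raise ValueError("filter order must be a positive odd integer")
    if not values:
        return []
    half = order // 2
    # Assumption: pad both ends by repeating the edge value so that the
    # filtered data set has the same length as the input.
    padded = [values[0]] * half + list(values) + [values[-1]] * half
    # Each output sample is the median of the window centred on it.
    return [median(padded[i:i + order]) for i in range(len(values))]


# Hypothetical data set for one characteristic parameter; 9.8 is an outlier.
seed_lengths = [5.1, 5.0, 9.8, 5.2, 5.1, 5.3]
print(median_filter(seed_lengths, order=3))
```

A larger order suppresses longer runs of outliers at the cost of smoothing genuine variation, which is one reason the order is left as a parameter here.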
Step two, aggregation classification:
First, the data sets
Figure 693302DEST_PATH_IMAGE066
are combined into a vector set with dimension M and length N
Figure 139327DEST_PATH_IMAGE068
in which each vector point is
Figure 830946DEST_PATH_IMAGE070
Then, a preset aggregation classification number is set manually
Figure 704224DEST_PATH_IMAGE072
and the maximum aggregation classification number is finally determined as:
Figure 615549DEST_PATH_IMAGE057
The aggregation classification number
Figure 865264DEST_PATH_IMAGE074
takes values from 1 to
Figure 444013DEST_PATH_IMAGE002
. For each value, a fuzzy clustering algorithm (the FCM algorithm in this embodiment) performs the aggregation classification operation on the set
Figure 488193DEST_PATH_IMAGE076
to obtain, for the
Figure 27758DEST_PATH_IMAGE078
-th sub-cluster, the center vector
Figure 674640DEST_PATH_IMAGE080
and the sub-vector set
Figure 248841DEST_PATH_IMAGE082
where
Figure 198343DEST_PATH_IMAGE084
Each clustering operation stops when either of the following two conditions is met (the two conditions are in a logical OR relationship):
Condition one: the operation stops when the number of iterations exceeds 100;
Condition two: a FIFO buffer with a length of 30 points is defined to store the objective function value calculated in each FCM iteration; the operation stops when the variance of all data in the FIFO is less than 0.001.
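The two stopping conditions can be sketched around a plain fuzzy c-means loop. This is an illustrative sketch, not the embodiment's implementation: the fuzzifier `m = 2`, the random initialisation, the hard-label assignment at the end, and the choice to test the variance only once the FIFO holds 30 values are all assumptions, and the names `fcm` and `rng_seed` are hypothetical.

```python
import random
from collections import deque
from statistics import pvariance


def fcm(points, c, m=2.0, max_iter=100, fifo_len=30, var_tol=1e-3, rng_seed=0):
    """Fuzzy c-means; stops on max_iter OR a flat objective history."""
    rng = random.Random(rng_seed)
    n, dim = len(points), len(points[0])
    # Random membership matrix, each row normalised to sum to 1.
    u = []
    for _ in range(n):
        row = [rng.random() + 1e-9 for _ in range(c)]
        total = sum(row)
        u.append([v / total for v in row])
    history = deque(maxlen=fifo_len)  # FIFO of objective function values
    exponent = 1.0 / (m - 1.0)
    for iteration in range(1, max_iter + 1):  # condition one: at most 100 iterations
        # Center of sub-cluster j: mean of the points weighted by u_ij^m.
        centers = []
        for j in range(c):
            w = [u[i][j] ** m for i in range(n)]
            w_sum = sum(w)
            centers.append([sum(w[i] * points[i][d] for i in range(n)) / w_sum
                            for d in range(dim)])
        # Squared distances, floored to avoid division by zero.
        d2 = [[max(sum((p[d] - ctr[d]) ** 2 for d in range(dim)), 1e-12)
               for ctr in centers] for p in points]
        objective = sum((u[i][j] ** m) * d2[i][j]
                        for i in range(n) for j in range(c))
        history.append(objective)
        # Standard membership update from inverse squared distances.
        for i in range(n):
            inv = [(1.0 / d2[i][j]) ** exponent for j in range(c)]
            inv_sum = sum(inv)
            u[i] = [v / inv_sum for v in inv]
        # Condition two: variance of the full FIFO below the tolerance.
        if len(history) == fifo_len and pvariance(history) < var_tol:
            break
    labels = [max(range(c), key=lambda j: u[i][j]) for i in range(n)]
    return centers, labels, iteration


pts = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0), (5.1, 5.0), (5.0, 5.1)]
centers, labels, iterations = fcm(pts, c=2)
print(labels, iterations)
```

Using `deque(maxlen=30)` gives the FIFO behaviour directly: once the buffer is full, appending a new objective value discards the oldest one.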
Finally, for each aggregation classification number
Figure 116882DEST_PATH_IMAGE074
the Bayesian information value is calculated
Figure 442822DEST_PATH_IMAGE086
Figure 871529DEST_PATH_IMAGE088
. The Bayesian information value is calculated by the following expression:
Figure 382145DEST_PATH_IMAGE058
wherein,
Figure 896303DEST_PATH_IMAGE015
represents the Bayesian information value of the aggregation classification number,
Figure 760353DEST_PATH_IMAGE017
represents the number of vector points of the
Figure 433780DEST_PATH_IMAGE019
-th sub-cluster,
Figure 725084DEST_PATH_IMAGE021
represents the covariance of the
Figure 726538DEST_PATH_IMAGE019
-th sub-cluster,
Figure 784493DEST_PATH_IMAGE023
represents the dimension, and
Figure 922213DEST_PATH_IMAGE025
represents the penalty factor.
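The exact expression is contained in the formula image and is not reproduced here, so the following Python sketch should be read only as one generic way of building a Bayesian-information-type score from the quantities named above (sub-cluster sizes, sub-cluster spread, dimension, and a penalty factor). It is not the patent's formula. As a simplifying assumption it replaces each sub-cluster covariance with a single spherical variance, and it uses a penalised log-likelihood in which a larger value is better. The names `bic_score` and `penalty` are hypothetical.

```python
from math import log, pi


def bic_score(points, labels, penalty=1.0):
    """Penalised log-likelihood of a hard clustering (larger is better)."""
    n_total, dim = len(points), len(points[0])
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    log_lik = 0.0
    for members in clusters.values():
        n_i = len(members)
        center = [sum(p[d] for p in members) / n_i for d in range(dim)]
        sq = sum((p[d] - center[d]) ** 2 for p in members for d in range(dim))
        # Assumption: one spherical variance per sub-cluster stands in for
        # the covariance; the floor keeps log() finite for tiny clusters.
        var = max(sq / (n_i * dim), 1e-12)
        log_lik += (n_i * log(n_i / n_total)
                    - 0.5 * n_i * dim * log(2 * pi * var)
                    - 0.5 * n_i * dim)
    # Free parameters: mixing weights, centers, and one variance per cluster.
    n_params = len(clusters) * (dim + 2) - 1
    return log_lik - penalty * 0.5 * n_params * log(n_total)


pts = [(0.0, 0.0), (0.2, 0.0), (0.0, 0.2), (0.2, 0.2),
       (5.0, 5.0), (5.2, 5.0), (5.0, 5.2), (5.2, 5.2)]
print(bic_score(pts, [0] * 8), bic_score(pts, [0, 0, 0, 0, 1, 1, 1, 1]))
```

On this hypothetical data the two-cluster labelling scores higher than the single-cluster labelling, which is the behaviour a selection rule based on the maximum value relies on. The variance floor makes single-point clusters look artificially good, so a practical implementation would need to handle that case explicitly.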
From the calculated Bayesian information values, the maximum Bayesian information value is determined, i.e.
Figure 649998DEST_PATH_IMAGE090
. The optimal aggregation classification number is obtained from the maximum Bayesian information value
Figure 496338DEST_PATH_IMAGE092
According to the optimal aggregation classification number
Figure 233350DEST_PATH_IMAGE092
the corresponding central vector set of the sub-clusters obtained by calculation is selected
Figure 350210DEST_PATH_IMAGE094
where
Figure 248896DEST_PATH_IMAGE096
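The selection itself is an arg-max over the candidate aggregation classification numbers. A minimal sketch, assuming the Bayesian information values have already been computed and stored in a dictionary keyed by aggregation classification number; the function name and the score values are hypothetical and purely illustrative:

```python
def best_aggregation_number(scores):
    """Return the aggregation classification number with the largest value."""
    if not scores:
        raise ValueError("no Bayesian information values supplied")
    # max() over the keys, ranked by their associated score.
    return max(scores, key=scores.get)


# Hypothetical Bayesian information values for 1, 2 and 3 sub-clusters.
scores = {1: -40.5, 2: 1.3, 3: -2.7}
print(best_aggregation_number(scores))
```

The center vectors computed for that aggregation classification number are then the ones carried forward to the dispersion calculation.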
Step three, dispersion calculation:
First, the spatial average Euclidean distance
Figure 224942DEST_PATH_IMAGE098
is calculated. For each center point
Figure 359120DEST_PATH_IMAGE080
, the Euclidean distances to the other center points
Figure 205854DEST_PATH_IMAGE100
are calculated,
Figure 541020DEST_PATH_IMAGE102
; the maximum Euclidean distance is removed and the remaining Euclidean distances are averaged to obtain the average Euclidean distance of center point
Figure 863417DEST_PATH_IMAGE080
, denoted
Figure 942231DEST_PATH_IMAGE104
; the average Euclidean distances of all center points are then averaged to obtain the spatial average Euclidean distance
Figure 472832DEST_PATH_IMAGE098
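The procedure just described (pairwise distances, drop the largest, average per center point, then average over all center points) can be sketched in Python as follows. The function name is hypothetical, and the handling of the case with only two center points, where dropping the maximum would leave nothing to average, is an assumption: the single distance is kept.

```python
from math import dist


def spatial_average_distance(centers):
    """Average, over all center points, of the trimmed mean distance to the others."""
    per_center = []
    for i, c in enumerate(centers):
        # Distances from this center point to every other center point.
        d = sorted(dist(c, other) for j, other in enumerate(centers) if j != i)
        if len(d) > 1:
            d = d[:-1]  # remove the maximum Euclidean distance
        if d:
            per_center.append(sum(d) / len(d))
    return sum(per_center) / len(per_center) if per_center else 0.0


# Hypothetical center vectors on a line: distances are 5, 5 and 10.
print(spatial_average_distance([(0, 0), (3, 4), (6, 8)]))
```

Dropping the largest distance per center point reduces the influence of a single far-away sub-cluster on the overall average.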
Then, for the cluster center vector set
Figure 572375DEST_PATH_IMAGE106
the mean center point is found
Figure 585331DEST_PATH_IMAGE108
and the included angle between each center point
Figure 936677DEST_PATH_IMAGE080
and the mean center point
Figure 148216DEST_PATH_IMAGE108
is calculated:
Figure 825185DEST_PATH_IMAGE110
The included angles between each center point and the mean center point are averaged to obtain the multi-dimensional space included angle
Figure 731961DEST_PATH_IMAGE112
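The angle expression is contained in the formula image and is not reproduced here. The Python sketch below assumes the usual definition of the angle between two vectors, the arccosine of their normalised dot product, with each center point treated as a position vector from the origin. That definition, the function name, and the convention of returning 0 for a zero-length vector are assumptions.

```python
from math import acos, degrees, sqrt


def mean_angle_to_mean_center(centers):
    """Average angle (radians) between each center vector and the mean center point."""
    k, dim = len(centers), len(centers[0])
    mean_pt = [sum(c[d] for c in centers) / k for d in range(dim)]

    def angle(a, b):
        norm_a = sqrt(sum(x * x for x in a))
        norm_b = sqrt(sum(x * x for x in b))
        if norm_a == 0.0 or norm_b == 0.0:
            return 0.0  # assumption: angle with a zero vector is taken as 0
        cos = sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)
        # Clamp to [-1, 1] to guard against floating-point overshoot.
        return acos(max(-1.0, min(1.0, cos)))

    return sum(angle(c, mean_pt) for c in centers) / k


# Hypothetical center vectors along two axes; each is 45 degrees from their mean.
print(degrees(mean_angle_to_mean_center([(1.0, 0.0), (0.0, 1.0)])))
```

Because the angle depends on direction only, it complements the distance measure: two sets of center points with the same spacing can still differ in how they fan out around the mean center point.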
Finally, the dispersion
Figure 775747DEST_PATH_IMAGE114
is obtained by combining the spatial average Euclidean distance
Figure 717158DEST_PATH_IMAGE098
and the average multi-dimensional space included angle
Figure 158504DEST_PATH_IMAGE112
, specifically expressed as follows:
Figure 146051DEST_PATH_IMAGE116
Referring to Fig. 2, an embodiment of the present invention further provides a seed multi-feature based clustering and synthesis system, which includes a data acquisition module 1001, a maximum aggregation classification number selection module 1002, an aggregation classification module 1003, a Bayesian information acquisition module 1004, an optimal aggregation classification number selection module 1005, a clustering center vector set module 1006, a dispersion calculation module 1007, and an output module 1008, wherein:
The data acquisition module 1001 is configured to acquire N seeds, extract M characteristic parameters of each seed, and form a data set from the K-th characteristic parameter of each seed; and to obtain a vector set with a dimension of M and a length of N according to the data sets; wherein N and M are integers, and K is an integer less than or equal to M.
The maximum aggregation classification number selection module 1002 is configured to determine the maximum aggregation classification number from the preset aggregation classification number according to a threshold method
Figure 104780DEST_PATH_IMAGE002
The aggregation classification module 1003 is configured to sequentially select integers from
Figure 900698DEST_PATH_IMAGE004
as the aggregation classification number, and to perform the aggregation classification operation on the vector set according to the aggregation classification number and a fuzzy clustering algorithm to obtain the central vectors and sub-vector sets of all sub-clusters of the aggregation classification number.
The Bayesian information acquisition module 1004 is configured to calculate the Bayesian information value of each aggregation classification number from the central vectors and sub-vector sets of the sub-clusters.
The optimal aggregation classification number selection module 1005 is configured to select the aggregation classification number corresponding to the maximum Bayesian information value as the optimal aggregation classification number.
The clustering center vector set module 1006 is configured to obtain, from the optimal aggregation classification number, the central vector set of the optimal aggregation classification number.
The dispersion calculation module 1007 is configured to calculate the spatial average Euclidean distance and the multi-dimensional space included angle from the central vector set of the optimal aggregation classification number, and to calculate the dispersion from the spatial average Euclidean distance and the multi-dimensional space included angle.
The output module 1008 is configured to output the central vector set of the optimal aggregation classification number and the dispersion as the result of seed clustering and synthesis.
It should be noted that, since the seed multi-feature based clustering and synthesis system in the present embodiment is based on the same inventive concept as the above-mentioned seed multi-feature based clustering and synthesis method, the corresponding contents in the method embodiment are also applicable to the present system embodiment and are not described in detail herein.
Referring to fig. 3, another embodiment of the present invention further provides an electronic device 6000, which may be any type of intelligent terminal, such as a mobile phone, a tablet computer, a personal computer, and the like.
Specifically, the electronic device 6000 includes one or more control processors 6001 and a memory 6002; Fig. 3 shows one control processor 6001 and one memory 6002 as an example. The control processor 6001 and the memory 6002 can be connected by a bus or by other means, as illustrated in Fig. 3.
The memory 6002, as a non-transitory computer-readable storage medium, can be used to store non-transitory software programs, non-transitory computer-executable programs, and modules, such as the program instructions/modules corresponding to the electronic device in an embodiment of the present invention.
The control processor 6001 executes the non-transitory software programs, instructions, and modules stored in the memory 6002 to perform the various functional applications and data processing of the seed multi-feature based clustering and synthesis method, that is, to implement the seed multi-feature based clustering and synthesis method of the above method embodiments.
The memory 6002 may include a program storage area and a data storage area, wherein the program storage area may store an operating system and an application program required for at least one function, and the data storage area may store data created by using the seed multi-feature based clustering and synthesis method, and the like. Further, the memory 6002 can include high-speed random access memory and can also include non-transitory memory, such as at least one disk storage device, flash memory device, or other non-transitory solid state storage device. In some embodiments, the memory 6002 optionally includes memory that is remotely located from the control processor 6001, and such remote memory can be coupled to the electronic device 6000 via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
Stored in the memory 6002 are one or more modules that, when executed by the one or more control processors 6001, perform a method for seed multi-feature based clustering and synthesis in the above-described method embodiments, such as performing the method steps of fig. 1, 4, and 5 described above.
The memory, which is a non-transitory computer readable storage medium, may be used to store non-transitory software programs as well as non-transitory computer executable programs. Further, the memory may include high speed random access memory, and may also include non-transitory memory, such as at least one disk storage device, flash memory device, or other non-transitory solid state storage device. In some embodiments, the memory optionally includes memory located remotely from the processor, and these remote memories may be connected to the processor through a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
It should be noted that, since an electronic device in the present embodiment is based on the same inventive concept as the above-mentioned method for clustering and synthesizing based on seed multi-features, the corresponding contents in the method embodiment are also applicable to the present apparatus embodiment, and are not described in detail herein.
An embodiment of the present invention also provides a computer-readable storage medium storing computer-executable instructions for performing the seed multi-feature based clustering and synthesis method of the above embodiments.
It should be noted that, since a computer-readable storage medium in the present embodiment and the above-mentioned method for clustering and synthesizing based on seed multi-features are based on the same inventive concept, the corresponding contents in the method embodiment are also applicable to the present apparatus embodiment, and are not described in detail herein.
One of ordinary skill in the art will appreciate that all or some of the steps, systems, and methods disclosed above may be implemented as software, firmware, hardware, and suitable combinations thereof. Some or all of the physical components may be implemented as software executed by a processor, such as a central processing unit, digital signal processor, or microprocessor, or as hardware, or as an integrated circuit, such as an application specific integrated circuit. Such software may be distributed on computer readable media, which may include computer storage media (or non-transitory media) and communication media (or transitory media). The term computer storage media includes volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of data such as computer readable instructions, data structures, program modules or other data, as is well known to those of ordinary skill in the art. Computer storage media includes, but is not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, digital versatile disks (DVD) or other optical disk storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store the desired data and which can be accessed by a computer. In addition, communication media typically embodies computer readable instructions, data structures, program modules or other data in a modulated data signal such as a carrier wave or other transport mechanism and includes any data delivery media as known to one of ordinary skill in the art.
In the description herein, references to the description of the term "one embodiment," "some embodiments," "an illustrative embodiment," "an example," "a specific example," or "some examples" or the like mean that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. In this specification, the schematic representations of the terms used above do not necessarily refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.
While embodiments of the invention have been shown and described, it will be understood by those of ordinary skill in the art that: various changes, modifications, substitutions and alterations can be made to the embodiments without departing from the principles and spirit of the invention, the scope of which is defined by the claims and their equivalents.

Claims (10)

1. A seed multi-feature based clustering and synthesis method, characterized in that the method comprises the following steps:
obtaining N seeds, extracting M characteristic parameters of each seed, and forming a data set from the K-th characteristic parameter of each seed; obtaining a vector set with a dimension of M and a length of N according to the data sets; wherein N and M are integers, and K is an integer less than or equal to M;
determining the maximum aggregation classification number from the preset aggregation classification number according to a threshold method
Figure DEST_PATH_IMAGE002
; from
Figure DEST_PATH_IMAGE004
, sequentially selecting integers as aggregation classification numbers, and carrying out an aggregation classification operation on the vector set according to the aggregation classification numbers and a fuzzy clustering algorithm to obtain central vectors and a sub-vector set of all sub-clusters of the aggregation classification numbers;
calculating the Bayesian information value of each aggregation classification number through the central vector and the sub-vector set of the sub-clusters;
selecting the aggregation classification number corresponding to the maximum Bayesian information value as the optimal aggregation classification number;
calculating to obtain a central vector set of the optimal aggregation classification number through the optimal aggregation classification number;
calculating the space average Euclidean distance and the multi-dimensional space included angle through the central vector set of the optimal aggregation classification number; calculating to obtain dispersion according to the space average Euclidean distance and the multi-dimensional space included angle;
and outputting the central vector set of the optimal aggregation classification number and the dispersion as the clustering and integrating result of the seeds.
2. The method for seed multi-feature based clustering and synthesis according to claim 1, wherein after the Kth feature parameters of N seeds are formed into a data set, the method further comprises the steps of:
and carrying out median filtering on the data set to remove the outlier data.
3. The seed multi-feature based clustering and synthesis method according to claim 1, wherein the maximum aggregate classification number is determined from preset aggregate classification numbers according to a threshold method
Figure DEST_PATH_IMAGE005
comprises:
Figure DEST_PATH_IMAGE007
wherein,
Figure DEST_PATH_IMAGE009
represents the preset aggregation classification number, and
Figure DEST_PATH_IMAGE011
represents a rounding function.
4. The method for seed multi-feature based clustering and synthesis according to claim 1, wherein the fuzzy clustering algorithm stopping condition comprises: the iteration times exceed a threshold value or the variance of all data in an FIFO buffer area is less than 0.001, and the FIFO buffer area is used for storing the objective function value obtained by each iteration calculation of the fuzzy clustering algorithm.
5. The method of claim 3, wherein the Bayesian information value calculation formula comprises:
Figure DEST_PATH_IMAGE013
wherein,
Figure DEST_PATH_IMAGE015
represents the Bayesian information value of the aggregation classification number,
Figure DEST_PATH_IMAGE017
represents the number of vector points of the
Figure DEST_PATH_IMAGE019
-th sub-cluster,
Figure DEST_PATH_IMAGE021
represents the covariance of the
Figure 562573DEST_PATH_IMAGE019
-th sub-cluster,
Figure DEST_PATH_IMAGE023
represents the dimension, and
Figure DEST_PATH_IMAGE025
represents the penalty factor.
6. The method of claim 5, wherein the calculating a spatial mean Euclidean distance and a multidimensional spatial angle through the set of center vectors of the optimal aggregate classification number comprises:
calculating the spatially averaged euclidean distance by:
calculating Euclidean distance between the central vector of each sub-cluster and the central vectors of other sub-clusters;
averaging all Euclidean distances except the maximum Euclidean distance to obtain the average Euclidean distance of each sub-cluster;
averaging the average Euclidean distances of all the sub-clusters to obtain the spatial average Euclidean distance;
calculating the multi-dimensional spatial angle by:
calculating the average central point of the central vectors of all the sub-clusters, and calculating the included angle between the central point of the central vector of each sub-cluster and the average central point:
Figure DEST_PATH_IMAGE027
wherein,
Figure DEST_PATH_IMAGE029
represents the included angle between the center point of the center vector of the sub-cluster and the average center point,
Figure DEST_PATH_IMAGE031
represents the average center point, and
Figure DEST_PATH_IMAGE033
represents the center point of the center vector of the sub-cluster;
and averaging the included angles between the central points of the central vectors of all the sub-clusters and the average central point to obtain the multi-dimensional space included angle.
7. The seed multi-feature based clustering and synthesis method of claim 6, wherein the dispersion is calculated by the following formula:
Figure DEST_PATH_IMAGE035
wherein,
Figure DEST_PATH_IMAGE037
represents the dispersion,
Figure DEST_PATH_IMAGE039
represents the spatial average Euclidean distance, and
Figure DEST_PATH_IMAGE041
represents the multi-dimensional space included angle.
8. A system for clustering and synthesis based on seed multi-features, comprising:
the data acquisition module is used for acquiring N seeds, extracting M characteristic parameters of each seed and respectively forming a data set by the Kth characteristic parameter of each seed; obtaining a vector set with a dimension of M and a length of N according to the data set; wherein N and M are integers, and K is an integer less than or equal to M;
a maximum aggregation classification number selection module for determining the maximum aggregation classification number from the preset aggregation classification number according to a threshold method
Figure 483909DEST_PATH_IMAGE002
;
an aggregation classification module for, from
Figure DEST_PATH_IMAGE042
, sequentially selecting integers as aggregation classification numbers, and performing an aggregation classification operation on the vector set according to the aggregation classification numbers and a fuzzy clustering algorithm to obtain central vectors and a sub-vector set of all sub-clusters of the aggregation classification numbers;
the Bayesian information acquisition module is used for calculating a Bayesian information value of each aggregation classification number through the central vector and the sub-vector set of the sub-clusters;
the optimal aggregation classification number selecting module is used for selecting the aggregation classification number corresponding to the maximum Bayesian information value as the optimal aggregation classification number;
the clustering center vector set module is used for obtaining, from the optimal aggregation classification number, the central vector set of the optimal aggregation classification number;
the dispersion degree calculation module is used for calculating a space average Euclidean distance and a multi-dimensional space included angle through the central vector set of the optimal aggregation classification number; calculating to obtain dispersion according to the space average Euclidean distance and the multi-dimensional space included angle;
and the output module is used for outputting the central vector set of the optimal aggregation classification number and the dispersion as the clustering and integrating result of the seeds.
9. An electronic device, characterized in that: comprising at least one control processor and a memory for communicative connection with the at least one control processor; the memory stores instructions executable by the at least one control processor to enable the at least one control processor to perform the method of seed multi-feature based clustering and synthesis of any one of claims 1 to 7.
10. A computer-readable storage medium, characterized in that: the computer-readable storage medium stores computer-executable instructions for causing a computer to perform the method for seed multi-feature based clustering and synthesis of any one of claims 1 to 7.
CN202211125597.3A 2022-09-16 2022-09-16 Seed multi-feature based clustering and synthesis method, system, device and medium Active CN115204323B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211125597.3A CN115204323B (en) 2022-09-16 2022-09-16 Seed multi-feature based clustering and synthesis method, system, device and medium

Publications (2)

Publication Number Publication Date
CN115204323A CN115204323A (en) 2022-10-18
CN115204323B true CN115204323B (en) 2022-12-02

Family

ID=83572502

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211125597.3A Active CN115204323B (en) 2022-09-16 2022-09-16 Seed multi-feature based clustering and synthesis method, system, device and medium

Country Status (1)

Country Link
CN (1) CN115204323B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116522183B (en) * 2023-02-01 2023-09-15 华智生物技术有限公司 Sub-class center point feature extraction method, system and equipment based on seed clustering

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111563549A (en) * 2020-04-30 2020-08-21 广东工业大学 Medical image clustering method based on multitask evolutionary algorithm
CN112329850A (en) * 2020-11-05 2021-02-05 国网四川省电力公司经济技术研究院 Load clustering device and method based on principal component dimension reduction and gap statistics
CN113837311A (en) * 2021-09-30 2021-12-24 南昌工程学院 Resident customer clustering method and device based on demand response data

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20100209886A1 (en) * 2009-02-18 2010-08-19 Gm Global Technology Operations, Inc. Driving skill recognition based on u-turn performance
RU2656708C1 (en) * 2017-06-29 2018-06-06 Самсунг Электроникс Ко., Лтд. Method for separating texts and illustrations in images of documents using a descriptor of document spectrum and two-level clustering
US11386713B2 (en) * 2020-05-06 2022-07-12 Motorola Solutions, Inc. Anomalous pose detection method and system
CN113159105B (en) * 2021-02-26 2023-08-08 北京科技大学 Driving behavior unsupervised mode identification method and data acquisition monitoring system
CN114139619A (en) * 2021-11-24 2022-03-04 北京华能新锐控制技术有限公司 Boiler combustion optimization control method and device based on improved K-means algorithm
CN114065819A (en) * 2021-11-26 2022-02-18 国网江苏省电力有限公司泰州供电分公司 Power utilization behavior analysis method and system based on multi-feature fusion and improved spectral clustering
CN114519310A (en) * 2022-02-24 2022-05-20 中国计量大学 Diagnosis method for aging fault of photovoltaic module

Also Published As

Publication number Publication date
CN115204323A (en) 2022-10-18

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant