CN115358349A - Data optimization clustering method - Google Patents


Info

Publication number
CN115358349A
Authority
CN
China
Prior art keywords
characteristic peak
data
row
line
slope
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202211277521.2A
Other languages
Chinese (zh)
Other versions
CN115358349B (en)
Inventor
计爱幼
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Ruilian Credit Data Technology Co ltd
Original Assignee
Jiangsu Yijiesi Information Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Jiangsu Yijiesi Information Technology Co ltd
Priority to CN202211277521.2A
Publication of CN115358349A
Application granted
Publication of CN115358349B
Legal status: Active
Anticipated expiration

Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T: CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00: Road transport of goods or passengers
    • Y02T10/10: Internal combustion engine [ICE] based vehicles
    • Y02T10/40: Engine management systems

Landscapes

  • Image Analysis (AREA)

Abstract

The invention relates to the field of data processing, and in particular to a data optimization clustering method. The method comprises: acquiring data to obtain a two-dimensional matrix; obtaining an optimal segmentation threshold by maximizing the between-class variance; segmenting the two-dimensional matrix with the optimal segmentation threshold to obtain a binary matrix; obtaining row and column accumulation sum curves from the binary matrix; obtaining row and column characteristic peaks from the row and column accumulation sum curves; analyzing each row and column characteristic peak to obtain its segmentation threshold; obtaining each row and column coordinate range from these segmentation thresholds; combining the row and column coordinate ranges to obtain each target area; setting an initial clustering center according to each target area; and performing cluster analysis on the binary matrix data based on the initial clustering centers to obtain a set of classes. Because initial clustering centers set in this way lie closer to the center point of each class, clustering computation efficiency is improved.

Description

Data optimization clustering method
Technical Field
The invention relates to the technical field of data processing, in particular to a data optimization clustering method.
Background
Traditional data processing usually relies on a clustering algorithm, and corresponding regulation and control are carried out by searching for dense areas of the data. In traditional cluster segmentation, a number of starting points are generally selected at random; for example, to prevent the initial seed points from being selected too densely, a grid method is used to select them. Data points are then clustered and merged around the initial seed points, which finally achieves data segmentation and clustering. However, selecting the initial seed points at random inevitably makes the algorithm computationally expensive, and the cluster segmentation result can only be reached after many iterations.
In view of this situation, the invention provides a data optimization clustering method in which the data are analyzed, row and column accumulation sum curves are constructed, and the positions of the initial seed points are obtained from the variation of these curves, so that cluster segmentation is fast, with fewer iterations and a smaller amount of calculation.
Disclosure of Invention
In order to solve the above technical problem, an object of the present invention is to provide a data optimization clustering method, which adopts the following technical solution:
a data optimization clustering method, the method comprising:
collecting data to obtain a two-dimensional matrix;
performing threshold segmentation on the two-dimensional matrix to obtain a foreground data point set and a background data point set; setting each data point of the two-dimensional matrix that belongs to the foreground data point set to 1 and each data point that belongs to the background data point set to 0, to obtain a binary matrix;
obtaining the abscissa range of each target according to the binary matrix, including: accumulating each row of data of the binary matrix to obtain a row accumulation sum, recording the sequence formed by the row accumulation sums of all rows of the binary matrix as a row accumulation sum sequence, and constructing a row accumulation sum curve according to the row accumulation sum sequence; dividing the row accumulation sum curve to obtain a plurality of row characteristic peaks; obtaining a first slope and a second slope of each row characteristic peak; obtaining a first segmentation threshold of each row characteristic peak according to its first slope, its second slope and the peak itself; obtaining a second segmentation threshold of each row characteristic peak according to its first segmentation threshold and the peak itself; and drawing, at the height of the second segmentation threshold of each row characteristic peak, a straight line parallel to the horizontal axis, wherein the straight line intersects the row characteristic peak at two points and the abscissas of the two points form the abscissa range of each target;
similarly, obtaining the ordinate range of each target according to the binary matrix; the abscissa range and the ordinate range of each target form the target range of that target;
and placing an initial clustering point in the target range of each target of the two-dimensional matrix, and performing mean shift clustering on the foreground data points of the two-dimensional matrix based on the initial clustering points to obtain all categories.
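The final step, mean shift clustering started from chosen initial points, can be illustrated with a minimal sketch. It assumes a flat (uniform) kernel and a user-supplied `bandwidth`; neither is specified in the text, and the function name `mean_shift_from_seeds` is hypothetical.

```python
import numpy as np

def mean_shift_from_seeds(points, seeds, bandwidth, max_iter=100, tol=1e-4):
    """Flat-kernel mean shift started from given seeds (illustrative sketch)."""
    points = np.asarray(points, dtype=float)
    centers = np.asarray(seeds, dtype=float).copy()
    for _ in range(max_iter):
        moved = 0.0
        for k, c in enumerate(centers):
            # Points inside the window of radius `bandwidth` around the centre.
            near = points[np.linalg.norm(points - c, axis=1) <= bandwidth]
            if len(near) == 0:
                continue  # empty window: leave this centre where it is
            new_c = near.mean(axis=0)
            moved = max(moved, float(np.linalg.norm(new_c - c)))
            centers[k] = new_c
        if moved < tol:
            break
    # Assign every point to its nearest converged centre.
    dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
    return centers, dists.argmin(axis=1)

pts = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10), (10, 11), (11, 10), (11, 11)]
centers, labels = mean_shift_from_seeds(pts, seeds=[(1, 1), (9, 9)], bandwidth=3.0)
```

With seeds already close to the dense regions, each centre settles after very few updates, which is the efficiency gain the method aims for.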
Preferably, the method for obtaining the foreground data point set and the background data point set by performing threshold segmentation on the two-dimensional matrix includes:
segmenting the two-dimensional matrix data with each of a number of different preset segmentation thresholds to obtain a first category and a second category, and calculating the between-class variance of the two categories, wherein the calculation formula is as follows:
$$g = \omega_1 \omega_2 (\mu_1 - \mu_2)^2$$
wherein $\omega_1$ and $\omega_2$ represent the proportions of the number of data in the first category and in the second category, respectively, to the total number of data in the two-dimensional matrix; $\mu_1$ and $\mu_2$ represent the mean of the data in the first category and the mean of the data in the second category, respectively; and $g$ represents the between-class variance of the first category and the second category;
each preset segmentation threshold corresponds to one between-class variance; the preset segmentation threshold corresponding to the maximum between-class variance is selected and recorded as the optimal segmentation threshold; all data points in the first category obtained by segmenting with the optimal segmentation threshold are recorded as the foreground data point set, and all data points in the second category obtained by segmenting with the optimal segmentation threshold are recorded as the background data point set.
Preferably, the method for obtaining the first slope and the second slope of each row characteristic peak includes:
obtaining the maximum value, the first minimum value and the second minimum value of each row characteristic peak, and calculating the first slope of each row characteristic peak from its maximum value and its first minimum value, wherein the calculation formula of the first slope is as follows:
$$k_1 = \frac{y_{\max} - y_{\min,1}}{x_{\max} - x_{\min,1}}$$
wherein $x_{\max}$ represents the abscissa of the maximum value of each row characteristic peak, $y_{\max}$ represents the ordinate of the maximum value of each row characteristic peak, $x_{\min,1}$ represents the abscissa of the first minimum value of each row characteristic peak, $y_{\min,1}$ represents the ordinate of the first minimum value of each row characteristic peak, and $k_1$ represents the first slope of each row characteristic peak;
and similarly, calculating the second slope of each row characteristic peak from its maximum value and its second minimum value.
Preferably, the method for obtaining the first segmentation threshold of each row characteristic peak according to its first slope, its second slope and the peak itself includes:
for the maximum value, the first minimum value and the second minimum value of each row characteristic peak, taking the larger of the first minimum value and the second minimum value and recording it as the larger minimum value;
calculating the first segmentation threshold $T_1$ of each row characteristic peak from the first slope $k_1$, the second slope $k_2$, the ordinate $y_{\max}$ of the maximum value and the ordinate $y_{\min}^{*}$ of the larger minimum value, wherein the calculation formula is as follows:
[Formula for $T_1$: given only as an image in the original publication]
wherein $y_{\min}^{*}$ represents the ordinate of the larger minimum value, $k_1$ represents the first slope of each row characteristic peak, $k_2$ represents the second slope of each row characteristic peak, $y_{\max}$ represents the ordinate of the maximum value, $T_1$ represents the first segmentation threshold of each row characteristic peak, and exp() represents the exponential function with the natural constant as base.
Preferably, the method for obtaining the second segmentation threshold of each row characteristic peak according to its first segmentation threshold and the peak itself includes: acquiring the maximum value of each row characteristic peak; and calculating the second segmentation threshold of each row characteristic peak from the maximum value of the peak and its first segmentation threshold, wherein the calculation formula is as follows:
[Formula for the second segmentation threshold: given only as an image in the original publication]
wherein $T_1$ represents the first segmentation threshold of each row characteristic peak, $y_{\max}$ represents the ordinate of the maximum value of each row characteristic peak, $\alpha$ represents a hyperparameter, $\beta$ represents an empirical scale factor, $N$ represents the total number of columns of the binary matrix, and exp() represents the exponential function with the natural constant as base.
The invention has the following beneficial effects: an embodiment of the invention obtains an optimal segmentation threshold by maximizing the between-class variance, and segments the acquired two-dimensional matrix data with it to obtain a binary matrix; row and column accumulation sum curves are obtained from the accumulation sums of the row and column data of the binary matrix; the segmentation threshold of each characteristic peak is obtained by analyzing each characteristic peak of the row and column accumulation sum curves; row and column coordinate ranges are obtained from these segmentation thresholds; all target ranges are obtained from the row and column coordinate ranges; and an initial clustering point is placed in each target range. Initial clustering points placed in this way lie closer to the center of each category, which saves clustering iteration time and improves clustering efficiency.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions and advantages of the prior art, the drawings used in the embodiments or the description of the prior art will be briefly described below, it is obvious that the drawings in the following description are only some embodiments of the present invention, and other drawings can be obtained by those skilled in the art without creative efforts.
FIG. 1 is a flowchart illustrating steps of a data optimized clustering method according to an embodiment of the present invention;
FIG. 2 is a statistical histogram of a data optimized clustering method according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of foreground data points of a data optimized clustering method according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of a row accumulation sum curve of a data optimized clustering method according to an embodiment of the present invention;
FIG. 5 is a schematic diagram of a column accumulation sum curve of a data optimized clustering method according to an embodiment of the present invention;
FIG. 6 is a schematic diagram of a smoothed row accumulation sum curve of a data optimized clustering method according to an embodiment of the present invention;
FIG. 7 is a schematic diagram of a smoothed column accumulation sum curve of a data optimized clustering method according to an embodiment of the present invention;
FIG. 8 is a schematic diagram of row characteristic peaks of a data optimized clustering method according to an embodiment of the present invention;
FIG. 9 is a schematic diagram of segmented row characteristic peaks of a data optimized clustering method according to an embodiment of the present invention;
FIG. 10 is a schematic diagram of a target range of a data optimized clustering method according to an embodiment of the present invention.
Detailed Description
To further illustrate the technical means and effects of the present invention adopted to achieve the predetermined objects, the following detailed description will be given for a data optimized clustering method according to the present invention, and its specific implementation, structure, features and effects thereof, with reference to the accompanying drawings and preferred embodiments. In the following description, different "one embodiment" or "another embodiment" refers to not necessarily the same embodiment. Furthermore, the particular features, structures, or characteristics may be combined in any suitable manner in one or more embodiments.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs.
The following describes a specific scheme of the data optimization clustering method provided by the invention in detail with reference to the accompanying drawings.
Referring to fig. 1, a flowchart of a data optimized clustering method according to an embodiment of the present invention is shown, where the method includes:
Step S001: acquiring data and constructing a two-dimensional matrix.
The acquired data are the data whose cluster centers need to be found; they may be pure numerical data or image data.
If the data are pure numerical data, they are constructed into a two-dimensional matrix so as to increase the relevance among the data. Take vibration signal data as an example: vibration signal data are one-dimensional time series data, and accurate abnormality detection is difficult when the one-dimensional vibration signal is analyzed directly. The method for obtaining the two-dimensional matrix from the one-dimensional time series data is as follows: the one-dimensional time series data are uniformly divided into $M$ subsequences of length $N$, and the $M$ subsequences of length $N$ are arranged into an $M \times N$ matrix, which is referred to as the two-dimensional matrix.
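As an illustration of this construction, the following sketch reshapes a one-dimensional series into rows of equal length. The function name `series_to_matrix` is hypothetical, and dropping trailing samples that do not fill a complete row is an assumption; the text does not say how a remainder is handled.

```python
import numpy as np

def series_to_matrix(series, n_cols):
    """Cut a 1-D series into consecutive subsequences of length n_cols, one per row."""
    series = np.asarray(series)
    n_rows = len(series) // n_cols
    # Assumption: trailing samples that do not fill a whole row are discarded.
    return series[: n_rows * n_cols].reshape(n_rows, n_cols)

signal = np.arange(10)            # stand-in for a vibration signal
matrix = series_to_matrix(signal, n_cols=4)
```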
If the data are image data, the image itself is already a two-dimensional matrix, and no two-dimensional matrix construction is performed. Take mildew detection on citrus as an example: a layer of white lime is applied manually to the surface of the citrus for insect prevention, and mildew produces white colonies; white points are obtained by threshold segmentation, dense areas are obtained by clustering, and mildew is judged by analyzing how dense the white pixels are.
Step S002: obtaining a binary matrix from the two-dimensional matrix, and determining initial clustering points from the binary matrix.
1. Segmenting the two-dimensional matrix to obtain a foreground data point set and a background data point set:
For the collected data, usually not all of the data are analyzed; it is often the foreground data that are analyzed. To achieve the final purpose, the foreground data points must therefore first be extracted from the collected data. For the data addressed by the invention, foreground data points and background data points differ markedly: in a vibration signal, for example, abnormal vibration values are clearly greater than normal ones, and lime or mildew on the surface of a citrus fruit is clearly distinguishable from the color of the peel. A bimodal method is therefore used to segment the two-dimensional matrix data: the data points in the two-dimensional matrix are counted to obtain a statistical histogram, as shown in FIG. 2, and the optimal segmentation threshold is obtained by maximizing the between-class variance, specifically as follows:
Obtain $n$ preset segmentation thresholds $t_1, t_2, \dots, t_n$. Taking one preset threshold $t$ as an example, the data in the two-dimensional matrix greater than $t$ are assigned to the first category and the data smaller than $t$ are assigned to the second category. Segmenting in the same way with each of the $n$ preset segmentation thresholds yields a first category and a second category for each threshold.
The between-class variance of the two categories is calculated from the first category and the second category, with the following formula:
$$g = \omega_1 \omega_2 (\mu_1 - \mu_2)^2$$
where $\omega_1$ and $\omega_2$ represent the proportions of the number of data in the first category and in the second category, respectively, to the total number of data in the two-dimensional matrix; $\mu_1$ and $\mu_2$ represent the mean of the data in the first category and the mean of the data in the second category, respectively; and $g$ represents the between-class variance of the first category and the second category.
Each preset threshold corresponds to one between-class variance, and the preset segmentation threshold corresponding to the maximum between-class variance is selected and recorded as the optimal segmentation threshold. All data points in the first category obtained with the optimal segmentation threshold are recorded as the foreground data point set, and all data points in the second category obtained with the optimal segmentation threshold are recorded as the background data point set. The foreground data points are shown in FIG. 3, in which the abscissa represents the normalized row number and the ordinate represents the normalized column number. The normalized values are obtained as follows: the serial number of each row is divided by the total number of rows to give the normalized row number, and the serial number of each column is divided by the total number of columns to give the normalized column number.
The value of each data point of the two-dimensional matrix in the foreground data point set is set to 1 and the value of each data point in the background data point set is set to 0, thereby obtaining the binary matrix.
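The threshold search and binarization described above can be sketched as follows. The sketch uses the standard two-class between-class variance, the product of the two class proportions and the squared difference of the class means. The function names, the choice of evenly spaced candidate thresholds, and the decision to put values equal to the threshold into the second category are all assumptions made for illustration.

```python
import numpy as np

def between_class_variance(data, t):
    """Between-class variance of the split data > t versus data <= t."""
    first = data[data > t]       # first category (foreground candidates)
    second = data[data <= t]     # second category; ties go here by assumption
    if first.size == 0 or second.size == 0:
        return 0.0
    w1 = first.size / data.size
    w2 = second.size / data.size
    return w1 * w2 * (first.mean() - second.mean()) ** 2

def binarize(matrix, n_thresholds=64):
    """Pick the candidate threshold with maximum variance and binarize with it."""
    data = np.asarray(matrix, dtype=float)
    # Assumption: candidates are evenly spaced over the data range.
    candidates = np.linspace(data.min(), data.max(), n_thresholds)
    scores = [between_class_variance(data, t) for t in candidates]
    best_t = candidates[int(np.argmax(scores))]
    return (data > best_t).astype(int), best_t

binary, best_t = binarize([[1, 1, 9], [1, 9, 9]])
```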
2. Constructing a row accumulation sum curve and a column accumulation sum curve:
For the foreground data points in the binary matrix, it is generally necessary to find the regions where the data points are densely distributed. For example, the more densely the data points with large vibration values are distributed, the greater the probability of an abnormality in the corresponding device; and among the white pixels of a citrus image, a mildew region usually appears as a cluster, whereas surface lime forms irregular sheet-like white regions or discretely distributed white regions. Since cluster segmentation generally requires finding the dense areas of foreground pixels, a row accumulation sum curve and a column accumulation sum curve are constructed as follows:
Accumulate each column of data in the binary matrix:
$$C_j = \sum_{i=1}^{M} b(i, j)$$
where $C_j$ represents the accumulation sum of all data in the $j$-th column of the binary matrix, $M$ represents the number of rows of the binary matrix, and $b(i, j)$ represents the value of the data point in the $i$-th row and $j$-th column of the binary matrix. All columns of the binary matrix together give the column accumulation sum sequence $\{C_1, C_2, \dots, C_N\}$, where $C_j$ represents the accumulation sum of each column of data of the binary matrix and $N$ represents the total number of columns of the binary matrix.
Similarly, accumulate each row of data points in the binary matrix:
$$R_i = \sum_{j=1}^{N} b(i, j)$$
where $R_i$ represents the accumulation sum of all data in the $i$-th row of the binary matrix, $N$ represents the number of columns of the binary matrix, and $b(i, j)$ represents the value of the data point in the $i$-th row and $j$-th column of the binary matrix. All rows of the binary matrix together give the row accumulation sum sequence $\{R_1, R_2, \dots, R_M\}$, where $R_i$ represents the accumulation sum of each row of the binary matrix and $M$ represents the number of rows.
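The two accumulation sum sequences are simply the sums of the binary matrix along each axis, as this short sketch shows. The small matrix is made up for illustration, and the normalized serial numbers assume that rows and columns are numbered from 1.

```python
import numpy as np

binary = np.array([[0, 1, 1, 0],
                   [0, 1, 1, 0],
                   [0, 0, 0, 0]])

row_sums = binary.sum(axis=1)   # one accumulation sum per row
col_sums = binary.sum(axis=0)   # one accumulation sum per column

# Normalized serial numbers used as abscissas (assumption: numbering starts at 1).
row_x = np.arange(1, binary.shape[0] + 1) / binary.shape[0]
col_x = np.arange(1, binary.shape[1] + 1) / binary.shape[1]
```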
Using the row accumulation sum sequence, a row accumulation sum curve is drawn with the normalized row number as the abscissa and the accumulation sum of each row as the ordinate, as shown in FIG. 4. Using the column accumulation sum sequence, a column accumulation sum curve is drawn with the normalized column number as the abscissa and the accumulation sum of each column as the ordinate, as shown in FIG. 5. Because the density of the data point distribution varies across the binary matrix, some areas are dense and others are sparse, and the sparse areas contain few data points or none at all, so the accumulation sum of some rows and columns is 0. The row and column accumulation sum curves therefore fluctuate strongly, which makes them difficult to analyze and introduces large errors. The curves are therefore smoothed: smoothing removes the small fluctuations while preserving the overall trend of each curve, and the smoothed-out parts have no great influence on the subsequent selection of initial points. Gaussian smoothing is applied to the row accumulation sum curve and the column accumulation sum curve; the smoothed row accumulation sum curve is shown in FIG. 6 and the smoothed column accumulation sum curve in FIG. 7.
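A minimal sketch of the Gaussian smoothing step is given below. The kernel width `sigma`, the truncation at three standard deviations and the edge padding are assumptions; the text only states that Gaussian smoothing is applied.

```python
import numpy as np

def gaussian_smooth(values, sigma=2.0):
    """Smooth a 1-D curve with a truncated, normalized Gaussian kernel."""
    values = np.asarray(values, dtype=float)
    radius = int(3 * sigma)                      # assumption: truncate at 3 sigma
    x = np.arange(-radius, radius + 1)
    kernel = np.exp(-(x ** 2) / (2 * sigma ** 2))
    kernel /= kernel.sum()
    # Repeat the end values so the output has the same length as the input.
    padded = np.pad(values, radius, mode="edge")
    return np.convolve(padded, kernel, mode="valid")

smoothed = gaussian_smooth([0, 0, 10, 0, 0, 0, 10, 0, 0], sigma=1.0)
```

A larger `sigma` removes more of the small fluctuations but also flattens narrow peaks, so it trades noise suppression against the resolution of nearby dense regions.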
3. Analyzing the row accumulation sum curve and the column accumulation sum curve to obtain the target areas:
Analysis shows that when a row or column contains many target data points, the target data points in the corresponding part of the binary matrix are likely to be densely distributed; this corresponds to the higher parts of the row and column accumulation sum curves. The rows or columns with the most or the fewest target data points correspond to the maximum and minimum points of the curves. When a maximum point of a curve is large, the corresponding row or column very probably passes through several dense areas. When a dense area exists in the binary matrix, a run of consecutive large values appears in the row and column accumulation sum sequences and the corresponding curve shows a peak; the higher the peak, the more likely it is that several dense clusters exist, and the more sparsely a dense cluster is distributed, the wider the corresponding peak. If the initial clustering points were determined only from the maximum points of the row and column accumulation sum curves, they would be inaccurate. It is instead desirable to take the distribution of target data points in the binary matrix into account and determine several initial clustering points, that is, to select several initial clustering points for a sparsely distributed dense cluster so that iteration reaches the dense center point more quickly. The curve is therefore first segmented at its local extreme points into a plurality of characteristic peaks, as follows:
(1) Obtaining characteristic peaks
The analysis is based on the smoothed row and column accumulation sum curves. Taking the row accumulation sum curve as an example: all minimum points of the row accumulation sum curve are obtained, and the curve is divided into a plurality of row characteristic peaks with the minimum points as division boundaries. The resulting row characteristic peaks are shown in FIG. 8.
In the same way, the column accumulation sum curve is divided to obtain a plurality of column characteristic peaks.
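The division at minimum points can be sketched on a sampled curve as follows. Treating the two curve ends as boundaries and discarding segments without an interior maximum are assumptions added so that every returned peak has one maximum between two boundary minima; the function name is hypothetical.

```python
import numpy as np

def split_into_peaks(curve):
    """Return (left, right) index pairs, each bounding one characteristic peak."""
    curve = np.asarray(curve, dtype=float)
    n = len(curve)
    # Interior local minima: not above the left neighbour, below the right one.
    minima = [i for i in range(1, n - 1)
              if curve[i] <= curve[i - 1] and curve[i] < curve[i + 1]]
    # Assumption: the first and last samples also act as boundaries.
    bounds = sorted(set([0] + minima + [n - 1]))
    peaks = []
    for left, right in zip(bounds[:-1], bounds[1:]):
        # Keep only segments that rise above both of their boundaries.
        if curve[left:right + 1].max() > max(curve[left], curve[right]):
            peaks.append((left, right))
    return peaks

peaks = split_into_peaks([0, 2, 5, 2, 1, 3, 6, 3, 0])
```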
(2) Determining a second segmentation threshold for each characteristic peak:
The method is described taking the row characteristic peaks as an example, and is specifically as follows:
Each row characteristic peak has one maximum point and two minimum points. The first slope is calculated from the maximum point and the first of the two minimum points:
$$k_1 = \frac{y_{\max} - y_{\min,1}}{x_{\max} - x_{\min,1}}$$
where $k_1$ represents the first slope of each row characteristic peak, $x_{\max}$ represents the abscissa of the maximum point of each row characteristic peak, $y_{\max}$ represents the ordinate of the maximum point, and $x_{\min,1}$ and $y_{\min,1}$ represent the abscissa and ordinate of the first minimum point.
Similarly, the second slope $k_2$ is calculated from the maximum point and the second of the two minimum points.
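The two slopes of one peak can be computed as in the sketch below, which takes each slope as the ordinate difference divided by the abscissa difference between the maximum point and the corresponding boundary minimum. The function name and the index-based peak description (`left`, `right`) are hypothetical, and the sketch assumes the maximum lies strictly between the two boundaries.

```python
import numpy as np

def peak_slopes(x, y, left, right):
    """Slopes from the peak maximum to its left and right boundary minima."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    top = left + int(np.argmax(y[left:right + 1]))   # index of the maximum point
    # Assumption: left < top < right, so neither denominator is zero.
    k1 = (y[top] - y[left]) / (x[top] - x[left])     # rising flank, positive
    k2 = (y[top] - y[right]) / (x[top] - x[right])   # falling flank, negative
    return k1, k2

k1, k2 = peak_slopes([0, 1, 2, 3, 4], [0, 2, 6, 3, 2], left=0, right=4)
```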
The threshold is selected adaptively according to the slopes. The larger the slope of a characteristic peak, the more concentrated the distribution of target data points in the binary matrix, and the smaller the corresponding initial point selection range can be. Conversely, the smaller the slope, the more dispersed the distribution of target data points in the binary matrix; in that case more initial clustering points need to be selected to increase the clustering speed, so the corresponding initial point selection range can be larger. The first segmentation threshold of each row characteristic peak is calculated from its first slope, its second slope and the peak itself:
[Formula for $T_1$: given only as an image in the original publication]
where $T_1$ represents the first segmentation threshold of each row characteristic peak, $k_1$ and $k_2$ represent the first slope and the second slope of each row characteristic peak, respectively, $y_{\max}$ represents the ordinate of the maximum point of each row characteristic peak, and $y_{\min}^{*}$ represents the ordinate of the larger of the two local minimum points. The smaller the slope of a characteristic peak, the wider the peak, that is, the more likely the corresponding region is a sparsely distributed dense cluster.
At this time, the initial segmentation is completed, for each characteristic peak, the different heights of the waves represent different numbers of corresponding columns and rows and target data points, the higher the height of the wave is, it is indicated that a plurality of dense clusters are more likely to exist in the same column, at this time, the initial segmentation threshold is adjusted according to the height of the wave, the more the clusters are, the larger the considered range is, and the region obtained by the initial segmentation cannot well contain most data points in the sparse clusters, so that the adjustment is needed at this time, specifically, the following steps are performed:
[Equation image not reproduced: second segmentation threshold of each row characteristic peak]
wherein the symbols of the equation denote, in order: the second segmentation threshold of each row characteristic peak; the first segmentation threshold of each row characteristic peak; the ordinate of the maximum point of each row characteristic peak; the number of columns of the binary matrix; a hyper-parameter, whose empirical value is given only as an image in the original; and an empirical scaling factor, whose empirical value is likewise given only as an image.
Similarly, a second segmentation threshold of each column characteristic peak is obtained from each column characteristic peak.
Obtaining each target area according to the second segmentation threshold value of each characteristic peak
The row coordinate range of each row characteristic peak is obtained from the peak and the row accumulation sum curve: a straight line parallel to the abscissa axis is drawn through the second segmentation threshold of each row characteristic peak on the row accumulation sum curve. This line intersects the row characteristic peak at two points, whose coordinates are given only as an image in the original. The characteristic peak segmented with the second segmentation threshold is shown schematically in Fig. 9. The range between the abscissas of the two intersection points is taken as the row coordinate range of each row characteristic peak. Similarly, the column coordinate range of each column characteristic peak is obtained from each column characteristic peak and the column accumulation sum curve.
One target area is obtained from each pairing of a row coordinate range with a column coordinate range, so that all row coordinate ranges and column coordinate ranges together yield a plurality of target areas, shown schematically in Fig. 10.
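The intersection step described above can be sketched as follows. This is a minimal illustration in Python, not the patented implementation: the function name `peak_range`, the sample curve and the threshold values are hypothetical, and the intersection points are approximated on integer indices by walking outward from the peak maximum until the curve drops below the threshold.

```python
def peak_range(curve, peak_index, threshold):
    """Return (left, right) indices around peak_index where curve >= threshold.

    Discrete stand-in for intersecting a horizontal line at `threshold`
    with one characteristic peak of an accumulation sum curve.
    """
    if curve[peak_index] < threshold:
        raise ValueError("threshold lies above the peak maximum")
    left = peak_index
    while left > 0 and curve[left - 1] >= threshold:
        left -= 1
    right = peak_index
    while right < len(curve) - 1 and curve[right + 1] >= threshold:
        right += 1
    return left, right


# Hypothetical row accumulation sum curve with two peaks (at index 4 and 10).
row_sums = [0, 1, 3, 7, 9, 6, 2, 0, 0, 4, 8, 5, 1]
print(peak_range(row_sums, 4, 3))     # (2, 5)
print(peak_range(row_sums, 10, 4.5))  # (10, 11)
```

The same function can be applied to the column accumulation sum curve to obtain column coordinate ranges. Sub-index interpolation of the intersection points is omitted here for brevity.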
(4) Determining initial clustering points according to target range
An initial clustering point is placed at the geometric center of each target range, and the initial clustering points of all target ranges form the initial clustering point set.
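As an illustration of this placement, the sketch below pairs row coordinate ranges with column coordinate ranges and returns the center of each resulting rectangle. The function name `initial_cluster_points` and the sample ranges are hypothetical, and pairing every row range with every column range is an assumption made for this example; the source does not say whether target ranges that contain no target data points are discarded.

```python
from itertools import product


def initial_cluster_points(row_ranges, col_ranges):
    """Return the geometric center of every (row range, column range) rectangle."""
    points = []
    for (r0, r1), (c0, c1) in product(row_ranges, col_ranges):
        points.append(((r0 + r1) / 2.0, (c0 + c1) / 2.0))
    return points


# Hypothetical ranges: two row ranges and one column range give two target ranges.
seeds = initial_cluster_points([(2, 5), (10, 11)], [(1, 3)])
print(seeds)  # [(3.5, 2.0), (10.5, 2.0)]
```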
Step S003: performing clustering based on the initial clustering points.
Based on the initial clustering point set, the foreground data points in the binary matrix are clustered by mean shift clustering to obtain a plurality of category sets.
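A seeded mean shift can be sketched as follows. This is a simplified flat-kernel variant written for illustration only: the function name `mean_shift`, the `bandwidth` parameter and the sample data are assumptions, since the source does not specify the kernel or the bandwidth. Each seed is moved repeatedly to the mean of the data points within `bandwidth` until it stops moving, and every data point is then assigned to the nearest converged seed.

```python
import math


def mean_shift(points, seeds, bandwidth, max_iter=100, tol=1e-6):
    """Flat-kernel mean shift started from the given seeds."""
    modes = []
    for seed in seeds:
        center = tuple(float(v) for v in seed)
        for _ in range(max_iter):
            near = [p for p in points if math.dist(p, center) <= bandwidth]
            if not near:
                break  # no data near this seed; leave it where it is
            new_center = tuple(sum(c) / len(near) for c in zip(*near))
            shift = math.dist(new_center, center)
            center = new_center
            if shift < tol:
                break
        modes.append(center)
    # Assign every data point to the nearest converged mode.
    labels = [
        min(range(len(modes)), key=lambda k: math.dist(p, modes[k]))
        for p in points
    ]
    return modes, labels


# Hypothetical foreground points (row, column) forming two groups.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10), (10, 11), (11, 10), (11, 11)]
modes, labels = mean_shift(points, seeds=[(0.4, 0.6), (10.5, 10.2)], bandwidth=2.0)
print(modes)   # [(0.5, 0.5), (10.5, 10.5)]
print(labels)  # [0, 0, 0, 0, 1, 1, 1, 1]
```

Because the seeds already lie near the cluster centers, each converges within two iterations in this example, which is the saving in iteration time that the method aims at. Merging of seeds that converge to the same mode is omitted from the sketch.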
In summary, in the embodiments of the present invention, the optimal segmentation threshold is obtained by the maximum inter-class variance method and is used to segment the acquired two-dimensional matrix data into a binary matrix. Row and column accumulation sum curves are obtained from the accumulation sums of the rows and columns of the binary matrix. Each characteristic peak of the row and column accumulation sum curves is analyzed to obtain its segmentation threshold, the segmentation thresholds are used to obtain the row and column coordinate ranges, and all target ranges are obtained from the row and column coordinate ranges. The initial clustering points are placed based on the target ranges. Because points placed in this way lie closer to the center of each category, clustering iteration time is saved and clustering efficiency is improved.
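The first stages of this pipeline can be sketched in Python as follows. The sketch uses the standard form of the between-class variance, w0 * w1 * (m0 - m1)^2; the equation of the invention itself is available only as an image, so this form is an assumption. The function names, the sample matrix and the choice of treating values above the threshold as foreground are also hypothetical.

```python
def otsu_threshold(values):
    """Return the candidate threshold with the largest between-class variance.

    Returns None when all values are equal (no split is possible).
    """
    best_t, best_var = None, -1.0
    n = len(values)
    # The largest value is skipped: splitting there leaves one class empty.
    for t in sorted(set(values))[:-1]:
        low = [v for v in values if v <= t]
        high = [v for v in values if v > t]
        w0, w1 = len(low) / n, len(high) / n
        m0, m1 = sum(low) / len(low), sum(high) / len(high)
        var = w0 * w1 * (m0 - m1) ** 2
        if var > best_var:
            best_t, best_var = t, var
    return best_t


def binarize(matrix, t):
    """Set values above t to 1 (assumed foreground) and all others to 0."""
    return [[1 if v > t else 0 for v in row] for row in matrix]


def accumulation_sums(binary):
    """Return the row accumulation sums and the column accumulation sums."""
    row_sums = [sum(row) for row in binary]
    col_sums = [sum(col) for col in zip(*binary)]
    return row_sums, col_sums


# Hypothetical two-dimensional matrix with a bright block in the upper right.
matrix = [[1, 2, 9, 8],
          [2, 1, 8, 9],
          [1, 1, 2, 1]]
t = otsu_threshold([v for row in matrix for v in row])
binary = binarize(matrix, t)
print(t)                          # 2
print(accumulation_sums(binary))  # ([2, 2, 0], [0, 0, 2, 2])
```

The two sequences returned by `accumulation_sums` are the data from which the row and column accumulation sum curves are constructed.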
It should be noted that: the sequence of the above embodiments of the present invention is only for description, and does not represent the advantages or disadvantages of the embodiments. The processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing may also be possible or may be advantageous.
All the embodiments in the present specification are described in a progressive manner, and the same and similar parts among the embodiments are referred to each other, and each embodiment focuses on the differences from other embodiments.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like that fall within the spirit and principle of the present invention are intended to be included therein.

Claims (5)

1. A data optimization clustering method is characterized by comprising the following steps:
collecting data to obtain a two-dimensional matrix;
obtaining a foreground data point set and a background data point set by performing threshold segmentation on the two-dimensional matrix; setting each data point in the two-dimensional matrix, which belongs to the foreground data point set, as 1, and setting each data point in the two-dimensional matrix data, which belongs to the background data point set, as 0 to obtain a binary matrix;
obtaining the abscissa range of each target according to the binary matrix, including: accumulating each row of data of the binary matrix to obtain row accumulation sums, recording the sequence formed by the row accumulation sums of all rows of the binary matrix as a row accumulation sum sequence, and constructing a row accumulation sum curve according to the row accumulation sum sequence; dividing the row accumulation sum curve to obtain a plurality of row characteristic peaks; obtaining a first slope and a second slope of each row characteristic peak according to each row characteristic peak; obtaining a first segmentation threshold of each row characteristic peak according to the first slope and the second slope of each row characteristic peak and each row characteristic peak; obtaining a second segmentation threshold of each row characteristic peak according to the first segmentation threshold of each row characteristic peak and each row characteristic peak; drawing a straight line parallel to the horizontal axis through the second segmentation threshold of each row characteristic peak, wherein the straight line intersects the row characteristic peak at two points, and the abscissas of the two points form the abscissa range of each target;
similarly, obtaining the vertical coordinate range of each target according to the binary matrix; the abscissa range and the ordinate range of each target form a target range of each target;
and placing an initial clustering point in the target range of each target of the two-dimensional matrix, and performing mean shift clustering on foreground data points of the two-dimensional matrix based on the initial clustering points to obtain all categories.
2. The data optimization clustering method according to claim 1, wherein the method for obtaining the foreground data point set and the background data point set by performing threshold segmentation on the two-dimensional matrix comprises:
segmenting the two-dimensional matrix data with each of a plurality of different preset segmentation thresholds to obtain a first category and a second category, and calculating the category variance between the two categories according to the first category and the second category, wherein the category variance is calculated as follows:
[Equation image not reproduced: category variance between the first category and the second category]
wherein the symbols of the equation denote, in order: the proportions of the number of data in the first category and in the second category to the total number of data in the two-dimensional matrix; the mean of the data in the first category and the mean of the data in the second category; and the category variance between the first category and the second category;
each preset segmentation threshold corresponds to one category variance; the preset segmentation threshold corresponding to the maximum category variance is selected and recorded as the optimal segmentation threshold; all data points in the first category obtained by segmenting with the optimal segmentation threshold are recorded as the foreground data point set, and all data points in the second category obtained by segmenting with the optimal segmentation threshold are recorded as the background data point set.
3. The data optimization clustering method according to claim 1, wherein the method for obtaining the first slope and the second slope of each row characteristic peak according to each row characteristic peak comprises:
obtaining a maximum value, a first minimum value and a second minimum value of each row characteristic peak, and calculating the first slope of each row characteristic peak according to the maximum value and the first minimum value of each row characteristic peak, wherein the first slope is calculated as follows:
[Equation image not reproduced: first slope of each row characteristic peak]
wherein the symbols of the equation denote, in order: the abscissa of the maximum value of each row characteristic peak; the ordinate of the maximum value of each row characteristic peak; the abscissa of the first minimum value of each row characteristic peak; the ordinate of the first minimum value of each row characteristic peak; and the first slope of each row characteristic peak;
similarly, calculating the second slope of each row characteristic peak according to the maximum value and the second minimum value of each row characteristic peak.
4. The data optimization clustering method according to claim 1, wherein the method for obtaining the first segmentation threshold of each row characteristic peak according to the first slope and the second slope of each row characteristic peak and each row characteristic peak comprises:
for the maximum value, the first minimum value and the second minimum value of each row characteristic peak, acquiring the larger of the first minimum value and the second minimum value and recording it as the larger minimum value;
calculating the first segmentation threshold according to the first slope, the second slope, the ordinate of the maximum value and the larger minimum value of each row characteristic peak, as follows:
[Equation image not reproduced: first segmentation threshold of each row characteristic peak]
wherein the symbols of the equation denote, in order: the ordinate of the larger minimum value; the first slope of each row characteristic peak; the second slope of each row characteristic peak; the ordinate of the maximum value; and the first segmentation threshold of each row characteristic peak; exp() denotes the exponential function with the natural constant as its base.
5. The data optimization clustering method according to claim 1, wherein the method for obtaining the second segmentation threshold of each row characteristic peak according to the first segmentation threshold of each row characteristic peak and each row characteristic peak comprises: acquiring the maximum value of each row characteristic peak; and calculating the second segmentation threshold of each row characteristic peak according to the maximum value of each row characteristic peak and the first segmentation threshold of each row characteristic peak, as follows:
[Equation image not reproduced: second segmentation threshold of each row characteristic peak]
wherein the symbols of the equation denote, in order: the first segmentation threshold of each row characteristic peak; the ordinate of the maximum value of each row characteristic peak; a hyper-parameter; an empirical scale factor; and the total number of columns of the binary matrix; exp() denotes the exponential function with the natural constant as its base.
CN202211277521.2A 2022-10-19 2022-10-19 Data optimization clustering method Active CN115358349B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211277521.2A CN115358349B (en) 2022-10-19 2022-10-19 Data optimization clustering method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211277521.2A CN115358349B (en) 2022-10-19 2022-10-19 Data optimization clustering method

Publications (2)

Publication Number Publication Date
CN115358349A true CN115358349A (en) 2022-11-18
CN115358349B CN115358349B (en) 2023-08-15

Family

ID=84008683

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211277521.2A Active CN115358349B (en) 2022-10-19 2022-10-19 Data optimization clustering method

Country Status (1)

Country Link
CN (1) CN115358349B (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115577246A (en) * 2022-12-09 2023-01-06 杭州贝斯特气体有限公司 Method for detecting anti-vibration performance of gas cylinder protective cover
CN115623536A (en) * 2022-12-20 2023-01-17 苏州洛尔帝科技有限公司 High-reliability data transmission method of sensor signal based on LoRa
CN116432088A (en) * 2023-05-04 2023-07-14 常宝新材料(苏州)有限公司 Intelligent monitoring method and system for layer thickness of composite optical film
CN116432088B (en) * 2023-05-04 2023-11-07 常宝新材料(苏州)有限公司 Intelligent monitoring method and system for layer thickness of composite optical film

Also Published As

Publication number Publication date
CN115358349B (en) 2023-08-15

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20230619

Address after: Zhuoyehui B2103, Excellence Meilin Center Square (South Area), 126 Zhongkang Road, Meidu Community, Meilin Street, Futian District, Shenzhen, Guangdong 518000

Applicant after: Shenzhen Ruilian Credit Data Technology Co.,Ltd.

Address before: 226000 No. 500, Linyang Road, Qidong Economic Development Zone, Nantong City, Jiangsu Province

Applicant before: Jiangsu yijiesi Information Technology Co.,Ltd.

GR01 Patent grant