WO2022095379A1 - Data dimensionality reduction processing method and apparatus, computer device, and storage medium - Google Patents

Data dimensionality reduction processing method and apparatus, computer device, and storage medium

Info

Publication number
WO2022095379A1
WO2022095379A1 PCT/CN2021/091289 CN2021091289W WO2022095379A1 WO 2022095379 A1 WO2022095379 A1 WO 2022095379A1 CN 2021091289 W CN2021091289 W CN 2021091289W WO 2022095379 A1 WO2022095379 A1 WO 2022095379A1
Authority
WO
WIPO (PCT)
Prior art keywords
sample data
feature
data
information
centroid
Prior art date
Application number
PCT/CN2021/091289
Other languages
English (en)
French (fr)
Inventor
王有金
Original Assignee
平安科技(深圳)有限公司 (Ping An Technology (Shenzhen) Co., Ltd.)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 平安科技(深圳)有限公司 (Ping An Technology (Shenzhen) Co., Ltd.)
Publication of WO2022095379A1

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/28Databases characterised by their database models, e.g. relational or object models
    • G06F16/283Multi-dimensional databases or data warehouses, e.g. MOLAP or ROLAP
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/213Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Definitions

  • the present application relates to the technical field of big data processing, belongs to the application scenario of dimensionality reduction processing for sample data in smart cities, and in particular relates to a data dimensionality reduction processing method, device, computer equipment and storage medium.
  • the network has become an important way for people to obtain information.
  • the explosion of information brought by the big data era has increased the computational burden on computers during information search, resulting in low efficiency and making it relatively difficult to obtain effective information accurately.
  • the attribute information of multiple dimensions related to the data is usually obtained to describe the characteristics of the data. The more dimensions there are, the heavier the computing task is for the computer, and the less efficient it is to accurately obtain the required data.
  • the embodiments of the present application provide a data dimensionality reduction processing method, apparatus, computer equipment, and storage medium, which aim to solve the problem that prior art methods cannot completely retain all attribute information of the data when reducing the dimensionality of the attributes of the data.
  • an embodiment of the present application provides a data dimensionality reduction processing method, which includes:
  • if a sample data set input by the user is received, quantify the sample data included in the sample data set according to a preset information quantification rule to obtain feature quantization information of each of the sample data;
  • receive the dimension reduction ratio value input by the user, and calculate the number of dimensions according to the dimension of the feature vector in the feature quantization information and the dimension reduction ratio value;
  • group the sample data according to the number of dimensions and the feature quantization information of each of the sample data to obtain multiple groups of sample data, and determine the initial centroid of each group of sample data;
  • iteratively correct the initial centroids according to the multiple groups of sample data, so as to obtain a target centroid that matches each initial centroid;
  • calculate a distance feature value between each sample data in the sample data set and each of the target centroids to obtain a dimension reduction feature of each of the sample data.
  • an embodiment of the present application provides a data dimensionality reduction processing device, which includes:
  • a feature quantification information acquisition unit configured to quantify the sample data included in the sample data set according to a preset information quantification rule to obtain feature quantification information of each of the sample data if the sample data set input by the user is received;
  • a dimension quantity determination unit configured to receive the dimension reduction ratio value input by the user, and calculate the dimension quantity according to the dimension of the feature vector in the feature quantization information and the dimension reduction ratio value;
  • an initial centroid determination unit configured to group the sample data according to the number of dimensions and the feature quantification information of each of the sample data to obtain multiple groups of sample data and determine the initial centroid of each group of sample data;
  • a centroid iterative correction unit configured to iteratively correct the initial centroids according to the multiple sets of sample data, so as to obtain a target centroid that matches each initial centroid;
  • a dimensionality reduction feature acquisition unit configured to calculate a distance feature value between each sample data in the sample data set and each of the target centroids, so as to obtain a dimensionality reduction feature of each of the sample data.
  • an embodiment of the present application further provides a computer device, which includes a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein, when the processor executes the computer program, the data dimensionality reduction processing method described in the first aspect above is implemented.
  • an embodiment of the present application further provides a computer-readable storage medium, wherein the computer-readable storage medium stores a computer program, and the computer program, when executed by a processor, causes the processor to execute the data dimensionality reduction processing method described in the first aspect above.
  • Embodiments of the present application provide a data dimensionality reduction processing method, apparatus, computer equipment, and storage medium. The sample data in the sample data set are quantified according to the information quantification rules to obtain feature quantization information; the number of dimensions is determined according to the feature quantization information and the dimension reduction ratio value; the sample data are grouped according to the number of dimensions and the feature quantization information and the initial centroids are obtained; the initial centroid of each group of sample data is iteratively corrected to obtain the corresponding target centroid; and the distance feature value between each piece of sample data and each target centroid is calculated as the dimensionality reduction feature of that sample data.
  • through the above method, while all attribute information of the sample data is retained, dimensionality reduction of the attribute information of the sample data is achieved; when the sample data are subsequently screened or classified, they can be processed efficiently based on the dimensionality reduction features, and since all attribute information is retained, the accuracy of data analysis and processing can be ensured and the efficiency of sample data processing can be greatly improved.
  • FIG. 1 is a schematic flowchart of a data dimensionality reduction processing method provided by an embodiment of the present application
  • FIG. 2 is a schematic sub-flow diagram of a data dimensionality reduction processing method provided by an embodiment of the present application
  • FIG. 3 is a schematic diagram of another sub-flow of the data dimensionality reduction processing method provided by the embodiment of the present application.
  • FIG. 4 is a schematic diagram of another sub-flow of the data dimensionality reduction processing method provided by the embodiment of the present application.
  • FIG. 5 is a schematic diagram of another sub-flow of the data dimensionality reduction processing method provided by the embodiment of the present application.
  • FIG. 6 is a schematic diagram of another sub-flow of the data dimensionality reduction processing method provided by the embodiment of the present application.
  • FIG. 7 is another schematic flowchart of a data dimensionality reduction processing method provided by an embodiment of the present application.
  • FIG. 8 is a schematic block diagram of a data dimensionality reduction processing apparatus provided by an embodiment of the present application.
  • FIG. 9 is a schematic block diagram of a computer device according to an embodiment of the present application.
  • FIG. 1 is a schematic flowchart of a data dimensionality reduction processing method provided by an embodiment of the present application.
  • the data dimensionality reduction processing method is applied to a user terminal.
  • the method is executed by application software installed in the user terminal.
  • a terminal is a terminal device used to perform a data dimensionality reduction processing method to complete the dimensionality reduction processing of the sample data, such as a desktop computer, a laptop computer, a tablet computer, or a mobile phone.
  • the method includes steps S110-S150.
  • if the sample data set input by the user is received, quantify the sample data included in the sample data set according to a preset information quantification rule to obtain feature quantization information of each of the sample data.
  • the feature quantization information of each sample data is obtained by quantizing the sample data included in the sample data set according to a preset information quantification rule.
  • the sample data includes multiple pieces of attribute information
  • the quantification rules include multiple quantification items
  • the information quantification rules are specific rules for quantifying multiple pieces of attribute information of the sample data in the sample data set.
  • the attribute information of a piece of sample data is converted into feature quantization information for quantitative representation, and the quantization items in the information quantization rules may be equal to or less than the number of attribute information items of the sample data.
  • the customer information in the customer information data set may include attribute information such as the customer's gender, age, occupation, hobbies, monthly income, marital status, childbearing status, etc.
  • the customer information of each customer in the information data set is converted into feature quantitative information for quantitative representation.
  • step S110 includes sub-steps S111 , S112 and S113 .
  • S111. Determine whether the attribute information in the sample data corresponding to each quantification item of the information quantification rule is a numerical value; S112. If the attribute information corresponding to the quantification item is a numerical value, calculate the attribute information according to the activation function of the quantification item to obtain the quantized value of the attribute information; S113. If the attribute information corresponding to the quantification item is not a numerical value, obtain the numerical value corresponding to the keyword in the quantification item that matches the attribute information as the quantized value of the attribute information.
  • the sample data in the sample data set each contain multiple items of attribute information. Each quantification item in the information quantification rule matches one item of attribute information and converts the corresponding item of attribute information in the sample data into a quantized value. The multiple quantized values corresponding to a piece of sample data are combined into the feature quantization information of that sample data, which can be expressed as a multi-dimensional feature vector; that is, each item of attribute information corresponds to one dimension of the feature vector in the feature quantization information, and the quantized value obtained for each quantification item lies in the range [0, 1]. Specifically, it is first judged whether the attribute information is a numerical value: if it is, the quantized value of the attribute information is obtained by calculating the activation function that matches the attribute information in the information quantification rule; if it is not, the value corresponding to the keyword in the information quantification rule that matches the attribute information is used as the quantized value of the attribute information.
  • for attribute information that is expressed as a numerical value, the corresponding quantification rule in the information quantification rules consists of an activation function and an intermediate value; the quantized value is obtained by calculating the activation function on the intermediate value and the attribute information of that quantification item.
  • in the activation function, x is the item of attribute information corresponding to the quantification item and v is the intermediate value corresponding to that quantification item; for example, for an age quantification item with intermediate value v = 35, an age of x = 30 is quantified to 0.5357.
  • the occupation quantification item of the information quantification rule contains four keywords: "student", "doctor", "teacher", and "programmer"; the value corresponding to "student" is "0", the value corresponding to "doctor" is "0.25", the value corresponding to "teacher" is "0.6", and the value corresponding to "programmer" is "1". If the occupation in a piece of customer information in the sample data set is teacher, the corresponding quantized value is "0.6" (an illustrative code sketch follows).
  • S120 Receive the dimension reduction ratio value input by the user, and calculate the number of dimensions according to the dimension of the feature vector in the feature quantization information and the dimension reduction ratio value.
  • the user can directly input the dimension reduction ratio value, and the number of dimensions is calculated directly from the dimension reduction ratio value and the dimension of the feature vector in the feature quantization information. Specifically, the dimension reduction ratio value is multiplied by the number of dimensions of the feature vector and the product is rounded to obtain the number of dimensions. Under normal circumstances, the number of dimensions obtained is much smaller than the number of dimensions of the feature vector in the feature quantization information.
  • for example, if the number of dimensions of the feature vector in the feature quantization information is 41 and the dimension reduction ratio value is 0.15, then 41 × 0.15 = 6.15, and the number of dimensions is 6 after rounding (a minimal sketch of this calculation follows).
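  • A minimal sketch of the dimension-count calculation, assuming that "rounding" the product means taking its integer part (the text does not specify the rounding direction; for 6.15 both floor and ordinary rounding give 6):

```python
import math

feature_dim = 41          # dimensions of the feature vector in the feature quantization information
reduction_ratio = 0.15    # dimension reduction ratio value input by the user
num_dims = math.floor(feature_dim * reduction_ratio)   # 41 * 0.15 = 6.15 -> 6
print(num_dims)
```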
  • S130 Group the sample data according to the number of dimensions and the feature quantification information of each sample data to obtain multiple sets of sample data, and determine the initial centroid of each set of sample data.
  • the sample data are grouped according to the number of dimensions and feature quantification information of each of the sample data to obtain multiple sets of sample data, and an initial centroid of each set of sample data is determined.
  • the number of dimensions can be calculated by combining the dimension reduction ratio value input by the user with the number of dimensions of the feature vector in the feature quantization information; the number of dimensions is the number of dimensions of the reduced feature obtained after dimensionality reduction of the feature vector in the feature quantization information.
  • the sample data can be grouped by the number of dimensions to obtain multiple groups of sample data, and the initial centroid of each group of sample data can be determined, then the number of obtained initial centroids is equal to the number of dimensions.
  • step S130 includes sub-steps S131 and S132.
  • S131 Randomly group the sample data according to the number of dimensions to obtain multiple sets of sample data; S132, respectively obtain feature quantification information of a piece of sample data from each set of sample data as the initial centroid of each set of sample data.
  • specifically, all sample data can be randomly grouped according to the number of dimensions to obtain multiple sets of sample data, each containing roughly the same number of sample data; the feature quantization information of one piece of sample data is then taken from each set of sample data as the initial centroid of that set (a minimal sketch follows).
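  • A minimal sketch of the random grouping and initial-centroid selection, assuming each sample is already represented by its feature quantization vector; the round-robin split and the random choice of one member per group are implementation assumptions:

```python
import random

def random_group(samples, num_groups, seed=0):
    """Randomly split the quantified samples into num_groups groups of roughly
    equal size and take one sample's feature quantization vector from each group
    as that group's initial centroid."""
    rng = random.Random(seed)
    shuffled = list(samples)
    rng.shuffle(shuffled)
    groups = [shuffled[i::num_groups] for i in range(num_groups)]  # round-robin split
    centroids = [rng.choice(group) for group in groups]
    return groups, centroids

groups, init_centroids = random_group([[0.54, 0.6], [0.2, 1.0], [0.9, 0.0], [0.4, 0.25]], 2)
```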
  • step S130 may further include sub-steps S1301 , S1302 and S132 .
  • S1301. Construct a data grouping model according to a preset grouping template, the feature quantization information, and the number of dimensions; S1302. Input the feature quantization information of each piece of sample data into the data grouping model in turn to group the sample data and obtain multiple groups of sample data; S132. Obtain the feature quantization information of one piece of sample data from each group of sample data as the initial centroid of that group of sample data.
  • specifically, a data grouping model can be constructed according to the grouping template, the feature quantization information, and the number of dimensions, and multiple groups of sample data can be obtained by grouping the sample data through the data grouping model. The grouping template can include a fully connected layer: input nodes are constructed based on the feature quantization information and output nodes are constructed based on the number of dimensions, and combining the input nodes, the fully connected layer, and the output nodes yields a neural-network-based data grouping model. Each input node corresponds to the quantized value of one dimension of the feature quantization information, and each output node corresponds to one group.
  • the input nodes and the output nodes are connected through the fully connected layer, which contains multiple feature units. A first formula group is set between the input nodes and the fully connected layer, and a second formula group is set between the fully connected layer and the output nodes. The first formula group includes the formulas from all input nodes to all feature units, each taking an input node value as input value and a feature unit value as output value; the second formula group includes the formulas from all feature units to all output nodes, each taking a feature unit value as input value and an output node value as output value. Each formula included in the obtained data grouping model has corresponding parameter values.
  • the output node value is the matching probability between the feature quantization information and the group corresponding to that output node. After the matching probability between the feature quantization information of a piece of sample data and each output node is calculated, the group corresponding to the output node with the highest matching probability is regarded as the group that matches that sample data; in this way the sample data can be grouped and multiple groups of sample data can be obtained. At this time, the number of sample data contained in each obtained group may differ considerably (a minimal sketch of such a grouping model follows).
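  • The following sketch shows a neural-network grouping model of the kind described above: input nodes, one fully connected layer of feature units, and one output node per group. The number of feature units, the tanh activation, and the softmax used to turn output-node values into matching probabilities are assumptions; the filing does not specify them.

```python
import numpy as np

class DataGroupingModel:
    """Input nodes -> fully connected layer of feature units -> output nodes (one per group)."""

    def __init__(self, num_input_dims, num_groups, num_units=16, seed=0):
        rng = np.random.default_rng(seed)
        self.W1 = rng.normal(0.0, 0.1, (num_input_dims, num_units))  # first formula group parameters
        self.W2 = rng.normal(0.0, 0.1, (num_units, num_groups))      # second formula group parameters

    def forward(self, x):
        """Return the matching probability of feature vector x for each group."""
        h = np.tanh(np.asarray(x, dtype=float) @ self.W1)   # feature unit values
        z = h @ self.W2                                      # output node values
        e = np.exp(z - z.max())
        return e / e.sum()                                   # probabilities in [0, 1]

    def assign_group(self, x):
        """Index of the group whose output node has the highest matching probability."""
        return int(np.argmax(self.forward(x)))
```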
  • in an embodiment, step S1303 may be further included before step S1302: if a training data set input by the user is received, iteratively train the data grouping model according to a preset gradient descent training model and the training data set to obtain a trained data grouping model.
  • the user who inputs the training data set is the user of the user terminal.
  • in order for the data grouping model to group the feature quantization information with higher accuracy, iterative training of the data grouping model is required, that is, the parameter values in the first formula group and the second formula group of the model are adjusted; the data grouping model obtained after training can group the feature quantization information more accurately.
  • the gradient descent training model is a model for training the data grouping model.
  • the gradient descent training model includes the loss value calculation formula and the gradient calculation formula.
  • the training data set contains multiple pieces of training data, and each piece of training data contains a piece of feature quantization information and a corresponding grouping label.
  • a piece of feature quantization information is input into the data grouping model to obtain the matching probability of each output node, and a loss value is calculated from these matching probabilities and the grouping label according to the loss value calculation formula; the update value corresponding to each parameter in the first formula group and the second formula group can then be calculated according to the loss value and the gradient calculation formula, and the parameter value corresponding to each parameter is updated with its update value.
  • the process of updating the parameter values is the specific process of training the data grouping model.
  • the loss value calculation formula is expressed in terms of f_p and f_n, where f_p is the matching probability of the output node corresponding to the grouping label in the data grouping model, f_n is the matching probability of the n-th output node, and the value ranges of f_p and f_n are both [0, 1].
  • the update value of each parameter in the data grouping model is calculated according to the gradient calculation formula, the loss value, and the calculated value of the data grouping model. Specifically, the calculated value obtained by applying a parameter in the data grouping model to the feature quantization information is input into the gradient calculation formula and combined with the above loss value to calculate the update value corresponding to that parameter; this calculation process is the gradient descent calculation.
  • the gradient calculation formula can be expressed as: ω_x* = ω_x − η·(∂L/∂ω_x), where ω_x* is the calculated update value of the parameter x, ω_x is the original parameter value of the parameter x, η is the learning rate preset in the gradient calculation formula, and ∂L/∂ω_x is the partial derivative of the loss value L with respect to the parameter x, computed using the calculated value corresponding to that parameter.
  • after the parameter value of each parameter in the data grouping model is updated with its update value, one training pass of the data grouping model is completed.
  • based on the data grouping model obtained after one training pass, another piece of training data in the training data set is calculated and processed again, and the above training process is repeated to implement iterative training of the data grouping model; when the calculated loss value is less than the preset loss threshold, or all pieces of training data in the training data set have been used for training, the training process is terminated to obtain the trained data grouping model (a training sketch follows).
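  • A training sketch built on the DataGroupingModel class from the previous sketch. The loss and gradient formulas are present only as images in the original filing; the negative-log loss on the labelled output node's matching probability and the finite-difference gradients below are stand-ins, not the filed formulas, and the stopping test mirrors the loss-threshold condition described above.

```python
import numpy as np

def train_grouping_model(model, train_set, learning_rate=0.1, loss_threshold=1e-3, eps=1e-5):
    """train_set: list of (feature_vector, group_label) pairs."""

    def loss(x, label):
        probs = model.forward(x)
        return -np.log(probs[label] + 1e-12)          # assumed loss, not the filed formula

    for x, label in train_set:
        for W in (model.W1, model.W2):                # first and second formula group parameters
            grad = np.zeros_like(W)
            base = loss(x, label)
            for idx in np.ndindex(W.shape):
                W[idx] += eps                         # finite-difference estimate of dL/dw
                grad[idx] = (loss(x, label) - base) / eps
                W[idx] -= eps
            W -= learning_rate * grad                 # update: w <- w - learning_rate * dL/dw
        if loss(x, label) < loss_threshold:           # stop once the loss falls below the threshold
            break
    return model
```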
  • each set of sample data contains multiple pieces of sample data; iterative calculation can be performed based on the multiple sets of sample data and the initial centroids, and each initial centroid is iteratively corrected to obtain a corresponding target centroid.
  • step S140 further includes sub-steps S141 , S142 , S143 , S144 , S145 and S146 .
  • the Euclidean distance between each piece of sample data in a set of sample data and each initial centroid can be calculated; the Euclidean distance between a piece of sample data C = {c_1, c_2, …, c_M} and an initial centroid O = {o_1, o_2, …, o_M} can be calculated using formula (1): d(C, O) = sqrt((c_1 − o_1)^2 + (c_2 − o_2)^2 + … + (c_M − o_M)^2), where M is the number of dimensions of the feature vector included in the feature quantization information.
  • after the distance value between each piece of sample data and each initial centroid is obtained, each piece of sample data is regrouped to the initial centroid with the smallest distance value, which yields regrouped sets of sample data; the regrouping only adjusts which group a sample belongs to and does not change the number of groups. The feature quantization average of each regrouped set of sample data (that is, the average of the feature quantization information of the sample data it contains) is then calculated as the corresponding corrected centroid.
  • after a preset number of iterative corrections of the initial centroids, the corrected centroid obtained by the iterative correction can be used as the target centroid corresponding to the initial centroid; alternatively, a preset iteration condition can be used to judge whether the regrouped sets of sample data satisfy the condition: if it is satisfied, the iterative correction continues, and if it is not satisfied, the corrected centroid obtained by the iterative correction is used as the target centroid corresponding to the initial centroid.
  • a distance threshold or a mean square error threshold can be configured in the iteration condition. For the distance threshold, the distance values between all the sample data in a group and that group's corrected centroid are calculated and averaged to obtain an average distance value for the group; if the average distance value of any group of sample data is greater than the distance threshold, the iteration condition is judged to be satisfied, and if no group's average distance value is greater than the distance threshold, the iteration condition is judged not to be satisfied. The mean square error threshold can be applied in the same way to the mean square error of each group's distance values.
  • if the iteration condition is satisfied, the process returns to step S141 to continue the iterative correction; if the iteration condition is not satisfied, the currently obtained corrected centroid is taken as the target centroid corresponding to the initial centroid (a minimal sketch of this iterative correction follows).
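  • A minimal sketch of steps S141–S146, assuming the iteration condition is the average-distance test described above (the mean-square-error variant and the fixed iteration count are omitted); the cap of max_iter passes is an added safeguard:

```python
import numpy as np

def correct_centroids(samples, init_centroids, dist_threshold=0.05, max_iter=100):
    """samples: (N, M) feature quantization vectors; init_centroids: (K, M) initial centroids.
    Returns the target centroids and the final group assignment of each sample."""
    X = np.asarray(samples, dtype=float)
    centroids = np.asarray(init_centroids, dtype=float).copy()
    assign = np.zeros(len(X), dtype=int)
    for _ in range(max_iter):
        # S141/S142: distance from every sample to every centroid, regroup to the nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        assign = dists.argmin(axis=1)
        # S143: feature quantization average of each regrouped set -> corrected centroid
        for k in range(len(centroids)):
            members = X[assign == k]
            if len(members):
                centroids[k] = members.mean(axis=0)
        # S144: average distance of each group to its corrected centroid
        group_dists = np.linalg.norm(X - centroids[assign], axis=1)
        avg = [group_dists[assign == k].mean() for k in range(len(centroids)) if np.any(assign == k)]
        # S145/S146: keep iterating while any group's average distance exceeds the threshold
        if max(avg) <= dist_threshold:
            break
    return centroids, assign
```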
  • step S1401 is further included after step S140 .
  • S1401. Determine the feature label of each of the target centroids according to the attribute information of each sample data in the sample data set. During the iterative correction of the initial centroids the sample data are also regrouped, and each target centroid corresponds to one group of sample data, so the feature label of each target centroid can be determined from the attribute information of the multiple groups of sample data obtained after the regrouping.
  • specifically, the attribute information of each group of sample data can be counted to obtain the statistical result of each group for each item of attribute information; an attribute value whose ratio in the statistical result of a group of sample data exceeds the preset ratio value is taken as a feature label of that group of sample data, which determines the feature label of each target centroid. The feature label of a target centroid characterizes the features of the group of sample data corresponding to that target centroid, and the feature labels of a group of sample data can be used to understand the overall feature information of that group.
  • for example, if the preset ratio value is 75% and, in the statistical result for the gender attribute of a group of sample data, the ratio of males is 20% and the ratio of females is 80%, then the ratio of females exceeds the preset ratio value, and this attribute value is used as a feature label of the target centroid corresponding to that group of sample data (a sketch of this statistic follows).
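  • A sketch of the feature-label statistics in step S1401, operating on the raw (un-quantified) attribute values of one group of sample data; the dict-of-attributes representation is an assumption:

```python
from collections import Counter

def feature_labels(group_samples, ratio_threshold=0.75):
    """group_samples: list of dicts of raw attribute values for one group of sample data.
    Returns the (attribute, value) pairs whose share in the group exceeds the threshold."""
    labels = []
    total = len(group_samples)
    for attr in group_samples[0]:
        counts = Counter(sample[attr] for sample in group_samples)
        value, freq = counts.most_common(1)[0]
        if freq / total > ratio_threshold:
            labels.append((attr, value))   # e.g. ("gender", "female") when 80% > 75%
    return labels
```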
  • S150 Calculate the distance feature value between each sample data in the sample data set and each target centroid, so as to obtain a dimension reduction feature of each sample data.
  • a distance feature value between each sample data in the sample data set and each of the target centroids is calculated to obtain a dimension reduction feature of each of the sample data.
  • the distance characteristic value between each sample data and the target centroid is calculated according to the characteristic quantification information of the sample data, and the specific process of calculating the distance characteristic value is to calculate the Euclidean distance between the two.
  • by combining the distance feature values between a piece of sample data and every target centroid, the dimensionality reduction feature of that sample data can be obtained.
  • the number of target centroids is equal to the number of dimensions, and the number of distance eigenvalues contained in the dimensionality reduction feature is also equal to the number of dimensions.
  • the dimensionality reduction feature can be represented by a multidimensional feature vector equal to the number of dimensions.
  • for example, if the dimensionality reduction feature of a piece of sample data is calculated as Jx = {12.20, 5.31, 28.66, 10.79, 19.83, 4.47}, then the 41-dimensional feature vector included in the feature quantization information of that sample data has been reduced to a 6-dimensional feature vector, which serves as the dimension reduction feature Jx of the sample data (a minimal sketch of this calculation follows).
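  • A minimal sketch of step S150: the dimensionality reduction feature of each sample is the vector of Euclidean distances from its feature quantization vector to every target centroid, so the number of reduced dimensions equals the number of target centroids:

```python
import numpy as np

def reduce_features(samples, target_centroids):
    """samples: (N, M) feature quantization vectors; target_centroids: (K, M) target centroids.
    Returns an (N, K) array whose rows are the dimensionality reduction features."""
    X = np.asarray(samples, dtype=float)
    C = np.asarray(target_centroids, dtype=float)
    return np.linalg.norm(X[:, None, :] - C[None, :, :], axis=2)
```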
  • in addition, the overall feature information of a piece of sample data can be obtained by combining its dimensionality reduction feature with the feature label of the sample data group to which it belongs; the dimensionality reduction feature in the overall feature information represents the features of the sample data quantitatively, and the feature label in the overall feature information characterizes the features of the sample data in text form.
  • the dimensionality reduction feature of each piece of sample data is calculated by the above method. While all attribute information of the sample data is retained, dimensionality reduction of the attribute information is achieved; when the sample data are subsequently screened, classified, or otherwise analyzed, they can be processed efficiently based on the dimensionality reduction features. Since all attribute information is retained, the accuracy of data analysis and processing can be ensured and the processing efficiency of the sample data can be greatly improved.
  • the technical methods in this application can be applied to application scenarios that involve dimensionality reduction processing of sample data, such as smart government affairs, smart city management, smart community, smart security, smart logistics, smart medical care, smart education, smart environmental protection, and smart transportation, so as to promote the construction of smart cities.
  • in the data dimensionality reduction processing method provided by the embodiments of the present application, the sample data in the sample data set are quantified according to the information quantification rules to obtain feature quantization information; the number of dimensions is determined according to the feature quantization information and the dimension reduction ratio value; the sample data are grouped according to the number of dimensions and the feature quantization information and the initial centroids are obtained; the initial centroid of each group of sample data is iteratively corrected to obtain the corresponding target centroid; and the distance feature value between each piece of sample data and each target centroid is calculated as the dimensionality reduction feature of that sample data (an end-to-end sketch follows).
  • through the above method, while all attribute information of the sample data is retained, dimensionality reduction of the attribute information is achieved; when the sample data are subsequently screened or classified, they can be processed efficiently based on the dimensionality reduction features, and since all attribute information is retained, the accuracy of data analysis and processing can be ensured and the efficiency of sample data processing can be greatly improved.
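  • For orientation only, the sketches above can be chained as follows; raw_customer_records and the 41-dimensional feature vectors are assumed inputs, and all function and class names come from the sketches in this document rather than from the patent itself.

```python
import math

feature_vectors = [quantify_sample(record) for record in raw_customer_records]    # S110
num_dims = math.floor(len(feature_vectors[0]) * 0.15)                              # S120, e.g. 41 -> 6
groups, init_centroids = random_group(feature_vectors, num_dims)                   # S130
target_centroids, assignment = correct_centroids(feature_vectors, init_centroids)  # S140
reduced_features = reduce_features(feature_vectors, target_centroids)              # S150
```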
  • the embodiment of the present application further provides a data dimension reduction processing apparatus, and the data dimension reduction processing apparatus is configured to execute any one of the foregoing data dimension reduction processing methods.
  • FIG. 8 is a schematic block diagram of a data dimensionality reduction processing apparatus provided by an embodiment of the present application.
  • the data dimensionality reduction processing apparatus may be configured in a user terminal.
  • the data dimension reduction processing apparatus 100 includes a feature quantization information acquisition unit 110 , a dimension quantity determination unit 120 , an initial centroid determination unit 130 , a centroid iterative correction unit 140 , and a dimension reduction feature acquisition unit 150 .
  • the feature quantification information acquisition unit 110 is configured to quantify the sample data included in the sample data set according to preset information quantification rules to obtain feature quantification information of each of the sample data if the sample data set input by the user is received .
  • the feature quantization information acquisition unit 110 includes subunits: an attribute information determination unit, a first quantization processing unit, and a second quantization processing unit.
  • the attribute information judgment unit is used for judging whether the attribute information in the sample data corresponding to each quantification item of the information quantification rule is a numerical value; the first quantization processing unit is used for, if the attribute information corresponding to the quantification item is a numerical value, calculating the attribute information according to the activation function of the quantification item to obtain the quantized value of the attribute information; the second quantization processing unit is used for, if the attribute information corresponding to the quantification item is not a numerical value, obtaining the numerical value corresponding to the keyword in the quantification item that matches the attribute information as the quantized value of the attribute information.
  • the dimension quantity determination unit 120 is configured to receive the dimension reduction ratio value input by the user, and calculate the dimension quantity according to the dimension of the feature vector in the feature quantization information and the dimension reduction ratio value.
  • the initial centroid determining unit 130 is configured to group the sample data according to the number of dimensions and feature quantification information of each sample data to obtain multiple sets of sample data and determine the initial centroid of each set of sample data.
  • the initial centroid determining unit 130 includes subunits: a random grouping unit and an initial centroid obtaining unit.
  • the random grouping unit is used to randomly group the sample data according to the number of dimensions to obtain multiple groups of sample data; the initial centroid acquisition unit is used to obtain a piece of feature quantification information of the sample data from each group of sample data as each group. The initial centroid of the sample data.
  • the initial centroid determining unit 130 includes subunits: a data grouping model building unit, a sample data grouping unit, and an initial centroid obtaining unit.
  • a data grouping model construction unit, used for constructing a data grouping model according to a preset grouping template, the feature quantization information, and the number of dimensions; a sample data grouping unit, used for sequentially inputting the feature quantization information of each piece of the sample data into the data grouping model to group the sample data and obtain multiple groups of sample data; and an initial centroid acquisition unit, used for obtaining the feature quantization information of one piece of sample data from each group of sample data as the initial centroid of that group of sample data.
  • the initial centroid determination unit 130 further includes a subunit: a data grouping model training unit.
  • the data grouping model training unit is configured to iteratively train the data grouping model according to a preset gradient descent training model and the training data set to obtain a trained data grouping model if a training data set input by the user is received.
  • the centroid iterative correction unit 140 is configured to iteratively correct the initial centroids according to the multiple sets of sample data, so as to obtain a target centroid matching each initial centroid.
  • the centroid iterative correction unit 140 includes subunits: a distance value acquisition unit, a regrouping unit, a corrected centroid acquisition unit, an iteration judgment unit, a return execution unit, and a target centroid acquisition unit.
  • a distance value acquisition unit, used for obtaining the distance value between the sample data in each group of sample data and each of the initial centroids; a regrouping unit, used for regrouping the sample data according to the distance values between the sample data and each of the initial centroids; a corrected centroid acquisition unit, used for calculating the feature quantization average of each regrouped group of sample data as the corresponding corrected centroid; an iteration judgment unit, used for judging whether each regrouped group of sample data satisfies the preset iteration condition; a return execution unit, used for, if the iteration condition is satisfied, using the corrected centroid as the initial centroid and returning to the step of obtaining the distance value between the sample data in each group of sample data and each of the initial centroids; and a target centroid acquisition unit, used for, if the iteration condition is not satisfied, using the corrected centroid as the target centroid.
  • the data dimensionality reduction processing apparatus 100 further includes a subunit: a feature label acquisition unit.
  • a feature label obtaining unit configured to determine a feature label of each of the target centroids according to attribute information of each sample data in the sample data set.
  • the dimensionality reduction feature acquisition unit 150 is configured to calculate the distance feature value between each sample data in the sample data set and each of the target centroids, so as to obtain the dimensionality reduction feature of each of the sample data.
  • the data dimensionality reduction processing apparatus applies the above data dimensionality reduction processing method: it quantifies the sample data in the sample data set according to the information quantification rules to obtain feature quantization information, determines the number of dimensions according to the feature quantization information and the dimension reduction ratio value, groups the sample data according to the number of dimensions and the feature quantization information and obtains the initial centroids, iteratively corrects the initial centroid of each group of sample data to obtain the corresponding target centroids, and calculates the distance feature value between each piece of sample data and each target centroid as the dimensionality reduction feature of that sample data.
  • the above data dimensionality reduction processing apparatus can be implemented in the form of a computer program, and the computer program can be executed on a computer device as shown in FIG. 9 .
  • FIG. 9 is a schematic block diagram of a computer device provided by an embodiment of the present application.
  • the computer device may be a user terminal for executing a data dimensionality reduction processing method to perform dimensionality reduction processing on sample data.
  • the computer device 500 includes a processor 502 , a memory and a network interface 505 connected through a system bus 501 , wherein the memory may include a non-volatile storage medium 503 and an internal memory 504 .
  • the non-volatile storage medium 503 can store an operating system 5031 and a computer program 5032; when the computer program 5032 is executed, it can cause the processor 502 to execute the data dimensionality reduction processing method.
  • the processor 502 is used to provide computing and control capabilities to support the operation of the entire computer device 500 .
  • the internal memory 504 provides an environment for running the computer program 5032 stored in the non-volatile storage medium 503; when the computer program 5032 is executed by the processor 502, the processor 502 can execute the data dimensionality reduction processing method.
  • the network interface 505 is used for network communication, such as providing transmission of data information.
  • FIG. 9 is only a block diagram of a partial structure related to the solution of the present application, and does not constitute a limitation on the computer device 500 to which the solution of the present application is applied.
  • the specific computer device 500 may include more or fewer components than shown, or combine certain components, or have a different arrangement of components.
  • the processor 502 is configured to run the computer program 5032 stored in the memory, so as to realize the corresponding functions in the above-mentioned data dimensionality reduction processing method.
  • the embodiment of the computer device shown in FIG. 9 does not constitute a limitation on the specific structure of the computer device.
  • in other embodiments, the computer device may include more or fewer components than shown in the figure, or combine certain components, or have a different arrangement of components.
  • the computer device may only include a memory and a processor. In such an embodiment, the structures and functions of the memory and the processor are the same as the embodiment shown in FIG. 9 , and details are not repeated here.
  • the processor 502 may be a central processing unit (Central Processing Unit, CPU), and the processor 502 may also be other general-purpose processors, digital signal processors (Digital Signal Processor, DSP), Application Specific Integrated Circuit (ASIC), Field-Programmable Gate Array (FPGA) or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, etc.
  • the general-purpose processor can be a microprocessor or the processor can also be any conventional processor or the like.
  • a computer-readable storage medium may be a non-volatile computer-readable storage medium.
  • the computer-readable storage medium stores a computer program, wherein when the computer program is executed by the processor, the steps included in the above-mentioned data dimensionality reduction processing method are implemented.
  • the disclosed apparatus, apparatus and method may be implemented in other manners.
  • the apparatus embodiments described above are only illustrative.
  • the division of the units is only a division by logical function; in actual implementation there may be other division methods, for example units with the same function may be grouped into one unit, multiple units or components may be combined or integrated into another system, or some features may be omitted or not implemented.
  • the shown or discussed mutual coupling or direct coupling or communication connection may be indirect coupling or communication connection through some interfaces, devices or units, and may also be electrical, mechanical or other forms of connection.
  • the units described as separate components may or may not be physically separated, and components displayed as units may or may not be physical units, that is, may be located in one place, or may be distributed to multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solutions of the embodiments of the present application.
  • each functional unit in each embodiment of the present application may be integrated into one processing unit, or each unit may exist physically alone, or two or more units may be integrated into one unit.
  • the above-mentioned integrated units may be implemented in the form of hardware, or may be implemented in the form of software functional units.
  • the integrated unit if implemented in the form of a software functional unit and sold or used as an independent product, may be stored in a computer-readable storage medium.
  • based on this understanding, the technical solution of the present application, in essence or in the part contributing to the prior art, or in whole or in part, may be embodied in the form of a software product; the computer software product is stored in a computer-readable storage medium and includes several instructions to cause a computer device (which may be a personal computer, a server, or a network device, etc.) to execute all or part of the steps of the methods described in the various embodiments of the present application.
  • the aforementioned computer-readable storage medium includes various media that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a magnetic disk, or an optical disc.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present application discloses a data dimensionality reduction processing method and apparatus, a computer device, and a storage medium. The method comprises: quantifying the sample data in a sample data set according to information quantification rules to obtain feature quantization information; determining the number of dimensions according to the feature quantization information and a dimension reduction ratio value; grouping the sample data according to the number of dimensions and the feature quantization information and obtaining initial centroids; iteratively correcting the initial centroid of each group of sample data to obtain corresponding target centroids; and calculating the distance feature value between each piece of sample data and each target centroid as the dimensionality reduction feature of that sample data. The present application is based on data dimensionality reduction processing technology and belongs to the field of big data processing. While all attribute information of the sample data is retained, dimensionality reduction of the attribute information is achieved; analyzing and processing the sample data on the basis of the reduced features ensures the accuracy of the analysis and processing and greatly improves its efficiency.

Description

数据降维处理方法、装置、计算机设备及存储介质
本申请要求于2020年11月05日提交中国专利局、申请号为202011223586.X,发明名称为“数据降维处理方法、装置、计算机设备及存储介质”的中国专利申请的优先权,其全部内容通过引用结合在本申请中。
技术领域
本申请涉及大数据处理技术领域,属于智慧城市中对样本数据进行降维处理的应用场景,尤其涉及一种数据降维处理方法、装置、计算机设备及存储介质。
背景技术
随着网络科学技术的发展,网络已成为人们获取信息的重要途径,但是随着大数据时代到来导致的信息爆炸,使得计算机在信息搜寻时的任务负担加重,效率低下,准确获取到有效信息较为困难。通常而言,为了体现数据的特征并基于数据的特征方便地对数据进行准确筛选或分类,通常会获取与该数据相关的多个维度的属性信息用于对数据的特征进行描述,而数据的维度越多,计算机进行计算的任务负担则越重,则准确获取所需数据的效率越低。传统技术方法中,通常会对数据的多个维度属性进行针对性筛选,以保留对类别增益较高的属性,对数据的属性进行降维后可大大增加处理效率,但发明人发现这一对数据的属性进行筛选的方式由于无法保留数据的全部属性信息,导致在对数据进行筛选或分类时对准确性造成影响。因此,现有技术方法在对数据的属性进行降维后,存在无法完全保留数据全部属性信息的问题。
发明内容
本申请实施例提供了一种数据降维处理方法、装置、计算机设备及存储介质,旨在解决现有技术方法在对数据的属性进行降维时所存在的无法完全保留数据全部属性信息的问题。
第一方面,本申请实施例提供了一种数据降维处理方法,其包括:
若接收到用户输入的样本数据集,根据预置的信息量化规则对所述样本数据集所包含的样本数据进行量化得到每一所述样本数据的特征量化信息;
接收用户所输入的降维比例值,根据所述特征量化信息中特征向量的维度及所述降维比例值计算得到维度数量;
根据所述维度数量及每一所述样本数据的特征量化信息对所述样本数据进行分组得到多组样本数据并确定每组样本数据的初始质心;
根据所述多组样本数据对所述初始质心进行迭代修正,以获取与每一初始质心相匹配的目标质心;
计算所述样本数据集中每一样本数据与每一所述目标质心之间的距离特征值,以得到每一所述样本数据的降维特征。
第二方面,本申请实施例提供了一种数据降维处理装置,其包括:
特征量化信息获取单元,用于若接收到用户输入的样本数据集,根据预置的信息量化规则对所述样本数据集所包含的样本数据进行量化得到每一所述样本数据的特征量化信息;
维度数量确定单元,用于接收用户所输入的降维比例值,根据所述特征量化信息中特征向量的维度及所述降维比例值计算得到维度数量;
初始质心确定单元,用于根据所述维度数量及每一所述样本数据的特征量化信息对所述样本数据进行分组得到多组样本数据并确定每组样本数据的初始质心;
质心迭代修正单元,用于根据所述多组样本数据对所述初始质心进行迭代修正,以获取与每一初始质心相匹配的目标质心;
降维特征获取单元,用于计算所述样本数据集中每一样本数据与每一所述目标质心之间的距离特征值,以得到每一所述样本数据的降维特征。
第三方面,本申请实施例又提供了一种计算机设备,其包括存储器、处理器及存储在所述存储器上并可在所述处理器上运行的计算机程序,所述处理器执行所述计算机程序时实现上述第一方面所述的数据降维处理方法。
第四方面,本申请实施例还提供了一种计算机可读存储介质,其中所述计算机可读存储介质存储有计算机程序,所述计算机程序当被处理器执行时使所述处理器执行上述第一方面所述的数据降维处理方法。
本申请实施例提供了一种数据降维处理方法、装置、计算机设备及存储介质。根据信息量化规则对样本数据集中的样本数据进行量化得到特征量化信息,根据特征量化信息及降维比例值确定维度数量,根据维度数量及特征量化信息对样本数据进行分组并获取初始质心,对每组样本数据的初始执行进行迭代修正得到对应的目标质心,计算每一样本数据与每一目标质心之间的距离特征值作为每一样本数据的降维特征。通过上述方法,在保留样本数据的全部属性信息的同时,实现了对样本数据的属性信息进行降维处理,在后续对样本数据进行筛选或分类等分析处理时,可基于降维特征对样本数据进行高效处理,由于保留了全部属性信息,可确保对数据进行分析处理的准确性并大幅提高对样本数据进行处理的效率。
附图说明
为了更清楚地说明本申请实施例技术方案,下面将对实施例描述中所需要使用的附图作简单地介绍,显而易见地,下面描述中的附图是本申请的一些实施例,对于本领域普通技术人员来讲,在不付出创造性劳动的前提下,还可以根据这些附图获得其他的附图。
图1为本申请实施例提供的数据降维处理方法的流程示意图;
图2为本申请实施例提供的数据降维处理方法的子流程示意图;
图3为本申请实施例提供的数据降维处理方法的另一子流程示意图;
图4为本申请实施例提供的数据降维处理方法的另一子流程示意图;
图5为本申请实施例提供的数据降维处理方法的另一子流程示意图;
图6为本申请实施例提供的数据降维处理方法的另一子流程示意图;
图7为本申请实施例提供的数据降维处理方法的另一流程示意图;
图8为本申请实施例提供的数据降维处理装置的示意性框图;
图9为本申请实施例提供的计算机设备的示意性框图。
具体实施方式
下面将结合本申请实施例中的附图,对本申请实施例中的技术方案进行清楚、完整地描述,显然,所描述的实施例是本申请一部分实施例,而不是全部的实施例。基于本申请中的实施例,本领域普通技术人员在没有做出创造性劳动前提下所获得的所有其他实施例,都属于本申请保护的范围。
应当理解,当在本说明书和所附权利要求书中使用时,术语“包括”和“包含”指示所描述特征、整体、步骤、操作、元素和/或组件的存在,但并不排除一个或多个其它特征、整体、步骤、操作、元素、组件和/或其集合的存在或添加。
还应当理解,在此本申请说明书中所使用的术语仅仅是出于描述特定实施例的目的而并不意在限制本申请。如在本申请说明书和所附权利要求书中所使用的那样,除非上下文清楚地指明其它情况,否则单数形式的“一”、“一个”及“该”意在包括复数形式。
还应当进一步理解,在本申请说明书和所附权利要求书中使用的术语“和/或”是指相关联列出的项中的一个或多个的任何组合以及所有可能组合,并且包括这些组合。
请参阅图1,图1是本申请实施例提供的数据降维处理方法的流程示意图,该数据降维处理方法应用于用户终端中,该方法通过安装于用户终端中的应用软件进行执行,用户终端即是用于执行数据降维处理方法以完成对样本数据进行降维处理的终端设备,例如台式电脑、笔记本电脑、平板电脑或手机等。如图1所示,该方法包括步骤S110~S150。
S110、若接收到用户输入的样本数据集,根据预置的信息量化规则对所述样本数据集所包含的样本数据进行量化得到每一所述样本数据的特征量化信息。
若接收到用户输入的样本数据集,根据预置的信息量化规则对所述样本数据集所包含的样本数据进行量化得到每一样本数据的特征量化信息。其中,所述样本数据包含多项属性信息,所述量化规则包含多个量化项目,信息量化规则即为对样本数据集中样本数据的多项属性信息进行量化的具体规则,可将样本数据库中每一条样本数据的属性信息转换为特征量化信息进行量化表示,信息量化规则中的量化项目可等于或少于样本数据的属性信息项数。例如,若用户输入的样本数据集为客户信息数据集,客户信息数据集中的客户信息可包括客户性别、年龄、职业、兴趣爱好、月收入、婚姻状态、生育状态等属性信息,则可将客户信息数据集中每一客户的客户信息转换为特征量化信息进行量化表示。
在一实施例中,如图2所示,步骤S110包括子步骤S111、S112和S113。
S111、判断所述样本数据中与所述信息量化规则的每一量化项目对应的属性信息是否为数值;S112、若所述量化项目对应的属性信息为数值,根据所述量化项目的激活函数对所述属性信息进行计算得到所述属性信息的量化值;S113、若所述量化项目对应的属性信息不为数值,获取所述量化项目中与所述属性信息相匹配的关键字对应的数值作为所述属性信息的量化值。
样本数据集中的样本数据均包含多项属性信息,信息量化规则中每一量化项目均与一项属性信息相匹配,每一量化项目均可将样本数据中对应的一项属性信息转换为一个量化值进行表示,每一条样本数据对应的多个量化值即可组合为该样本数据的特征量化信息,特征量化信息可表示为一个多维的特征向量,也即是每一项属性信息对应特征量化信息中的一个维 度的特征向量,对每一量化项目对应的一项属性信息进行量化所得到量化值的范围均为[0,1]。具体的,可对属性信息是否为数值进行判断,若属性信息为数值则通过信息量化规则中与该属性信息相匹配的激活函数计算得到属性信息的量化值,若属性信息不为数值,则获取信息量化规则中与该属性信息相匹配的关键字所对应的数值作为该属性信息的量化值。
对于与量化项目对应的属性信息以数值方式表示的情况,信息量化规则中对应的量化规则为一个激活函数及一个中间值,根据激活函数对中间值及该量化项目的一个属性信息进行计算,即可得到对应的量化值。
例如,以样本数据集为客户信息数据集为例,激活函数可表示为:
Figure PCTCN2021091289-appb-000001
其中,x为与量化项目对应的一项信息,v为与该量化项目对应的中间值。与年龄这一量化项目对应的中间值为v=35,样本数据集中某客户信息的年龄为x=30,则根据上述激活函数计算得到对应的量化值为0.5357。信息量化规则的职业这一量化项目中包含“学生”、“医生”、“教师”及“程序员”四个关键字,与“学生”对应的数值为“0”、与“医生”对应的数值为“0.25”,与“教师”对应的数值为“0.6”,与“程序员”对应的数值为“1”,样本数据集中某客户信息的职业为教师,则对应的量化值为“0.6”。
S120、接收用户所输入的降维比例值,根据所述特征量化信息中特征向量的维度及所述降维比例值计算得到维度数量。
用户可直接输入降维比例值,通过降维比例值及特征量化信息中特征向量的维度直接计算得到维度数量。具体的,将降维比例值与特征向量的维度数相乘并对乘积进行取整,即可计算得到维度数量,正常情况下,所得到的维度数量远小于特征量化信息中特征向量的维度数。
例如,特征量化信息中特征向量的维度数为41,降维比例值为0.15,计算41×0.15=6.15,取整后得到维度数量为6。
S130、根据所述维度数量及每一所述样本数据的特征量化信息对所述样本数据进行分组得到多组样本数据并确定每组样本数据的初始质心。
根据所述维度数量及每一所述样本数据的特征量化信息对所述样本数据进行分组得到多组样本数据并确定每组样本数据的初始质心。维度数量可由用户输入的降维比例值并结合特征量化信息中特征向量的维度数计算得到,维度数量即为对特征量化信息中特征向量的维度进行降维处理后所得降维特征中维度的数量信息。可通过维度数量对样本数据进行分组得到多组样本数据,并确定每一组样本数据的初始质心,则所得到的初始质心的数量与维度数量相等。
在一实施例中,如图3所示,步骤S130包括子步骤S131和S132。
S131、根据所述维度数量对所述样本数据进行随机分组得到多组样本数据;S132、从每组样本数据中分别获取一条样本数据的特征量化信息作为每组样本数据的初始质心。
具体的,可根据维度数量对所有样本数据进行随机分组,得到多组样本数据,每组样本数据所包含样本数据的数量基本相等,从分组得到的多组样本数据中分别获取一条样本数据的特征量化信息,作为每组样本数据的初始质心。
在一实施例中,如图4所示,步骤S130还可以包括子步骤S1301、S1302和S132。
S1301、根据预置的分组模板、所述特征量化信息及所述维度数量构建数据分组模型;S1302、将每一条所述样本数据的特征量化信息依次输入所述数据分组模型以对所述样本数据进行分组,得到多组样本数据;S132、从每组样本数据中分别获取一条样本数据的特征量化信息作为每组样本数据的初始质心。
具体的,可根据分组模板、特征量化信息及维度数量构建数据分组模型,通过数据分组模型对样本数据进行分组得到多组样本数据,具体的,分组模板中可以包括全连接层,可基于特征量化信息构建得到输入节点、基于维度数量构建得到输出节点,将输入节点、输出节点及全连接层进行组合,即可得到基于神经网络的数据分组模型,则每一输入节点均对应特征量化信息中一个维度的量化值,每一输出节点均对应一个分组。输入节点与输出节点之间通过全连接层进行连接,全连接层中包含多个特征单元,输入节点与全连接层之间设置有第一公式组,输出节点与全连接层之间设置有第二公式组。其中,第一公式组包含所有输入节点至所有特征单元的公式,第一公式组中的公式均以输入节点值作为输入值、特征单元值作为输出值,第二公式组包含所有输出节点至所有特征单元的公式,第二公式组中的公式均以特征单元值作为输入值、输出节点值作为输出值,所得到的数据分类模型中所包含的每一公式中均拥有对应的参数值。输出节点值也即是特征量化信息与该输出节点对应的分组之间的匹配概率,根据计算得到某一样本数据的特征量化信息与每一输出节点的匹配概率,选择匹配概率最高的一个输出节点所对应的分组作为与该样本数据相匹配的分组,即可实现对样本数据进行分组并得到多组样本数据,此时所得到的每组样本数据所包含样本数据的数量可能存在较大差别。
在一实施例中,如图5所示,步骤S1302之前还可以包括步骤S1303。
S1303、若接收到用户输入的训练数据集,根据预置的梯度下降训练模型及所述训练数据集对所述数据分组模型进行迭代训练以得到训练后的数据分组模型。
其中输入训练数据集的即为用户终端的使用者,为使数据分组模型在对特征量化信息进行分组时可以有更高的准确率,需对数据分组模型进行迭代训练,也即是对数据分组模型的第一公式组及第二公式组中的参数值进行调整,训练后所得到的数据分组模型可以对特征量化信息进行更精准的分组。梯度下降训练模型即为对数据分组模型进行训练的模型,梯度下降训练模型中包括损失值计算公式及梯度计算公式,训练数据集中包含多条训练数据,每一条训练数据中均包含一条特征量化信息以及对应的分组标签;将一条特征量化信息输入数据分组模型得到该特征量化信息与每一输出节点对应的匹配概率,根据损失值计算公式及分组标签对输出节点对应的匹配概率进行计算即可得到对应的损失值,根据损失值及梯度计算公式即可计算得到第一公式组及第二公式组中每一参数对应的更新值,通过更新值即可对每一参数对应的参数值进行更新,这一对参数值进行更新的过程即为对数据分组模型进行训练的具体过程。
例如,损失值计算公式可表示为
Figure PCTCN2021091289-appb-000002
其中,f p为数据分组模型中与分组标签对应的一个输出节点的匹配概率,f n为第n个输出节点的匹配概率,f p及f n的取 值范围均为[0,1]。
根据所述梯度计算公式、所述损失值及所述数据分组模型的计算值计算得到所述数据分组模型中每一参数的更新值。具体的,将数据分组模型中一个参数对特征量化信息进行计算所得到的计算值输入梯度计算公式,并结合上述损失值,即可计算得到与该参数对应的更新值,这一计算过程也即为梯度下降计算。
具体的,梯度计算公式可表示为:
Figure PCTCN2021091289-appb-000003
其中,
Figure PCTCN2021091289-appb-000004
为计算得到的参数x的更新值,ω x为参数x的原始参数值,η为梯度计算公式中预置的学习率,
Figure PCTCN2021091289-appb-000005
为基于损失值及参数x对应的计算值对该参数x的偏导值(这一计算过程中需使用参数对应的计算值)。
基于所计算得到更新值对数据分组模型中每一参数的参数值对应更新,即完成对数据分组模型的一次训练过程。基于一次训练后所得到的数据分组模型对训练数据集中另一条训练数据再次进行计算处理,并重复上述训练过程,即可实现对数据分组模型进行迭代训练;当所计算得到的损失值小于预设的损失阈值或训练数据集中条训练数据均被用于训练后,即终止训练过程得到训练后的数据分组模型。
S140、根据所述多组样本数据对所述初始质心进行迭代修正,以获取与每一初始质心相匹配的目标质心。
根据所述多组样本数据对所述初始质心进行迭代修正,以获取与每一初始质心相匹配的目标质心。每组样本数据中均包含多条样本数据,可基于多组样本数据及初始质心进行迭代计算,对初始质心进行迭代修正得到对应的目标质心,每一初始质心经过迭代修正后得到对应的一个目标质心。
在一实施例中,如图6所示,步骤S140还包括子步骤S141、S142、S143、S144、S145和S146。
S141、获取每一组样本数据中的样本数据与每一所述初始质心之间的距离值。
具体的,可计算一组样本数据中每一样本数据与每一初始质心之间的欧式距离,计算一条样本数据与一个初始质心之间的欧式距离可采用公式(1)计算得到:
Figure PCTCN2021091289-appb-000006
其中,某一条样本数据为C={c 1,c 2,…,c M},初始质心为O={o 1,o 2,…,o M},M为特征量化信息所包含的特征向量的维度数。
S142、根据所述样本数据与每一所述初始质心之间的距离值对所述样本数据进行重新分组。
计算得到每一条样本数据与每一初始质心之间的距离值之后,可选择样本数据的多个距离值中最小距离值的初始质心对该样本数据进行重新分组,对每一条样本数据进行重新分组后,得到重新分组的多组样本数据。重新分组只会调整样本数据的分组,而不会改变分组数量。
S143、计算重新分组的每组样本数据的特征量化平均值作为相应的修正质心。
重新分组后,计算每组样本数据的特征量化平均值,也即是计算每组样本数据中所包含样本数据的特征量化信息的平均值,将计算得到的特征量化平均值作为每组样本数据对应的修正质心。
在对初始质心进行预设次数的迭代修正后,可将迭代修正的修正质心作为与初始质心相对应的目标质心;还可通过预设迭代条件判断重新分组的每组样本数据是否满足预设迭代条件,若满足则继续进行迭代修正,若不满足则可将迭代修正的修正质心作为与初始质心相对应的目标质心。
S144、判断重新分组的每组样本数据是否满足预设迭代条件。
可对重新分组后所得的每组样本数据是否满足预设迭代条件进行判断,迭代条件中可配置距离阈值或均方误差阈值,可计算某一组样本数据中所有样本数据与该组样本数据的修正质心之间的距离值,计算距离值的平均值得到平均距离值,判断每组样本数据的平均距离值中是否有大于距离阈值的平均距离值,若有,则判定满足迭代条件;若每组样本数据的平均距离值均不大于距离阈值,判定不满足迭代条件。还可计算每一组样本数据中所有样本数据的与该组样本数据的修正质心之间的距离值,计算每组样本数据距离值的均方误差值,判断每组样本数据的均方误差值中是否有大于均方误差阈值的均方误差值,若有,则判定满足迭代条件;若每组样本数据的均方误差值均不大于均方误差阈值,判断不满足迭代条件。若迭代条件中同时包含距离阈值和均方误差阈值,则可通过两个阈值对多组样本数据进行综合判断,若有一组样本数据大于距离阈值或均方误差阈值,则判定满足迭代条件,否则判定不满足迭代条件。
S145、若满足所述迭代条件,将所述修正质心作为初始质心并返回执行所述获取每一组样本数据中的样本数据与每一所述初始质心之间的距离值的步骤;S146、若不满足所述迭代条件,将所述修正质心作为目标质心。
若满足迭代条件,则返回执行步骤S141继续进行迭代修正,若不满足迭代条件,将当前所得到的修正质心作为与初始质心相对应的目标质心。
在一实施例中,如图7所示,步骤S140之后还包括步骤S1401。
S1401、根据所述样本数据集中每一样本数据的属性信息确定每一所述目标质心的特征标签。
对初始质心进行迭代修正的过程中,也同时存在对样本数据的重新分组,每一目标质心对应一组样本数据,基于进行重新分组后所得的多组样本数据的属性信息可确定得到每一目标质心的特征标签。具体的,可对每组样本数据的属性信息进行统计,得到每一组样本数据与每一项属性信息对应的统计结果,获取一组样本数据的统计结果中属性值超过预设比例值的属性值作为该组样本数据的特征标签,也即可确定得到每一目标质心的特征标签,目标质心的特征标签可用于对与该目标质心相对应的一组样本数据的特征进行表征,通过一组样本数据的特征标签即可了解该组样本数据整体的特征信息。
例如,预设比例值为75%,统计结果中性别这一项属性信息的属性值中,男性比例为20%,女性比例为80%,性别中女性超过预设比例,则将该属性值作为该组样本数据对应的目标质 心的特征标签。
S150、计算所述样本数据集中每一样本数据与每一所述目标质心之间的距离特征值,以得到每一所述样本数据的降维特征。
计算所述样本数据集中每一样本数据与每一所述目标质心之间的距离特征值,以得到每一所述样本数据的降维特征。具体的,根据样本数据的特征量化信息计算每一样本数据与目标质心之间的距离特征值,对距离特征值进行计算的具体过程也即是计算两者之间的欧式距离,将一条样本数据与每一目标质心之间的距离特征值组合,即可得到该样本数据的降维特征,目标质心的数量等于维度数量,则降维特征中所包含的距离特征值的数量也与维度数量相等,降维特征可通过与维度数量相等的一个多维特征向量进行表示。
例如,计算得到某一样本数据的降维特征为Jx={12.20,5.31,28.66,10.79,19.83,4.47},则实现了将该样本数据的特征量化信息所包含的41维特征向量进行降维处理,得到一个6维的特征向量作为该样本数据的降维特征Jx。
此外,还可通过样本数据的降维特征与该样本数据所属样本数据组的特征标签进行组合,得到样本数据的整体特征信息,样本数据的整体特征信息中的降维特征用于对该样本数据的特征进行量化表示,整体特征信息中的特征标签用于对该样本数据的特征以文字形式进行表征。
通过上述方法计算得到每一样本数据的降维特征,在保留样本数据的全部属性信息的同时,实现了对样本数据的属性信息进行降维处理,在后续对样本数据进行筛选或分类等分析处理时,可基于降维特征对样本数据进行高效处理,由于保留了全部属性信息,可确保对数据进行分析处理的准确性并大幅提高对样本数据进行处理的效率。
本申请中的技术方法可应用于智慧政务/智慧城管/智慧社区/智慧安防/智慧物流/智慧医疗/智慧教育/智慧环保/智慧交通等包含对样本数据进行降维处理的应用场景中,从而推动智慧城市的建设。
在本申请实施例所提供的数据降维处理方法中,根据信息量化规则对样本数据集中的样本数据进行量化得到特征量化信息,根据特征量化信息及降维比例值确定维度数量,根据维度数量及特征量化信息对样本数据进行分组并获取初始质心,对每组样本数据的初始执行进行迭代修正得到对应的目标质心,计算每一样本数据与每一目标质心之间的距离特征值作为每一样本数据的降维特征。通过上述方法,在保留样本数据的全部属性信息的同时,实现了对样本数据的属性信息进行降维处理,在后续对样本数据进行筛选或分类等分析处理时,可基于降维特征对样本数据进行高效处理,由于保留了全部属性信息,可确保对数据进行分析处理的准确性并大幅提高对样本数据进行处理的效率。
本申请实施例还提供一种数据降维处理装置,该数据降维处理装置用于执行前述数据降维处理方法的任一实施例。具体地,请参阅图8,图8是本申请实施例提供的数据降维处理装置的示意性框图。该数据降维处理装置可以配置于用户终端中。
如图8所示,数据降维处理装置100包括特征量化信息获取单元110、维度数量确定单元120、初始质心确定单元130、质心迭代修正单元140和降维特征获取单元150。
特征量化信息获取单元110,用于若接收到用户输入的样本数据集,根据预置的信息量化规则对所述样本数据集所包含的样本数据进行量化得到每一所述样本数据的特征量化信息。
在一实施例中,所述特征量化信息获取单元110包括子单元:属性信息判断单元、第一量化处理单元和第二量化处理单元。
属性信息判断单元,用于判断所述样本数据中与所述信息量化规则的每一量化项目对应的属性信息是否为数值;第一量化处理单元,用于若所述量化项目对应的属性信息为数值,根据所述量化项目的激活函数对所述属性信息进行计算得到所述属性信息的量化值;第二量化处理单元,用于若所述量化项目对应的属性信息不为数值,获取所述量化项目中与所述属性信息相匹配的关键字对应的数值作为所述属性信息的量化值。
维度数量确定单元120,用于接收用户所输入的降维比例值,根据所述特征量化信息中特征向量的维度及所述降维比例值计算得到维度数量。
初始质心确定单元130,用于根据所述维度数量及每一所述样本数据的特征量化信息对所述样本数据进行分组得到多组样本数据并确定每组样本数据的初始质心。
在一实施例中,所述初始质心确定单元130包括子单元:随机分组单元和初始质心获取单元。
随机分组单元,用于根据所述维度数量对所述样本数据进行随机分组得到多组样本数据;初始质心获取单元,用于从每组样本数据中分别获取一条样本数据的特征量化信息作为每组样本数据的初始质心。
在一实施例中,所述初始质心确定单元130包括子单元:数据分组模型构建单元、样本数据分组单元和初始质心获取单元。
数据分组模型构建单元,用于根据预置的分组模板、所述特征量化信息及所述维度数量构建数据分组模型;样本数据分组单元,用于将每一条所述样本数据的特征量化信息依次输入所述数据分组模型以对所述样本数据进行分组,得到多组样本数据;初始质心获取单元,用于从每组样本数据中分别获取一条样本数据的特征量化信息作为每组样本数据的初始质心。
在一实施例中,所述初始质心确定单元130还包括子单元:数据分组模型训练单元。
数据分组模型训练单元,用于若接收到用户输入的训练数据集,根据预置的梯度下降训练模型及所述训练数据集对所述数据分组模型进行迭代训练以得到训练后的数据分组模型。
质心迭代修正单元140,用于根据所述多组样本数据对所述初始质心进行迭代修正,以获取与每一初始质心相匹配的目标质心。
在一实施例中,所述质心迭代修正单元140包括子单元:距离值获取单元、重新分组单元、修正质心获取单元、迭代判断单元、返回执行单元和目标执行获取单元。
距离值获取单元,用于获取每一组样本数据中的样本数据与每一所述初始质心之间的距离值;重新分组单元,用于根据所述样本数据与每一所述初始质心之间的距离值对所述样本数据进行重新分组;修正质心获取单元,用于计算重新分组的每组样本数据的特征量化平均值作为相应的修正质心;迭代判断单元,用于判断重新分组的每组样本数据是否满足预设迭代条件;返回执行单元,用于若满足所述迭代条件,将所述修正质心作为初始质心并返回执 行所述获取每一组样本数据中的样本数据与每一所述初始质心之间的距离值的步骤;目标执行获取单元,用于若不满足所述迭代条件,将所述修正质心作为目标质心。
在一实施例中,所述数据降维处理装置100还包括子单元:特征标签获取单元。
特征标签获取单元,用于根据所述样本数据集中每一样本数据的属性信息确定每一所述目标质心的特征标签。
降维特征获取单元150,用于计算所述样本数据集中每一样本数据与每一所述目标质心之间的距离特征值,以得到每一所述样本数据的降维特征。
在本申请实施例所提供的数据降维处理装置应用上述数据降维处理方法,根据信息量化规则对样本数据集中的样本数据进行量化得到特征量化信息,根据特征量化信息及降维比例值确定维度数量,根据维度数量及特征量化信息对样本数据进行分组并获取初始质心,对每组样本数据的初始执行进行迭代修正得到对应的目标质心,计算每一样本数据与每一目标质心之间的距离特征值作为每一样本数据的降维特征。通过上述方法,在保留样本数据的全部属性信息的同时,实现了对样本数据的属性信息进行降维处理,在后续对样本数据进行筛选或分类等分析处理时,可基于降维特征对样本数据进行高效处理,由于保留了全部属性信息,可确保对数据进行分析处理的准确性并大幅提高对样本数据进行处理的效率。
The above data dimension reduction processing apparatus may be implemented in the form of a computer program, and the computer program may run on a computer device as shown in FIG. 9.
Refer to FIG. 9, which is a schematic block diagram of a computer device provided by an embodiment of the present application. The computer device may be a user terminal configured to perform the data dimension reduction processing method so as to perform dimension reduction on sample data.
Referring to FIG. 9, the computer device 500 includes a processor 502, a memory and a network interface 505 connected via a system bus 501, where the memory may include a non-volatile storage medium 503 and an internal memory 504.
The non-volatile storage medium 503 may store an operating system 5031 and a computer program 5032. When executed, the computer program 5032 may cause the processor 502 to perform the data dimension reduction processing method.
The processor 502 is configured to provide computing and control capabilities to support the operation of the entire computer device 500.
The internal memory 504 provides an environment for running the computer program 5032 stored in the non-volatile storage medium 503; when the computer program 5032 is executed by the processor 502, it may cause the processor 502 to perform the data dimension reduction processing method.
The network interface 505 is used for network communication, such as the transmission of data information. Those skilled in the art will understand that the structure shown in FIG. 9 is merely a block diagram of part of the structure related to the solution of the present application and does not limit the computer device 500 to which the solution is applied; a specific computer device 500 may include more or fewer components than shown in the figure, combine certain components, or have a different arrangement of components.
The processor 502 is configured to run the computer program 5032 stored in the memory, so as to implement the corresponding functions of the foregoing data dimension reduction processing method.
Those skilled in the art will understand that the embodiment of the computer device shown in FIG. 9 does not limit the specific configuration of the computer device; in other embodiments, the computer device may include more or fewer components than shown, combine certain components, or have a different arrangement of components. For example, in some embodiments the computer device may include only a memory and a processor; in such embodiments, the structures and functions of the memory and the processor are the same as those of the embodiment shown in FIG. 9, and are not repeated here.
It should be understood that, in the embodiments of the present application, the processor 502 may be a central processing unit (CPU), or may be another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or another programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. The general-purpose processor may be a microprocessor, or the processor may be any conventional processor.
Another embodiment of the present application provides a computer-readable storage medium. The computer-readable storage medium may be a non-volatile computer-readable storage medium. The computer-readable storage medium stores a computer program which, when executed by a processor, implements the steps included in the foregoing data dimension reduction processing method.
Those skilled in the art can clearly understand that, for convenience and brevity of description, the specific working processes of the devices, apparatuses and units described above may refer to the corresponding processes in the foregoing method embodiments and are not repeated here. Those of ordinary skill in the art will appreciate that the units and algorithm steps of the examples described in conjunction with the embodiments disclosed herein can be implemented by electronic hardware, computer software, or a combination of the two. To clearly illustrate the interchangeability of hardware and software, the composition and steps of the examples have been described above generally in terms of function. Whether these functions are performed by hardware or software depends on the specific application and design constraints of the technical solution. Skilled practitioners may use different methods to implement the described functions for each specific application, but such implementations should not be regarded as going beyond the scope of the present application.
In the several embodiments provided in the present application, it should be understood that the disclosed devices, apparatuses and methods may be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative: the division of the units is only a division by logical function, and there may be other divisions in actual implementation; units with the same function may be combined into one unit; multiple units or components may be combined or integrated into another system; or some features may be omitted or not performed. In addition, the mutual couplings, direct couplings or communication connections shown or discussed may be indirect couplings or communication connections through some interfaces, apparatuses or units, and may also be electrical, mechanical or other forms of connection.
The units described as separate components may or may not be physically separated, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed over multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solutions of the embodiments of the present application.
In addition, the functional units in the embodiments of the present application may be integrated into one processing unit, each unit may exist alone physically, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware or in the form of a software functional unit.
If the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it may be stored in a computer-readable storage medium. Based on such an understanding, the technical solution of the present application, in essence, or the part contributing to the prior art, or all or part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a computer-readable storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or some of the steps of the methods described in the embodiments of the present application. The aforementioned computer-readable storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a magnetic disk or an optical disc.
The above are only specific implementations of the present application, but the protection scope of the present application is not limited thereto. Any person skilled in the art can easily conceive of various equivalent modifications or substitutions within the technical scope disclosed in the present application, and such modifications or substitutions shall all fall within the protection scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (20)

  1. A data dimension reduction processing method, applied to a user terminal, wherein the method comprises:
    if a sample data set input by a user is received, quantifying the sample data contained in the sample data set according to a preset information quantization rule to obtain feature quantization information of each piece of the sample data;
    receiving a dimension reduction ratio value input by the user, and calculating a dimension quantity according to the dimension of the feature vector in the feature quantization information and the dimension reduction ratio value;
    grouping the sample data according to the dimension quantity and the feature quantization information of each piece of the sample data to obtain multiple groups of sample data, and determining an initial centroid of each group of sample data;
    iteratively correcting the initial centroids according to the multiple groups of sample data to obtain a target centroid matching each initial centroid; and
    calculating a distance feature value between each piece of sample data in the sample data set and each of the target centroids to obtain a dimension-reduced feature of each piece of the sample data.
  2. The data dimension reduction processing method according to claim 1, wherein the sample data contains multiple items of attribute information, the quantization rule contains multiple quantization items, and the quantifying the sample data contained in the sample data set according to the preset information quantization rule to obtain the feature quantization information of each piece of the sample data comprises:
    judging whether the attribute information in the sample data corresponding to each quantization item of the information quantization rule is a numeric value;
    if the attribute information corresponding to the quantization item is a numeric value, calculating a quantized value of the attribute information according to an activation function of the quantization item; and
    if the attribute information corresponding to the quantization item is not a numeric value, obtaining, as the quantized value of the attribute information, the numeric value corresponding to the keyword in the quantization item that matches the attribute information.
  3. The data dimension reduction processing method according to claim 1, wherein the grouping the sample data according to the dimension quantity and the feature quantization information of each piece of the sample data to obtain multiple groups of sample data and determining the initial centroid of each group of sample data comprises:
    randomly grouping the sample data according to the dimension quantity to obtain multiple groups of sample data; and
    obtaining, from each group of sample data, the feature quantization information of one piece of sample data as the initial centroid of that group of sample data.
  4. The data dimension reduction processing method according to claim 1, wherein the grouping the sample data according to the dimension quantity and the feature quantization information of each piece of the sample data to obtain multiple groups of sample data and determining the initial centroid of each group of sample data comprises:
    constructing a data grouping model according to a preset grouping template, the feature quantization information and the dimension quantity;
    inputting the feature quantization information of each piece of the sample data into the data grouping model in turn to group the sample data and obtain multiple groups of sample data; and
    obtaining, from each group of sample data, the feature quantization information of one piece of sample data as the initial centroid of that group of sample data.
  5. The data dimension reduction processing method according to claim 4, wherein before the inputting the feature quantization information of each piece of the sample data into the data grouping model in turn to group the sample data and obtain multiple groups of sample data, the method further comprises:
    if a training data set input by the user is received, iteratively training the data grouping model according to a preset gradient descent training model and the training data set to obtain a trained data grouping model.
  6. The data dimension reduction processing method according to claim 1, wherein the iteratively correcting the initial centroids according to the multiple groups of sample data to obtain the target centroid matching each initial centroid comprises:
    obtaining a distance value between the sample data in each group of sample data and each of the initial centroids;
    regrouping the sample data according to the distance values between the sample data and each of the initial centroids;
    calculating an average of the feature quantization information of each regrouped group of sample data as a corresponding corrected centroid;
    judging whether each regrouped group of sample data satisfies a preset iteration condition;
    if the iteration condition is satisfied, taking the corrected centroid as the initial centroid and returning to the step of obtaining the distance value between the sample data in each group of sample data and each of the initial centroids; and
    if the iteration condition is not satisfied, taking the corrected centroid as the target centroid.
  7. The data dimension reduction processing method according to claim 1, wherein after the iteratively correcting the initial centroids according to the multiple groups of sample data to obtain the target centroid matching each initial centroid, the method further comprises:
    determining a feature label of each of the target centroids according to the attribute information of each piece of sample data in the sample data set.
  8. A data dimension reduction processing apparatus, comprising:
    a feature quantization information acquisition unit, configured to, if a sample data set input by a user is received, quantify the sample data contained in the sample data set according to a preset information quantization rule to obtain feature quantization information of each piece of the sample data;
    a dimension quantity determination unit, configured to receive a dimension reduction ratio value input by the user, and calculate a dimension quantity according to the dimension of the feature vector in the feature quantization information and the dimension reduction ratio value;
    an initial centroid determination unit, configured to group the sample data according to the dimension quantity and the feature quantization information of each piece of the sample data to obtain multiple groups of sample data and determine an initial centroid of each group of sample data;
    a centroid iterative correction unit, configured to iteratively correct the initial centroids according to the multiple groups of sample data to obtain a target centroid matching each initial centroid; and
    a dimension-reduced feature acquisition unit, configured to calculate a distance feature value between each piece of sample data in the sample data set and each of the target centroids to obtain a dimension-reduced feature of each piece of the sample data.
  9. A computer device, comprising a memory, a processor and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the computer program, implements the following steps:
    if a sample data set input by a user is received, quantifying the sample data contained in the sample data set according to a preset information quantization rule to obtain feature quantization information of each piece of the sample data;
    receiving a dimension reduction ratio value input by the user, and calculating a dimension quantity according to the dimension of the feature vector in the feature quantization information and the dimension reduction ratio value;
    grouping the sample data according to the dimension quantity and the feature quantization information of each piece of the sample data to obtain multiple groups of sample data, and determining an initial centroid of each group of sample data;
    iteratively correcting the initial centroids according to the multiple groups of sample data to obtain a target centroid matching each initial centroid; and
    calculating a distance feature value between each piece of sample data in the sample data set and each of the target centroids to obtain a dimension-reduced feature of each piece of the sample data.
  10. The computer device according to claim 9, wherein the sample data contains multiple items of attribute information, the quantization rule contains multiple quantization items, and the quantifying the sample data contained in the sample data set according to the preset information quantization rule to obtain the feature quantization information of each piece of the sample data comprises:
    judging whether the attribute information in the sample data corresponding to each quantization item of the information quantization rule is a numeric value;
    if the attribute information corresponding to the quantization item is a numeric value, calculating a quantized value of the attribute information according to an activation function of the quantization item; and
    if the attribute information corresponding to the quantization item is not a numeric value, obtaining, as the quantized value of the attribute information, the numeric value corresponding to the keyword in the quantization item that matches the attribute information.
  11. The computer device according to claim 9, wherein the grouping the sample data according to the dimension quantity and the feature quantization information of each piece of the sample data to obtain multiple groups of sample data and determining the initial centroid of each group of sample data comprises:
    randomly grouping the sample data according to the dimension quantity to obtain multiple groups of sample data; and
    obtaining, from each group of sample data, the feature quantization information of one piece of sample data as the initial centroid of that group of sample data.
  12. The computer device according to claim 9, wherein the grouping the sample data according to the dimension quantity and the feature quantization information of each piece of the sample data to obtain multiple groups of sample data and determining the initial centroid of each group of sample data comprises:
    constructing a data grouping model according to a preset grouping template, the feature quantization information and the dimension quantity;
    inputting the feature quantization information of each piece of the sample data into the data grouping model in turn to group the sample data and obtain multiple groups of sample data; and
    obtaining, from each group of sample data, the feature quantization information of one piece of sample data as the initial centroid of that group of sample data.
  13. The computer device according to claim 12, wherein before the inputting the feature quantization information of each piece of the sample data into the data grouping model in turn to group the sample data and obtain multiple groups of sample data, the following is further included:
    if a training data set input by the user is received, iteratively training the data grouping model according to a preset gradient descent training model and the training data set to obtain a trained data grouping model.
  14. The computer device according to claim 9, wherein the iteratively correcting the initial centroids according to the multiple groups of sample data to obtain the target centroid matching each initial centroid comprises:
    obtaining a distance value between the sample data in each group of sample data and each of the initial centroids;
    regrouping the sample data according to the distance values between the sample data and each of the initial centroids;
    calculating an average of the feature quantization information of each regrouped group of sample data as a corresponding corrected centroid;
    judging whether each regrouped group of sample data satisfies a preset iteration condition;
    if the iteration condition is satisfied, taking the corrected centroid as the initial centroid and returning to the step of obtaining the distance value between the sample data in each group of sample data and each of the initial centroids; and
    if the iteration condition is not satisfied, taking the corrected centroid as the target centroid.
  15. The computer device according to claim 9, wherein after the iteratively correcting the initial centroids according to the multiple groups of sample data to obtain the target centroid matching each initial centroid, the following is further included:
    determining a feature label of each of the target centroids according to the attribute information of each piece of sample data in the sample data set.
  16. A computer-readable storage medium, wherein the computer-readable storage medium stores a computer program which, when executed by a processor, causes the processor to perform the following operations:
    if a sample data set input by a user is received, quantifying the sample data contained in the sample data set according to a preset information quantization rule to obtain feature quantization information of each piece of the sample data;
    receiving a dimension reduction ratio value input by the user, and calculating a dimension quantity according to the dimension of the feature vector in the feature quantization information and the dimension reduction ratio value;
    grouping the sample data according to the dimension quantity and the feature quantization information of each piece of the sample data to obtain multiple groups of sample data, and determining an initial centroid of each group of sample data;
    iteratively correcting the initial centroids according to the multiple groups of sample data to obtain a target centroid matching each initial centroid; and
    calculating a distance feature value between each piece of sample data in the sample data set and each of the target centroids to obtain a dimension-reduced feature of each piece of the sample data.
  17. The computer-readable storage medium according to claim 16, wherein the sample data contains multiple items of attribute information, the quantization rule contains multiple quantization items, and the quantifying the sample data contained in the sample data set according to the preset information quantization rule to obtain the feature quantization information of each piece of the sample data comprises:
    judging whether the attribute information in the sample data corresponding to each quantization item of the information quantization rule is a numeric value;
    if the attribute information corresponding to the quantization item is a numeric value, calculating a quantized value of the attribute information according to an activation function of the quantization item; and
    if the attribute information corresponding to the quantization item is not a numeric value, obtaining, as the quantized value of the attribute information, the numeric value corresponding to the keyword in the quantization item that matches the attribute information.
  18. The computer-readable storage medium according to claim 16, wherein the grouping the sample data according to the dimension quantity and the feature quantization information of each piece of the sample data to obtain multiple groups of sample data and determining the initial centroid of each group of sample data comprises:
    randomly grouping the sample data according to the dimension quantity to obtain multiple groups of sample data; and
    obtaining, from each group of sample data, the feature quantization information of one piece of sample data as the initial centroid of that group of sample data.
  19. The computer-readable storage medium according to claim 16, wherein the grouping the sample data according to the dimension quantity and the feature quantization information of each piece of the sample data to obtain multiple groups of sample data and determining the initial centroid of each group of sample data comprises:
    constructing a data grouping model according to a preset grouping template, the feature quantization information and the dimension quantity;
    inputting the feature quantization information of each piece of the sample data into the data grouping model in turn to group the sample data and obtain multiple groups of sample data; and
    obtaining, from each group of sample data, the feature quantization information of one piece of sample data as the initial centroid of that group of sample data.
  20. The computer-readable storage medium according to claim 19, wherein before the inputting the feature quantization information of each piece of the sample data into the data grouping model in turn to group the sample data and obtain multiple groups of sample data, the following is further included:
    if a training data set input by the user is received, iteratively training the data grouping model according to a preset gradient descent training model and the training data set to obtain a trained data grouping model.
PCT/CN2021/091289 2020-11-05 2021-04-30 Data dimension reduction processing method and apparatus, computer device and storage medium WO2022095379A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202011223586.X 2020-11-05
CN202011223586.XA CN112348079B (zh) 2020-11-05 2020-11-05 Data dimension reduction processing method and apparatus, computer device and storage medium

Publications (1)

Publication Number Publication Date
WO2022095379A1 true WO2022095379A1 (zh) 2022-05-12

Family

ID=74428443

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/091289 WO2022095379A1 (zh) 2020-11-05 2021-04-30 Data dimension reduction processing method and apparatus, computer device and storage medium

Country Status (2)

Country Link
CN (1) CN112348079B (zh)
WO (1) WO2022095379A1 (zh)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112348079B (zh) * 2020-11-05 2023-10-31 平安科技(深圳)有限公司 数据降维处理方法、装置、计算机设备及存储介质
CN113592662B (zh) * 2021-07-30 2023-07-28 平安科技(深圳)有限公司 数据信息智能化处理方法、装置、设备及介质

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6122628A (en) * 1997-10-31 2000-09-19 International Business Machines Corporation Multidimensional data clustering and dimension reduction for indexing and searching
US6134541A (en) * 1997-10-31 2000-10-17 International Business Machines Corporation Searching multidimensional indexes using associated clustering and dimension reduction information
US7797265B2 (en) * 2007-02-26 2010-09-14 Siemens Corporation Document clustering that applies a locality sensitive hashing function to a feature vector to obtain a limited set of candidate clusters
CN108415958B (zh) * 2018-02-06 2024-06-21 北京陌上花科技有限公司 Weight processing method and apparatus for exponentially weighted VLAD features
CN109242002A (zh) * 2018-08-10 2019-01-18 深圳信息职业技术学院 High-dimensional data classification method, apparatus and terminal device
CN111461180B (zh) * 2020-03-12 2024-07-09 平安科技(深圳)有限公司 Sample classification method and apparatus, computer device and storage medium

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102930533A (zh) * 2012-10-09 2013-02-13 河海大学 Semi-supervised hyperspectral image dimension reduction method based on improved k-means clustering
CN105488303A (zh) * 2015-12-30 2016-04-13 浙江理工大学 Waist-abdomen-hip body shape classification method based on a feature distance set and measuring device therefor
US20180150547A1 * 2016-11-30 2018-05-31 Business Objects Software Ltd. Time series analysis using a clustering based symbolic representation
CN110502691A (zh) * 2019-07-05 2019-11-26 平安科技(深圳)有限公司 Product pushing method and apparatus based on customer classification, and readable storage medium
CN112348079A (zh) * 2020-11-05 2021-02-09 平安科技(深圳)有限公司 Data dimension reduction processing method and apparatus, computer device and storage medium

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115688034A (zh) * 2022-12-30 2023-02-03 浙江图胜数字科技有限公司 Method for extracting and reducing mixed numerical and categorical data
CN115688034B (zh) * 2022-12-30 2023-08-15 浙江图胜数字科技有限公司 Method for extracting and reducing mixed numerical and categorical data
CN117745472A (zh) * 2023-12-21 2024-03-22 江苏省工程勘测研究院有限责任公司 River channel management method and system based on a lightweight sensing model

Also Published As

Publication number Publication date
CN112348079B (zh) 2023-10-31
CN112348079A (zh) 2021-02-09

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21888083

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 21888083

Country of ref document: EP

Kind code of ref document: A1