CN112329838B - Method and device for determining target set category label - Google Patents

Method and device for determining target set category label

Info

Publication number
CN112329838B
Authority
CN
China
Prior art keywords
data
target
determining
category
sets
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202011203745.XA
Other languages
Chinese (zh)
Other versions
CN112329838A (en)
Inventor
徐成国
杨康
周星杰
王硕
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Minglue Artificial Intelligence Group Co Ltd
Original Assignee
Shanghai Minglue Artificial Intelligence Group Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Minglue Artificial Intelligence Group Co Ltd
Priority to CN202011203745.XA
Publication of CN112329838A
Application granted
Publication of CN112329838B

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/23 Clustering techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/22 Matching criteria, e.g. proximity measures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/23 Clustering techniques
    • G06F18/232 Non-hierarchical techniques
    • G06F18/2321 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G06F18/23213 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Probability & Statistics with Applications (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application relates to a method and a device for determining a target set category label. The method comprises the following steps: clustering a plurality of target data according to a target set number N to obtain N first sets, wherein N is a positive integer greater than 2; determining a data category label of the target data in the first set according to a set category label of a second set, wherein the second set comprises a plurality of sample data, and the sample data is data with a sample category label; and determining the target set category label of the first set according to the data category label of the target data in the first set. The method and the device solve the technical problem that a clustering algorithm determines the category of its clustering result with low accuracy.

Description

Method and device for determining target set category label
Technical Field
The present disclosure relates to the field of computers, and in particular, to a method and apparatus for determining a target set category label.
Background
With the development of the Internet, clustering algorithms are widely applied to distinguish unlabeled data. Clustering is unsupervised, does not need a large amount of manual processing, and is convenient to use and simple to implement. However, the application scenarios of a simple clustering algorithm in practical engineering are limited, and it is usually used as an auxiliary algorithm for the final category distinction in an engineering application. Current clustering algorithms can only gather data into sets and cannot determine the set category labels of the clustered sets.
In view of the above problems, no effective solution has been proposed at present.
Disclosure of Invention
The application provides a method and a device for determining a target set class label, which are used for at least solving the technical problem that the accuracy of a clustering algorithm for determining a clustering class result in the related technology is low.
According to an aspect of an embodiment of the present application, there is provided a method for determining a target set category label, including: clustering a plurality of target data according to the number N of target sets to obtain N first sets, wherein N is a positive integer greater than 2; determining a data category label of the target data in the first set according to a set category label of a second set, wherein the second set comprises a plurality of sample data, and the sample data is data with a sample category label; and determining a target set category label of the first set according to the data category label of the target data in the first set.
Optionally, determining the target set category label of the first set according to the data category label of the target data in the first set includes: establishing a confusion matrix according to the data category labels and the first sets, wherein data in rows of the confusion matrix represent identifications of the first sets, data in columns represent the data category labels of the target data in the first sets, and data in data areas of the confusion matrix represent proportions of the data category labels of the target data in the first sets; determining a plurality of target data meeting a target condition in a plurality of columns of the data area of the confusion matrix, wherein the target condition is determined based on the ratio of the sample data contained in each second set, and the rows corresponding to any two target data are different; and determining the data category label in the row corresponding to each piece of target data as the target set category label of the first set in the corresponding column.
Optionally, determining a plurality of the target data satisfying the target condition in a plurality of columns of the data area of the confusion matrix includes: determining a first proportion value of the sample data contained in each of the second sets; and determining a plurality of target data meeting a second proportion value in a plurality of columns of the data area of the confusion matrix, wherein the difference value between the second proportion value and the first proportion value is smaller than a set threshold value.
Optionally, determining the data category label of the target data in the first set according to the set category label of the second set includes: determining the set category labels for each of the second sets; calculating the similarity between each target data and the set category labels of a plurality of second sets, and determining the maximum similarity value in all the similarities; and determining the set category label corresponding to the maximum similarity value as the data category label of the target data.
Optionally, determining the set category labels for each of the second sets includes: performing feature analysis on a plurality of sample data according to a target set number N to obtain N second sets, wherein N is a positive integer greater than 2; determining the set category labels of the second set according to the sample category labels of the sample data in the second set.
Optionally, calculating the similarity between each of the target data and the set category labels of the plurality of the second sets includes: calculating Euclidean distance between each target data and each set category label; and determining the similarity between each target data and each set category label based on the Euclidean distance, wherein the smaller the Euclidean distance is, the larger the similarity determined based on the Euclidean distance is.
According to another aspect of the embodiments of the present application, there is further provided a device for determining a target set category label, including: the clustering module is used for clustering the plurality of target data according to the number N of the target sets to obtain N first sets, wherein N is a positive integer greater than 2; the first determining module is used for determining a data category label of the target data in the first set according to a set category label of a second set, wherein the second set comprises a plurality of sample data, and the sample data is data with a sample category label; and the second determining module is used for determining the target set category label of the first set according to the data category label of the target data in the first set.
Optionally, the second determining module includes: a processing unit, configured to establish a confusion matrix according to the data category labels and the first sets, where data in a row of the confusion matrix represents an identifier of each first set, data in a column represents the data category labels of each target data in each first set, and each data in a data area of the confusion matrix represents a proportion occupied by the data category labels of each target data in the first set; a first determining unit, configured to determine, in a plurality of columns of the data area of the confusion matrix, a plurality of target data that satisfy a target condition, where the target condition is determined based on a ratio of the sample data included in each second set, and rows corresponding to any two target data are different; and a second determining unit, configured to determine the data category label in the row corresponding to each piece of target data as the target set category label of the first set in the corresponding column.
According to another aspect of the embodiments of the present application, there is also provided a storage medium including a stored program that when executed performs the above-described method.
According to another aspect of the embodiments of the present application, there is also provided an electronic device including a memory, a processor, and a computer program stored on the memory and executable on the processor, the processor executing the method described above by the computer program.
In the embodiment of the application, a plurality of target data are clustered according to the target set number N to obtain N first sets, wherein N is a positive integer greater than 2; a data category label of the target data in the first set is determined according to a set category label of a second set, wherein the second set comprises a plurality of sample data, and the sample data is data with a sample category label; and the target set category label of the first set is determined according to the data category label of the target data in the first set. In this way, the target data to be clustered can be simply clustered according to the preset set number to obtain a plurality of first sets, which at this point have no set category labels. The sample data carrying category label information forms the second sets; because the sample data in a second set has sample category labels, the set category label of the second set can be determined. The data category label of the target data in the first set is then determined through the second sets with set category labels, so that the target set category label of each first set after the clustering operation can be determined. This achieves the purpose of determining the target set category labels of all sets obtained by clustering the target data, achieves the technical effect of improving the accuracy with which the clustering algorithm determines the cluster categories, and solves the technical problem that the accuracy of the clustering algorithm in determining the clustering category result is low.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the invention and together with the description, serve to explain the principles of the invention.
In order to more clearly illustrate the embodiments of the invention or the technical solutions of the prior art, the drawings which are used in the description of the embodiments or the prior art will be briefly described, and it will be obvious to a person skilled in the art that other drawings can be obtained from these drawings without inventive effort.
FIG. 1 is a schematic diagram of a hardware environment of a method of determining a target set category label according to an embodiment of the present application;
FIG. 2 is a flow chart of an alternative method of determining a target-set category label according to an embodiment of the present application;
FIG. 3 is a schematic diagram of an alternative confusion matrix according to embodiments of the present application;
FIG. 4 is a flow chart of an alternative determination of target-set category labels according to an embodiment of the present application;
FIG. 5 is a schematic diagram of an alternative target set category label determination apparatus according to an embodiment of the present application;
FIG. 6 is a block diagram of a terminal according to an embodiment of the present application.
Detailed Description
In order to make the present application solution better understood by those skilled in the art, the following description will be made in detail and with reference to the accompanying drawings in the embodiments of the present application, it is apparent that the described embodiments are only some embodiments of the present application, not all embodiments. All other embodiments, which can be made by one of ordinary skill in the art based on the embodiments herein without making any inventive effort, shall fall within the scope of the present application.
It should be noted that the terms "first," "second," and the like in the description and claims of the present application and the above figures are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that embodiments of the present application described herein may be implemented in sequences other than those illustrated or otherwise described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
According to an aspect of the embodiments of the present application, a method embodiment for determining a target set category label is provided.
Alternatively, in the present embodiment, the above-described method of determining the target set category label may be applied to a hardware environment constituted by the terminal 101 and the server 103 as shown in FIG. 1. As shown in FIG. 1, the server 103 is connected to the terminal 101 through a network and may be used to provide services (such as a game service, an application service, etc.) to the terminal or to a client installed on the terminal. A database may be set on the server or independent of the server to provide a data storage service to the server 103. The type of network is not limited, and the terminal 101 is not limited to a PC, a mobile phone, a tablet computer, or the like. The method for determining the target set category label in the embodiment of the present application may be performed by the server 103, by the terminal 101, or by both the server 103 and the terminal 101. When the terminal 101 performs the method, it may be performed by a client installed thereon.
FIG. 2 is a flowchart of an alternative method of determining a target set category label according to an embodiment of the present application. As shown in FIG. 2, the method may include the following steps:
Step S202, clustering a plurality of target data according to the number N of target sets to obtain N first sets, wherein N is a positive integer greater than 2;
Step S204, determining a data category label of the target data in the first set according to a set category label of a second set, wherein the second set comprises a plurality of sample data, and the sample data is data with a sample category label;
Step S206, determining a target set category label of the first set according to the data category label of the target data in the first set.
Through steps S202 to S206, the target data to be clustered can be simply clustered according to the preset number of sets to obtain a plurality of first sets, which at this point have no set category labels. The sample data carrying category label information forms the second sets; because the sample data in a second set has sample category labels, the set category label of the second set can be determined. The data category label of the target data in the first set can then be determined through the second sets with set category labels, and therefore the target set category label of each first set after the clustering operation can be determined. This achieves the purpose of determining the target set category labels of all sets obtained by clustering the target data, achieves the technical effect of improving the accuracy with which the clustering algorithm determines the cluster categories, and solves the technical problem that the accuracy of the clustering algorithm in determining the clustering category result is low.
Alternatively, in this embodiment, the method may be applied to, but is not limited to, the data analysis field, the image processing field, the product recommendation field, and the like. For example, when the method is applied to product recommendation for new users, the users can be scored according to historical product models. However, when recommending to high-score and low-score users, the user scores are ranked by three separate scoring models, so the scores cannot be compared directly, and a clustering judgment combining the high-score or low-score user data with historical sample data is needed. Because the quality of the feature data is low, clustering alone can only produce categories and cannot determine what each category specifically represents (for example, which commodity it represents). With the method, the name of the product to be recommended to the new users can be determined.
In the technical solution provided in step S202, the number N of target sets may be set arbitrarily according to the user requirement, and the first sets, which have no set category labels, are obtained by clustering the target data.
Alternatively, in the present embodiment, the method of clustering the target data may be, but is not limited to, the K-means clustering algorithm, the K-medoids algorithm (a K-center-point algorithm), or the like.
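As a non-limiting illustration of step S202, the following Python sketch clusters target data into N first sets with a plain K-means loop. The function name `k_means`, its parameters, the random initialisation, and the sample points are assumptions made for this example and are not taken from the application; any clustering algorithm that yields N sets could be substituted.

```python
import math
import random

def k_means(points, n_sets, iterations=100, seed=0):
    """Cluster `points` (lists of floats) into `n_sets` first sets.

    Returns (assignments, centroids); assignments[i] is the index of the
    first set that points[i] belongs to.
    """
    rng = random.Random(seed)
    # Assumption: initial centroids are distinct points drawn at random.
    centroids = [list(p) for p in rng.sample(points, n_sets)]
    assignments = [-1] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        new_assignments = [
            min(range(n_sets), key=lambda k: math.dist(p, centroids[k]))
            for p in points
        ]
        if new_assignments == assignments:
            break  # converged: no point changed its set
        assignments = new_assignments
        # Update step: move each centroid to the mean of its members.
        for k in range(n_sets):
            members = [p for p, a in zip(points, assignments) if a == k]
            if members:  # keep the old centroid if a set became empty
                centroids[k] = [sum(col) / len(members) for col in zip(*members)]
    return assignments, centroids

# Hypothetical target data with two features per item.
target_data = [[0.0, 0.1], [0.2, 0.0], [5.0, 5.1],
               [5.2, 4.9], [9.9, 0.1], [10.1, 0.0]]
label1, centroids1 = k_means(target_data, n_sets=3)
```

The returned `label1` only identifies which first set each target data belongs to; it carries no category meaning, which is what the later steps supply.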
In the technical solution provided in step S204, the data differs according to the application field, and so do its category labels. The category labels of the data may be gender, age, height, weight, commodity type, commodity price, etc., and the same data may have one or more category labels; for example, the category labels of a certain data may be [commodity name, commodity type, commodity price, shelf time].
Alternatively, in this embodiment, the second sets may be sets obtained by clustering a plurality of sample data according to their sample category labels, and each second set may have a corresponding set category label therein.
As an alternative embodiment, determining the target set category label of the first set from the data category labels of the target data in the first set includes:
S11, establishing a confusion matrix according to the data category labels and the first sets, wherein the rows of the confusion matrix represent the identification of each first set, the columns represent the data category labels of the target data in each first set, and each data in the data area of the confusion matrix represents the proportion of the data category labels of the target data in the first set;
S12, determining a plurality of target data meeting a target condition in a plurality of columns of the data area of the confusion matrix, wherein the target condition is determined based on the ratio of the sample data contained in each second set, and the rows corresponding to any two target data are different;
S13, determining the data category label in the row corresponding to each piece of target data as the target set category label of the first set in the corresponding column.
Through the steps, the confusion matrix of the data category labels and the first set is established, a plurality of target data meeting target conditions are selected from each column of a confusion matrix data area, and the data category labels corresponding to the target data are used as the target set category labels of the first set, so that the target set category labels of the first set can be determined efficiently and accurately.
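As a non-limiting illustration of step S11, the following Python sketch builds the data area of such a confusion matrix from two parallel lists: the first set identifier of each target data and its data category label. The function name `confusion_proportions` and the example values are assumptions for this sketch. The result is stored as a nested dictionary keyed first by first set and then by data category label, so that it does not depend on which axis is drawn as rows or columns.

```python
from collections import Counter

def confusion_proportions(label1, label2):
    """Return {first_set_id: {data_category_label: proportion}}.

    label1[i] is the first set of target data i; label2[i] is its data
    category label. Each proportion is the share of that label within
    the given first set.
    """
    totals = Counter(label1)               # number of target data per first set
    counts = Counter(zip(label1, label2))  # co-occurrence counts
    categories = sorted(set(label2))
    return {
        s: {c: counts[(s, c)] / totals[s] for c in categories}
        for s in sorted(totals)
    }

# Hypothetical labels for eight target data.
label1 = [0, 0, 0, 0, 1, 1, 2, 2]
label2 = ["A", "A", "A", "B", "B", "B", "C", "A"]
matrix = confusion_proportions(label1, label2)
```

In this example, three quarters of the target data in first set 0 carry label "A", so `matrix[0]["A"]` is 0.75.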
As an alternative embodiment, determining the plurality of target data satisfying the target condition in a plurality of columns of the data area of the confusion matrix includes:
S21, determining a first proportion value of the sample data contained in each second set;
S22, determining a plurality of target data meeting a second proportion value in a plurality of columns of the data area of the confusion matrix, wherein the difference value between the second proportion value and the first proportion value is smaller than a set threshold value.
Optionally, in this embodiment, the first proportion value is the proportion of the sample data contained in each second set. For example, if there are 100 sample data and they are clustered into three second sets containing 70, 20, and 10 sample data respectively, the first proportion values are in the ratio 7:2:1.
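The 7:2:1 example can be reproduced with a short Python sketch. The function name `first_proportions` and the second set identifiers "s1" to "s3" are hypothetical and chosen only for illustration.

```python
from collections import Counter

def first_proportions(second_set_ids):
    """Share of the sample data that falls into each second set."""
    counts = Counter(second_set_ids)
    total = sum(counts.values())
    return {set_id: n / total for set_id, n in counts.items()}

# 100 sample data split 70 / 20 / 10 across three second sets.
second_set_ids = ["s1"] * 70 + ["s2"] * 20 + ["s3"] * 10
proportions = first_proportions(second_set_ids)
```

Here `proportions` holds the shares 0.7, 0.2, and 0.1, that is, the ratio 7:2:1.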
Alternatively, in the present embodiment, the threshold may be set to, but is not limited to, any value, such as 0.1, 0.01, 0.001, 0.2, or 0.02. When the difference between the second proportion value of the target data and the corresponding first proportion value is less than the set threshold, the plurality of target data determined for the current first sets may be considered to meet the target condition.
Alternatively, in the present embodiment, one target data is selected from each column, and when the ratio between the selected target data is closest to the first proportion values, the target data selected in the plurality of columns of the data area may be determined to be the data satisfying the target condition.
FIG. 3 is a schematic diagram of an alternative confusion matrix according to an embodiment of the present application. As shown in FIG. 3, the number of target sets is set to 3. The abscissa label1 represents the identifier of each first set, where label1-1 to label1-3 represent the 3 first sets obtained by clustering the plurality of target data according to the target set number 3. The ordinate label2 represents the data category label of each target data in each first set, determined according to the set category labels of the second sets, where label2-1 to label2-3 represent the 3 set category labels. Each data in the data area represents the proportion occupied by a data category label among the target data in a first set. For example, total(label1-1) represents the quantity of target data contained in the 1st first set obtained by clustering, and total(label1-2) and total(label1-3) have the corresponding meanings. The cell a at column label1-1 and row label2-1 represents the quantity of data in the 1st first set whose feature vector is nearest to the centroid vector of label2-1, so a/total(label1-1) is the proportion, within the 1st first set, of vectors whose nearest set category label by Euclidean distance is label2-1. According to the established confusion matrix, in addition to the longitudinal comparison (a/total(label1-1), b/total(label1-1), and so on), a transverse comparison (a/total(label1-1), f/total(label1-2), and so on) is also performed, ensuring that the second proportion values between the data determined in each column of the data area are close to the first proportion values of the sample data contained in the second sets. For example, if the plurality of sample data are clustered into 3 second sets in the ratio 7:2:1, and (label2-1)/(label1-1), (label2-2)/(label1-2), and (label2-3)/(label1-3) obtained by the above calculation are the best label2 options selected for the three label1 cluster categories, then the three proportion values are expected to be as close as possible to 7:2:1. Finally, the mutually exclusive label2-1, label2-2, and label2-3 are the target set category labels actually represented by the first sets label1-1, label1-2, and label1-3, respectively.
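As a non-limiting illustration of the selection described above, the following Python sketch tries every one-to-one assignment of data category labels to first sets and keeps the assignment whose second proportion values are closest to the first proportion values. This is only one possible reading. It assumes that the ratio between the selected cells is taken over the raw cell counts (a, f, and so on), which the description does not state explicitly. The function name `assign_set_labels`, the exhaustive search, and the counts are likewise assumptions for this example.

```python
from itertools import permutations

def assign_set_labels(counts, first_proportions):
    """Pick one distinct data category label for every first set.

    counts: {first_set_id: {label: number of target data with that label}}
    first_proportions: {label: share of sample data in that second set}
    Returns (mapping, gap), where gap is the largest difference between a
    second proportion value and the matching first proportion value.
    """
    set_ids = sorted(counts)
    labels = sorted(first_proportions)
    best, best_gap = None, float("inf")
    for perm in permutations(labels, len(set_ids)):
        picked = [counts[s][l] for s, l in zip(set_ids, perm)]
        total = sum(picked)
        if total == 0:
            continue  # nothing selected, so no ratio can be formed
        # Second proportion values: the selected cells rescaled to sum to 1.
        gap = max(abs(n / total - first_proportions[l])
                  for n, l in zip(picked, perm))
        if gap < best_gap:
            best, best_gap = dict(zip(set_ids, perm)), gap
    return best, best_gap

# Hypothetical cell counts for three first sets.
counts = {
    "label1-1": {"label2-1": 5, "label2-2": 18, "label2-3": 1},
    "label1-2": {"label2-1": 66, "label2-2": 3, "label2-3": 2},
    "label1-3": {"label2-1": 1, "label2-2": 0, "label2-3": 9},
}
first_proportions = {"label2-1": 0.7, "label2-2": 0.2, "label2-3": 0.1}
mapping, gap = assign_set_labels(counts, first_proportions)
threshold = 0.02               # assumed set threshold value
accepted = gap < threshold     # target condition from step S22
```

The exhaustive search grows factorially with N and is only practical for a small number of sets; for larger N an assignment solver would be a natural substitute.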
As an alternative embodiment, determining the data category label of the target data in the first set according to the set category label of the second set includes:
S31, determining the set category labels of each second set;
S32, calculating the similarity between each target data and the set category labels of a plurality of second sets, and determining the maximum similarity value in all the similarities;
S33, determining the set category label corresponding to the maximum similarity value as the data category label of the target data.
Optionally, in this embodiment, the method for determining the set category labels of the second sets may be, but is not limited to, calculating the centroid vector of the sample category labels of the sample data included in each second set. The centroid calculation differs for sample category labels of different feature dimensions, and the set category label of each second set is finally obtained according to the number of sample categories. For example, discrete values such as commodity name and commodity category may be one-hot encoded and the values normalized to [-1, 1], while continuous values such as commodity price and shelf time may be averaged. If the original data features are [commodity name, commodity category, commodity price, shelf time] and the sample category labels are divided into two classes according to commodity price, 100 or more and less than 100, then the set category labels may be [one-hot(commodity name), one-hot(commodity category), average(commodity price over all data with price < 100), average(shelf time)] and [one-hot(commodity name), one-hot(commodity category), average(commodity price over all data with price >= 100), average(shelf time)].
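As a non-limiting illustration, the following Python sketch computes such centroid vectors for two second sets split at a commodity price of 100. The helper names (`one_hot`, `scale`, `encode`, `centroid`), the reduced feature set (commodity category and price only), the price range used for scaling, and the sample values are all assumptions for this example. Applying the [-1, 1] normalization to the continuous price is one reading of the description, not a confirmed detail.

```python
def one_hot(value, vocabulary):
    """Encode a discrete value as a 0/1 vector over a fixed vocabulary."""
    return [1.0 if value == v else 0.0 for v in vocabulary]

def scale(value, low, high):
    """Min-max scale a continuous value to [-1, 1]; assumes high > low."""
    return 2.0 * (value - low) / (high - low) - 1.0

def encode(sample, categories, price_range):
    """Feature vector: one-hot category followed by the scaled price."""
    return one_hot(sample["category"], categories) + [
        scale(sample["price"], *price_range)
    ]

def centroid(vectors):
    """Per-dimension mean of a non-empty list of equal-length vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]

# Hypothetical sample data with a discrete and a continuous feature.
samples = [
    {"category": "book", "price": 20.0},
    {"category": "book", "price": 60.0},
    {"category": "toy", "price": 150.0},
    {"category": "toy", "price": 250.0},
]
categories = ["book", "toy"]
price_range = (0.0, 300.0)

# Two second sets, split by the sample category label "price below 100".
second_sets = {"below_100": [], "at_least_100": []}
for s in samples:
    key = "below_100" if s["price"] < 100 else "at_least_100"
    second_sets[key].append(encode(s, categories, price_range))

# The centroid of each second set serves as its set category label.
set_category_labels = {k: centroid(v) for k, v in second_sets.items()}
```

Averaging one-hot vectors gives, for each discrete value, the fraction of the set that carries it, so discrete and continuous features end up in one comparable vector.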
Alternatively, in the present embodiment, the method for calculating the similarity between the target data and the set category label may include, but is not limited to, the Euclidean distance method, the Manhattan distance method, and the Mahalanobis distance method. The calculation method is not limited as long as the similarity can be calculated.
Through the above steps, the similarity between the target data and each set category label is calculated, and the set category label with the maximum similarity to the target data is used as the data category label of the target data, so that the accuracy of the data category label of the target data is improved.
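As a non-limiting illustration of steps S32 and S33, the following Python sketch assigns to a target data the set category label whose centroid vector is nearest, with the distance measure passed in as a parameter. The function names and the example vectors are assumptions for this sketch. Choosing the smallest distance is used here as the equivalent of choosing the largest similarity.

```python
import math

def euclidean(x, y):
    """Straight-line distance between two equal-length vectors."""
    return math.dist(x, y)

def manhattan(x, y):
    """Sum of absolute per-dimension differences."""
    return sum(abs(a - b) for a, b in zip(x, y))

def data_category_label(target, set_category_labels, distance=euclidean):
    """Return the name of the set category label nearest to `target`.

    set_category_labels: {label_name: centroid_vector}
    """
    return min(set_category_labels,
               key=lambda name: distance(target, set_category_labels[name]))

# Hypothetical centroid vectors of two second sets.
set_category_labels = {"label2-1": [0.0, 0.0], "label2-2": [10.0, 0.0]}
target = [2.0, 1.0]
label_euclidean = data_category_label(target, set_category_labels)
label_manhattan = data_category_label(target, set_category_labels, manhattan)
```

In this example both distance measures select "label2-1", because the target lies much closer to the first centroid.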
As an alternative embodiment, determining the set category labels for each of the second sets comprises:
S41, performing feature analysis on a plurality of sample data according to a target set number N to obtain N second sets, wherein N is a positive integer greater than 2;
S42, determining the set category label of the second set according to the sample category labels of the sample data in the second set.
Optionally, in this embodiment, feature engineering is performed according to features of sample data, so that a plurality of sample data may be divided into N data sets according to the target set number N.
As an alternative embodiment, calculating the similarity between each of the target data and the set category labels of the plurality of the second sets includes:
S51, calculating the Euclidean distance between each target data and each set category label;
S52, determining the similarity of each target data and each set category label based on the Euclidean distance, wherein the smaller the Euclidean distance is, the larger the similarity determined based on the Euclidean distance is.
Optionally, in this embodiment, the similarity between the target data and the set category label is calculated by the Euclidean distance method, with the following calculation formula:

$$d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$$

In the formula, $n$ represents the dimension of the feature vector of the data, $x_i$ represents the value of the $i$-th dimension of one data (which may be either the target data or the set category label), and $y_i$ represents the value of the $i$-th dimension of the other data (which may be either the target data or the set category label).
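As a non-limiting illustration of steps S51 and S52, the following Python sketch computes the Euclidean distance and converts it into a similarity. The description only requires that a smaller distance give a larger similarity. The mapping 1 / (1 + d) is an assumption chosen for this example, and other decreasing functions would serve equally well.

```python
import math

def euclidean_distance(x, y):
    """Square root of the summed squared per-dimension differences."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same dimension")
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

def similarity(x, y):
    """Assumed mapping: 1 for identical vectors, falling toward 0 with distance."""
    return 1.0 / (1.0 + euclidean_distance(x, y))

d = euclidean_distance([0.0, 0.0], [3.0, 4.0])  # 3-4-5 triangle
s_far = similarity([0.0, 0.0], [3.0, 4.0])
s_same = similarity([1.0, 2.0], [1.0, 2.0])
```

With these inputs the distance is 5, the similarity of the distant pair is 1/6, and the similarity of identical vectors is 1.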
The present application also provides an alternative embodiment. FIG. 4 is a flowchart of an alternative method of determining a target set category label according to an embodiment of the present application. As shown in FIG. 4:
S401, using a plurality of historical data with category labels as sample data, performing feature engineering on the plurality of sample data so as to perform feature analysis on the sample data.
S402, classifying the sample data after feature analysis according to the preset target set number to obtain a plurality of cluster sets, wherein the cluster sets do not have set category labels.
S403, clustering the target data according to the preset target set number by using a clustering algorithm.
S404, obtaining a plurality of clustered target data sets, wherein the target data sets do not have set category labels, and only the set label label1, which indicates the target data set to which each target data belongs, can be obtained.
S405, calculating the centroid vector of each cluster set according to the sample category labels of the sample data contained in the cluster sets in the step S402, thereby obtaining the set category labels of the respective sets.
S406, calculating the Euclidean distance between each item of label data and the centroid vector of each cluster set according to the sample data, and marking the set category label corresponding to the centroid vector with the smallest Euclidean distance as the sample category label label2 of the sample data.
S407, constructing a confusion matrix formed by label1 and label2.
S408, determining the actual meaning of each set label1 according to the analysis result in the confusion matrix.
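Steps S405 to S408 can be sketched as below, under several stated assumptions: the cluster ids `label1` are taken as already produced by S403 and S404; each sample cluster set is approximated by grouping the samples that share a category label; label2 is assigned to each item of target data; and S408 is simplified to reading off the most frequent label2 in each cluster. All names and data are hypothetical.

```python
import math
from collections import Counter, defaultdict


def centroids_by_label(samples, sample_labels):
    """S405 (assumed form): mean feature vector of the samples sharing a label."""
    sums = {}
    counts = Counter()
    for vec, label in zip(samples, sample_labels):
        acc = sums.setdefault(label, [0.0] * len(vec))
        for i, value in enumerate(vec):
            acc[i] += value
        counts[label] += 1
    return {label: [s / counts[label] for s in acc] for label, acc in sums.items()}


def nearest_label(vec, centroids):
    """S406: label whose centroid has the smallest Euclidean distance to vec."""
    return min(centroids, key=lambda label: math.dist(vec, centroids[label]))


def confusion_counts(label1, label2):
    """S407: counts[cluster_id][label] = number of items with that pair."""
    counts = defaultdict(Counter)
    for cluster_id, label in zip(label1, label2):
        counts[cluster_id][label] += 1
    return counts


# Toy data: labeled samples, unlabeled targets, and assumed cluster ids.
samples = [[0, 0], [0, 2], [10, 10], [10, 12], [20, 0], [22, 0]]
sample_labels = ["A", "A", "B", "B", "C", "C"]
targets = [[1, 1], [0, 0], [9, 11], [11, 10], [20, 1], [19, 0]]
label1 = [2, 2, 0, 0, 1, 1]  # assumed output of S403-S404

centroids = centroids_by_label(samples, sample_labels)
label2 = [nearest_label(t, centroids) for t in targets]
matrix = confusion_counts(label1, label2)
# S408 (simplified): each cluster takes its most frequent label2.
meaning = {cid: row.most_common(1)[0][0] for cid, row in matrix.items()}
```

The cluster ids in `label1` are arbitrary integers; the confusion matrix is what ties each of them to a meaningful category.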
It should be noted that, for simplicity of description, the foregoing method embodiments are all expressed as a series of action combinations, but it should be understood by those skilled in the art that the present application is not limited by the order of actions described, as some steps may be performed in other order or simultaneously in accordance with the present application. Further, those skilled in the art will also appreciate that the embodiments described in the specification are all preferred embodiments, and that the acts and modules referred to are not necessarily required in the present application.
From the description of the above embodiments, it will be clear to a person skilled in the art that the method according to the above embodiments may be implemented by means of software plus the necessary general hardware platform, but of course also by means of hardware, but in many cases the former is a preferred embodiment. Based on such understanding, the technical solution of the present application may be embodied essentially or in a part contributing to the prior art in the form of a software product stored in a storage medium (such as ROM/RAM, magnetic disk, optical disk), comprising several instructions for causing a terminal device (which may be a mobile phone, a computer, a server, or a network device, etc.) to perform the method described in the embodiments of the present application.
According to another aspect of the embodiments of the present application, there is also provided a target set category label determining apparatus for implementing the method for determining a target set category label described above. Fig. 5 is a schematic diagram of an alternative target set category label determining apparatus according to an embodiment of the present application. As shown in Fig. 5, the apparatus may include:
a clustering module 52, configured to cluster the plurality of target data according to the number N of target sets to obtain N first sets, where N is a positive integer greater than 2;
a first determining module 54, configured to determine a data category label of the target data in the first set according to a set category label of a second set, where the second set includes a plurality of sample data, and the sample data is data with a sample category label;
a second determining module 56, configured to determine a target set category label of the first set according to the data category labels of the target data in the first set.
It should be noted that the clustering module 52 in this embodiment may be used to perform step S202 in the embodiment of the present application, the first determining module 54 in this embodiment may be used to perform step S204 in the embodiment of the present application, and the second determining module 56 in this embodiment may be used to perform step S206 in the embodiment of the present application.
It should be noted that the examples and application scenarios implemented by the above modules are the same as those of the corresponding steps, but are not limited to what is disclosed in the above embodiments. It should also be noted that the above modules, as part of the apparatus, may run in the hardware environment shown in fig. 1 and may be implemented in software or in hardware.
Through the above modules, the technical problem of the low accuracy with which a clustering algorithm determines cluster categories can be solved, and the technical effect of improving that accuracy is achieved.
As an alternative embodiment, the second determining module includes: a processing unit, configured to establish a confusion matrix according to the data category labels and the first sets, where data in a row of the confusion matrix represents an identifier of each first set, data in a column represents the data category labels of the target data in each first set, and each data in the data area of the confusion matrix represents the proportion occupied by a data category label among the target data in a first set; a first determining unit, configured to determine, in a plurality of columns of the data area of the confusion matrix, a plurality of target data that satisfy a target condition, where the target condition is determined based on a ratio of the sample data included in each second set, and the rows corresponding to any two target data are different; and a second determining unit, configured to determine the data category label in the row corresponding to each piece of target data as the target set category label of the first set in the corresponding column.
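A proportion-valued confusion matrix of this kind can be sketched as follows. The layout is an interpretation of the text rather than a confirmed one: the header row lists the first-set identifiers (one column per first set), the header column lists the data category labels (one row per label), and each cell holds the share of a first set's target data that carries that label. The function name and data are hypothetical.

```python
from collections import Counter


def proportion_matrix(first_set_ids, data_labels):
    """matrix[label][set_id] = share of the set's target data carrying label."""
    set_sizes = Counter(first_set_ids)
    pair_counts = Counter(zip(data_labels, first_set_ids))
    labels = sorted(set(data_labels))
    set_ids = sorted(set_sizes)
    return {
        label: {s: pair_counts[(label, s)] / set_sizes[s] for s in set_ids}
        for label in labels
    }


# Toy input: 10 target data, their first-set ids and data category labels.
first_set_ids = [0, 0, 0, 0, 1, 1, 2, 2, 2, 2]
data_labels = ["A", "A", "A", "B", "B", "B", "C", "C", "C", "A"]
matrix = proportion_matrix(first_set_ids, data_labels)
```

With this layout every column sums to 1, so the cells of one first set can be compared directly with the cells of another even when the sets differ in size.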
As an alternative embodiment, the first determining unit comprises: determining a first proportion value of the sample data contained in each of the second sets; and determining a plurality of target data meeting a second proportion value in a plurality of columns of the data area of the confusion matrix, wherein the difference value between the second proportion value and the first proportion value is smaller than a set threshold value.
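The threshold test can be sketched as below. It assumes a matrix indexed as `matrix[label][set_id]` and a hypothetical `first_proportions` dictionary holding one reference proportion per label derived from the second sets; how that reference value is computed is not fixed by the text. The greedy selection that keeps rows and columns distinct is likewise an illustrative choice.

```python
def match_sets(matrix, first_proportions, threshold):
    """Return {first_set_id: label} for cells within threshold of the reference."""
    candidates = []
    for label, row in matrix.items():
        for set_id, second_value in row.items():
            diff = abs(second_value - first_proportions[label])
            if diff < threshold:
                candidates.append((diff, label, set_id))
    assigned = {}
    used_labels = set()
    # Closest matches first; each label (row) and each set (column) is used once.
    for diff, label, set_id in sorted(candidates):
        if label not in used_labels and set_id not in assigned:
            assigned[set_id] = label
            used_labels.add(label)
    return assigned


# Hypothetical proportion matrix and reference proportions.
matrix = {
    "A": {0: 0.75, 1: 0.0, 2: 0.25},
    "B": {0: 0.25, 1: 1.0, 2: 0.0},
    "C": {0: 0.0, 1: 0.0, 2: 0.75},
}
first_proportions = {"A": 0.8, "B": 0.9, "C": 0.7}
target_set_labels = match_sets(matrix, first_proportions, threshold=0.15)
```

A first set with no cell inside the threshold is simply left unassigned in this sketch, which makes a poorly separated cluster visible instead of forcing a label onto it.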
As an alternative embodiment, the first determining module includes: a third determining unit, configured to determine the set category label of each of the second sets; a computing unit, configured to calculate the similarity between each target data and the set category labels of the plurality of second sets, and to determine the maximum similarity value among all the similarities; and a fourth determining unit, configured to determine the set category label corresponding to the maximum similarity value as the data category label of the target data.
As an alternative embodiment, the third determining unit comprises: performing feature analysis on a plurality of sample data according to a target set number N to obtain N second sets, wherein N is a positive integer greater than 2; determining the set category labels of the second set according to the sample category labels of the sample data in the second set.
As an alternative embodiment, the computing unit comprises: calculating Euclidean distance between each target data and each set category label; and determining the similarity between each target data and each set category label based on the Euclidean distance, wherein the smaller the Euclidean distance is, the larger the similarity determined based on the Euclidean distance is.
It should be noted that the examples and application scenarios implemented by the above modules are the same as those of the corresponding steps, but are not limited to what is disclosed in the above embodiments. It should also be noted that the above modules, as part of the apparatus, may run in the hardware environment shown in fig. 1 and may be implemented in software or in hardware, where the hardware environment includes a network environment.
According to another aspect of the embodiments of the present application, there is also provided a server or a terminal for implementing the method for determining a target set category label.
Fig. 6 is a block diagram of a terminal according to an embodiment of the present application. As shown in Fig. 6, the terminal may include one or more processors 601 (only one is shown in the figure), a memory 603, and a transmission device 605. The terminal may further include an input/output device 607.
The memory 603 may be configured to store software programs and modules, such as program instructions/modules corresponding to the method and apparatus for determining a target set class label in the embodiment of the present application, and the processor 601 executes the software programs and modules stored in the memory 603, thereby performing various functional applications and data processing, that is, implementing the method for determining a target set class label. Memory 603 may include high-speed random access memory, and may also include non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid state memory. In some examples, the memory 603 may further include memory remotely located with respect to the processor 601, which may be connected to the terminal through a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The transmission device 605 is used to receive or transmit data via a network, and may also be used for data transmission between the processor and the memory. Specific examples of the network described above may include wired networks and wireless networks. In one example, the transmission device 605 includes a network adapter (Network Interface Controller, NIC) that may be connected to other network devices and routers via a network cable to communicate with the internet or a local area network. In one example, the transmission device 605 is a Radio Frequency (RF) module that is configured to communicate wirelessly with the internet.
In particular, the memory 603 is used to store applications.
The processor 601 may call an application program stored in the memory 603 through the transmission device 605 to perform the steps of: clustering a plurality of target data according to the number N of target sets to obtain N first sets, wherein N is a positive integer greater than 2; determining a data category label of the target data in the first set according to a set category label of a second set, wherein the second set comprises a plurality of sample data, and the sample data is data with a sample category label; and determining a target set category label of the first set according to the data category label of the target data in the first set.
The embodiments of the present application provide a method and a device for determining a target set category label. In this method, the target data to be clustered is first clustered according to a preset number of sets to obtain a plurality of first sets, which at this point carry no set category labels. Sample data carrying category label information is used to form second sets; because the sample data in a second set carries sample category labels, the set category label of that second set can be determined. The data category labels of the target data in the first sets can then be determined from the labeled second sets, and from these the target set category label of each first set obtained by clustering can be determined. This achieves the purpose of determining the target set category labels of all sets obtained by clustering the target data, improves the accuracy with which a clustering algorithm determines cluster categories, and thereby solves the technical problem of the low accuracy of cluster category results determined by a clustering algorithm.
Alternatively, for specific examples in this embodiment, reference may be made to the examples described in the foregoing embodiments, and details are not repeated here.
It will be appreciated by those skilled in the art that the structure shown in fig. 6 is only illustrative, and the terminal may be a smart phone (such as an Android phone, an iOS phone, etc.), a tablet computer, a palmtop computer, a mobile internet device (Mobile Internet Devices, MID), a PAD, etc. Fig. 6 does not limit the structure of the electronic device. For example, the terminal may also include more or fewer components (e.g., network interfaces, display devices, etc.) than shown in fig. 6, or have a different configuration than shown in fig. 6.
Those of ordinary skill in the art will appreciate that all or part of the steps in the various methods of the above embodiments may be implemented by a program instructing hardware associated with a terminal device. The program may be stored in a computer-readable storage medium, and the storage medium may include: a flash disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, an optical disk, and the like.
Embodiments of the present application also provide a storage medium. Alternatively, in the present embodiment, the above-described storage medium may be used to store program code for executing the method for determining the target set category label.
Alternatively, in this embodiment, the storage medium may be located on at least one network device of the plurality of network devices in the network shown in the above embodiment.
Alternatively, in the present embodiment, the storage medium is configured to store program code for performing the steps of: clustering a plurality of target data according to the number N of target sets to obtain N first sets, wherein N is a positive integer greater than 2; determining a data category label of the target data in the first set according to a set category label of a second set, wherein the second set comprises a plurality of sample data, and the sample data is data with a sample category label; and determining a target set category label of the first set according to the data category label of the target data in the first set.
Alternatively, specific examples in this embodiment may refer to examples described in the foregoing embodiments, and this embodiment is not described herein.
Alternatively, in the present embodiment, the storage medium may include, but is not limited to: a USB flash drive (U-disk), a Read-Only Memory (ROM), a Random Access Memory (RAM), a removable hard disk, a magnetic disk, an optical disk, or other various media capable of storing program code.
The foregoing embodiment numbers of the present application are merely for description, and do not represent advantages or disadvantages of the embodiments.
The integrated units in the above embodiments may be stored in the above-described computer-readable storage medium if implemented in the form of software functional units and sold or used as separate products. Based on such understanding, the technical solution of the present application may be embodied in essence or a part contributing to the prior art or all or part of the technical solution in the form of a software product stored in a storage medium, including several instructions to cause one or more computer devices (which may be personal computers, servers or network devices, etc.) to perform all or part of the steps of the methods described in the various embodiments of the present application.
In the foregoing embodiments of the present application, the description of each embodiment has its own emphasis; for a portion that is not described in detail in one embodiment, reference may be made to the related descriptions of other embodiments.
In several embodiments provided in the present application, it should be understood that the disclosed client may be implemented in other manners. The above-described apparatus embodiments are merely exemplary. For example, the division of the units is merely a logical function division, and may be implemented in another manner: multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. In addition, the coupling, direct coupling, or communication connection shown or discussed may be through some interfaces, units, or modules, and may be in electrical or other forms.
The units described as separate units may or may not be physically separate, and units shown as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional unit in each embodiment of the present application may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in software functional units.
The foregoing is merely a preferred embodiment of the present application. It should be noted that those skilled in the art may make modifications and adaptations without departing from the principles of the present application, and such modifications and adaptations are intended to fall within the scope of the present application.

Claims (7)

1. A method for determining a target set class label, comprising:
clustering a plurality of target data according to the number N of target sets to obtain N first sets, wherein N is a positive integer greater than 2;
determining a data category label of the target data in the first set according to a set category label of a second set, wherein the second set comprises a plurality of sample data, the sample data is data with a sample category label, and the data category label comprises: gender, age, height, weight, commodity type, commodity price, commodity name and time to shelf;
determining a target set category label for the first set from the data category labels for the target data in the first set includes: establishing a confusion matrix according to the data category labels and the first sets, wherein data in rows of the confusion matrix represent identifications of the first sets, data in columns represent the data category labels of the target data in the first sets, and data in data areas of the confusion matrix represent proportions of the data category labels of the target data in the first sets; determining a plurality of target data meeting a target condition in a plurality of columns of the data area of the confusion matrix, wherein the target condition is determined based on the ratio of the sample data contained in each second set, and the rows corresponding to any two target data are different; determining the data category label in the row corresponding to each piece of target data as the target set category label of the first set in the corresponding column, wherein the target set category label comprises: commodity names of commodities for recommendation to a user;
determining a plurality of the target data satisfying the target condition in a plurality of columns of the data area of the confusion matrix includes: determining a first proportion value of the sample data contained in each of the second sets; and determining a plurality of target data meeting a second proportion value in a plurality of columns of the data area of the confusion matrix, wherein the difference value between the second proportion value and the first proportion value is smaller than a set threshold value.
2. The method of claim 1, wherein determining the data category labels of the target data in the first set from the set category labels of the second set comprises:
determining the set category labels for each of the second sets;
calculating the similarity between each target data and the set category labels of a plurality of second sets, and determining the maximum similarity value in all the similarities;
and determining the set category label corresponding to the maximum similarity value as the data category label of the target data.
3. The method of claim 2, wherein determining the set category labels for each of the second sets comprises:
performing feature analysis on a plurality of sample data according to a target set number N to obtain N second sets, wherein N is a positive integer greater than 2;
determining the set category labels of the second set according to the sample category labels of the sample data in the second set.
4. The method of claim 2, wherein calculating a similarity between each of the target data and the set category labels of a plurality of the second sets comprises:
calculating Euclidean distance between each target data and each set category label;
and determining the similarity between each target data and each set category label based on the Euclidean distance, wherein the smaller the Euclidean distance is, the larger the similarity determined based on the Euclidean distance is.
5. A target set class label determining apparatus, comprising:
the clustering module is used for clustering the plurality of target data according to the number N of the target sets to obtain N first sets, wherein N is a positive integer greater than 2;
the first determining module is configured to determine a data category label of the target data in the first set according to a set category label of a second set, where the second set includes a plurality of sample data, the sample data is data with a sample category label, and the data category label includes: gender, age, height, weight, commodity type, commodity price, commodity name and time to shelf;
a second determining module, configured to determine a target set category label of the first set according to the data category label of the target data in the first set, where the target set category label includes: commodity names of commodities for recommendation to a user;
the second determining module includes:
a processing unit, configured to establish a confusion matrix according to the data category labels and the first sets, where data in a row of the confusion matrix represents an identifier of each first set, data in a column represents the data category labels of each target data in each first set, and each data in a data area of the confusion matrix represents a proportion occupied by the data category labels of each target data in the first set;
a first determining unit, configured to determine, in a plurality of columns of the data area of the confusion matrix, a plurality of target data that satisfy a target condition, where the target condition is determined based on a ratio of the sample data included in each second set, and rows corresponding to any two target data are different; determining a plurality of target data satisfying a target condition in a plurality of columns of the data area of the confusion matrix includes: determining a first proportion value of the sample data contained in each of the second sets; determining a plurality of target data meeting a second proportion value in a plurality of columns of the data area of the confusion matrix, wherein the difference value between the second proportion value and the first proportion value is smaller than a set threshold value;
and a second determining unit, configured to determine the data category label in the row corresponding to each piece of target data as the target set category label of the first set in the corresponding column.
6. A storage medium comprising a stored program, wherein the program when run performs the method of any one of the preceding claims 1 to 4.
7. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor performs the method of any of the preceding claims 1 to 4 by means of the computer program.
CN202011203745.XA 2020-11-02 2020-11-02 Method and device for determining target set category label Active CN112329838B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011203745.XA CN112329838B (en) 2020-11-02 2020-11-02 Method and device for determining target set category label

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011203745.XA CN112329838B (en) 2020-11-02 2020-11-02 Method and device for determining target set category label

Publications (2)

Publication Number Publication Date
CN112329838A CN112329838A (en) 2021-02-05
CN112329838B true CN112329838B (en) 2024-02-02

Family

ID=74324120

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011203745.XA Active CN112329838B (en) 2020-11-02 2020-11-02 Method and device for determining target set category label

Country Status (1)

Country Link
CN (1) CN112329838B (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant