CN107291722B

CN107291722B - Descriptor classification method and device

Info

Publication number: CN107291722B
Application number: CN201610195403.5A
Authority: CN
Inventors: 吴美玲
Original assignee: Alibaba Group Holding Ltd
Current assignee: Alibaba Damo Institute Hangzhou Technology Co Ltd
Priority date: 2016-03-30
Filing date: 2016-03-30
Publication date: 2020-12-04
Anticipated expiration: 2036-03-30
Also published as: CN107291722A

Abstract

A descriptor classification method and device. A classification model is trained on the feature data of descriptor samples and the categories corresponding to those samples; the descriptors to be classified are then classified based on the classification model, the descriptor sample set is updated according to the classification result, the classification model is updated based on the updated descriptor sample set, and the descriptors to be classified in the descriptor set to be classified are classified based on the updated classification model. In other words, the most informative descriptors are selected, through loop iteration, from a large number of unclassified descriptors, labeled automatically, and added to the existing descriptor sample set. This expands the training set of the classification model and improves its robustness and classification precision, so that the accuracy of the descriptor classification result is improved while human effort is saved.

Description

Descriptor classification method and device

Technical Field

The application relates to the technical field of data processing, in particular to a descriptor classification method and device.

Background

With the continuous development of e-commerce technology, the descriptors used on network platforms to describe attribute features of commodity objects, such as brands and categories, have become increasingly numerous, diverse, and complex. This makes it difficult for users to find high-quality descriptors and degrades the user experience.

To solve this problem, the industry commonly uses the following methods to determine high-quality descriptors and push them to users, helping users find them quickly and improving the user experience:

The first method: high-quality descriptors are identified and selected manually. For example, for descriptors such as brands, a brand operator may manually select high-quality brands based on experience and push them to users.

However, selecting high-quality descriptors in this way often requires many operators and therefore consumes a large amount of labor. In addition, because the selection relies mainly on the operators' experience, errors of judgment are inevitable; selection efficiency is low, the selected high-quality descriptors lack richness and accuracy, and user needs cannot be met.

The second method: a classification model is built from a small number of accumulated, labeled high-quality descriptor samples and is used to classify unlabeled descriptors so as to determine the high-quality descriptors.

Selecting high-quality descriptors in this way improves selection efficiency to some degree and saves labor cost. However, when the classification model is built, labeled high-quality descriptor samples make up only a small proportion of all samples (about 0.1%), so the resulting classification model has poor robustness and accuracy, and the high-quality descriptors obtained from it are of low accuracy.

That is, existing methods for determining high-quality descriptors, i.e., existing descriptor classification methods, produce results that are inaccurate to some extent.

Disclosure of Invention

The embodiment of the application provides a descriptor classification method and device, which are used for solving the problem that the classification result of the conventional descriptor classification method is inaccurate.

The embodiment of the application provides a descriptor classification method, which comprises the following steps:

determining a descriptor set to be classified and feature data of each descriptor to be classified in the descriptor set to be classified;

classifying the descriptors to be classified in the descriptor set to be classified based on a set classification model, and predicting the category of each descriptor to be classified; the set classification model is obtained by training according to the feature data of each descriptor sample in the descriptor sample set and the category corresponding to each descriptor sample;

based on the prediction result, screening out from the descriptor set to be classified the descriptors that meet the following condition: the predicted category of the descriptor is consistent with the category of the descriptor sample that has the shortest distance to the descriptor;

adding the screened descriptors to the descriptor sample set, with the category predicted this time as the corresponding category, to obtain an updated descriptor sample set, and deleting the screened descriptors from the descriptor set to be classified to obtain an updated descriptor set to be classified;

updating the set classification model based on the updated descriptor sample set; and classifying the descriptors to be classified in the descriptor set to be classified based on the updated classification model.

Correspondingly, the embodiment of the application also provides a descriptor classification device, which comprises:

the data acquisition module is used for determining a descriptor set to be classified and the feature data of each descriptor to be classified in the descriptor set to be classified;

the classification module is used for classifying the descriptors to be classified in the descriptor set to be classified based on a set classification model and predicting the category of each descriptor to be classified, the set classification model being obtained by training according to the feature data of each descriptor sample in the descriptor sample set and the category corresponding to each descriptor sample; and

based on the prediction result, screening out from the descriptor set to be classified the descriptors that meet the following condition: the predicted category of the descriptor is consistent with the category of the descriptor sample that has the shortest distance to the descriptor; and

adding the screened descriptors to the descriptor sample set, with the category predicted this time as the corresponding category, to obtain an updated descriptor sample set, and deleting the screened descriptors from the descriptor set to be classified to obtain an updated descriptor set to be classified; and

updating the set classification model based on the updated descriptor sample set, and classifying the descriptors to be classified in the descriptor set to be classified based on the updated classification model.

The beneficial effect of this application is as follows:

The embodiment of the application provides a descriptor classification method and device in which a classification model is trained on the feature data of descriptor samples and the categories corresponding to those samples; the descriptors to be classified are then classified based on the classification model, the descriptor sample set is updated according to the classification result, the classification model is updated based on the updated descriptor sample set, and the descriptors to be classified in the descriptor set to be classified are classified based on the updated classification model. In other words, the most informative descriptors are selected, through loop iteration, from a large number of unclassified descriptors, labeled automatically, and added to the existing descriptor sample set. This expands the training set of the classification model and improves its robustness and classification precision, so that the accuracy of the descriptor classification result is improved while human effort is saved.

Drawings

In order to illustrate the technical solutions in the embodiments of the present application more clearly, the drawings used in the description of the embodiments are briefly introduced below. The drawings described below show only some embodiments of the present application; those skilled in the art can obtain other drawings from them without creative effort.

FIG. 1 is a flowchart illustrating steps of a descriptor classification method according to a first embodiment of the present application;

FIG. 2 is a schematic structural diagram of a descriptor classification device according to a second embodiment of the present application.

Detailed Description

In order to make the objects, technical solutions and advantages of the present application clearer, the present application will be described in further detail with reference to the accompanying drawings, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.

The first embodiment is as follows:

The first embodiment of the present application provides a descriptor classification method. FIG. 1 is a flowchart of the steps of the method, which may include the following steps:

step 101: determining a descriptor set to be classified and feature data of each descriptor to be classified in the descriptor set to be classified;

step 102: classifying the descriptors to be classified in the descriptor set to be classified based on a set classification model, and predicting the category of each descriptor to be classified; the set classification model is obtained by training according to the feature data of each descriptor sample in the descriptor sample set and the category corresponding to each descriptor sample;

step 103: based on the prediction result, screening out from the descriptor set to be classified the descriptors that meet the following condition: the predicted category of the descriptor is consistent with the category of the descriptor sample that has the shortest distance to the descriptor;

step 104: adding the screened descriptors to the descriptor sample set, with the category predicted this time as the corresponding category, to obtain an updated descriptor sample set, and deleting the screened descriptors from the descriptor set to be classified to obtain an updated descriptor set to be classified;

step 105: updating the set classification model based on the updated descriptor sample set; and classifying the descriptors to be classified in the descriptor set to be classified based on the updated classification model.

That is to say, in the descriptor classification method provided in this embodiment, a classification model is trained on the feature data of each descriptor sample and the category corresponding to each descriptor sample; the descriptors to be classified are then classified based on the classification model, the descriptor sample set is updated according to the classification result, the classification model is updated based on the updated descriptor sample set, and the descriptors to be classified in the descriptor set to be classified are classified based on the updated classification model. In other words, the most informative descriptors are selected, through loop iteration, from a large number of unclassified descriptors, labeled automatically, and added to the existing descriptor sample set, which expands the training set of the classification model and improves its robustness and classification precision, so that the accuracy of the descriptor classification result is improved while human effort is saved.
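One round of steps 102 to 105 might be sketched as follows. This is only an illustration under stated assumptions, not the claimed implementation: a nearest-centroid classifier stands in for the set classification model (the description later names SVM as one option), Euclidean distance is assumed, and every function and variable name is hypothetical.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def train(samples):
    """Stand-in model (assumption): one centroid per category."""
    sums, counts = {}, {}
    for features, category in samples:
        acc = sums.setdefault(category, [0.0] * len(features))
        for i, value in enumerate(features):
            acc[i] += value
        counts[category] = counts.get(category, 0) + 1
    return {c: [v / counts[c] for v in acc] for c, acc in sums.items()}

def predict(model, features):
    return min(model, key=lambda c: euclidean(model[c], features))

def self_training_round(samples, unlabeled):
    """samples: list of (features, category); unlabeled: list of features."""
    model = train(samples)
    accepted, remaining = [], []
    for features in unlabeled:
        predicted = predict(model, features)                       # step 102
        nearest = min(samples, key=lambda s: euclidean(s[0], features))
        if nearest[1] == predicted:                                # step 103
            accepted.append((features, predicted))
        else:
            remaining.append(features)
    updated_samples = samples + accepted                           # step 104
    return updated_samples, remaining, train(updated_samples)      # step 105

samples = [([0.0, 0.0], "low"), ([1.0, 1.0], "high")]
unlabeled = [[0.1, 0.2], [0.9, 0.8]]
updated, remaining, model = self_training_round(samples, unlabeled)
```

In this toy run both unlabeled descriptors agree with their nearest sample, so both move into the sample set and the model is retrained on four samples.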

The individual steps of the method will now be described in detail:

Optionally, in step 101, each descriptor whose category has not yet been determined may be taken as a descriptor to be classified, yielding the descriptor set to be classified; the feature data of each descriptor to be classified in that set may then be determined.

The feature data of each descriptor may include first feature data characterizing attributes of the descriptor itself and second feature data characterizing features of the users associated with the descriptor.

That is to say, in this embodiment, descriptor classification may consider not only the features of the descriptors themselves but also the features of the users associated with them, which widens the dimensions from which descriptor feature data is drawn and improves the accuracy of subsequent descriptor classification.

Further, the first feature data of each descriptor may include any one or more of a traffic path source ratio, a tonality feature, a quality feature, a qualification feature, a popularity feature, and a price hierarchy feature of the descriptor.

For any descriptor, the traffic path source ratio of the descriptor may be obtained as follows:

counting, per set time period (which can be set flexibly according to actual conditions, for example half a month, three months, six months, or one year), the ratio of the active traffic to the commodity object pages under the descriptor (for example, traffic arising when a user actively searches for a commodity object, collects it, or adds it to a shopping cart) to the passive traffic to those pages (for example, traffic arising from commodity object promotion activities), or the ratio of the passive traffic to the active traffic, and the like.
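As a rough sketch of the traffic path source ratio, the snippet below counts page visits for one descriptor within one set time period and divides active by passive traffic. The source labels and the handling of zero passive traffic are assumptions for illustration; the text does not specify them.

```python
# Assumed labels for how a visit reached a commodity object page.
ACTIVE_SOURCES = {"search", "collect", "add_to_cart"}   # user-initiated traffic
PASSIVE_SOURCES = {"promotion"}                          # promotion-driven traffic

def traffic_source_ratio(visit_sources):
    """Active-to-passive traffic ratio for one descriptor in one time period."""
    active = sum(1 for s in visit_sources if s in ACTIVE_SOURCES)
    passive = sum(1 for s in visit_sources if s in PASSIVE_SOURCES)
    # The ratio is undefined without passive traffic; returning None is an assumption.
    return active / passive if passive else None

visits = ["search", "promotion", "collect", "add_to_cart", "promotion", "search"]
ratio = traffic_source_ratio(visits)
```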

In addition, for any descriptor, the tonality feature, quality feature, qualification feature, popularity feature, or price hierarchy feature may be obtained by statistically analyzing, over a set time period (e.g., half a month, three months, six months, or one year), data on the descriptor's tonality (i.e., users' evaluation or overall impression of the descriptor or of the commodity objects under it), quality (i.e., the quality of the commodity objects under the descriptor), qualification (i.e., the eligibility or legitimacy of the descriptor), popularity (i.e., how recognizable the descriptor is), or covered price range; details are not repeated here.

Further, the second feature data of each descriptor may include any one or more of the following, each computed over the users associated with the descriptor: the user number ratio at each age level, at each gender level, at each purchasing power level, at each deal discount rate level, at each activity level, at each level of buyback rate related to the descriptor, and at each level of return visit rate related to the descriptor.

Optionally, the second characteristic data of the descriptor may be determined by:

step A1: for any user associated with the descriptor (for example, any user who follows or collects the descriptor), statistically analyzing the user's historical behavior data (such as commodity object browsing data, deal data, and search data) within a set time period (such as half a month, three months, six months, or one year) to obtain the following feature data of the user:

the age level (for example, 18 or below, 19 to 25, 26 to 35, or 36 or above);

the gender;

the purchasing power level (for example, 500 or below, 501 to 2000, 2001 to 4000, or 4001 or above);

the deal discount rate level (where the deal discount rate is the ratio of the amount the user actually paid to the original price of the corresponding commodity object, and the levels may be, for example, 50% or below, 50% to 70%, or above 70%);

the activity level (where the user's activity may be obtained by a weighted calculation over the user's operation behavior data within a set time period, such as data on specified behaviors performed on commodity objects, for example clicking, collecting, and purchasing);

the level of buyback rate related to the descriptor (where the buyback rate related to the descriptor is the ratio of the number of times the user purchased commodity objects under the descriptor within a set number of days to that number of days); and

the level of return visit rate related to the descriptor (where the return visit rate related to the descriptor is the ratio of the number of times the user browsed commodity objects under the descriptor within the set number of days to that number of days);

step A2: counting, for the users associated with the descriptor, the user number ratio at each age level (for example, assuming the age levels are 18 or below, 19 to 25, 26 to 35, and 36 or above, the ratio of the number of associated users at each of these levels to the total number of users associated with the descriptor), the user number ratio at each gender level, the user number ratio at each purchasing power level, the user number ratio at each deal discount rate level, the user number ratio at each activity level, the user number ratio at each level of buyback rate related to the descriptor, and the user number ratio at each level of return visit rate related to the descriptor, to obtain the second feature data of the descriptor.
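Step A2 amounts to turning per-user levels into per-level proportions. The sketch below does this for the age dimension only, using the four example age levels from the text; the helper names are hypothetical, and the other dimensions (gender, purchasing power, and so on) would follow the same pattern.

```python
from collections import Counter

AGE_LEVELS = ["<=18", "19-25", "26-35", ">=36"]  # example levels from the text

def age_level(age):
    if age <= 18:
        return "<=18"
    if age <= 25:
        return "19-25"
    if age <= 35:
        return "26-35"
    return ">=36"

def level_ratios(user_levels, levels):
    """Share of associated users at each level, in the order given by `levels`."""
    counts = Counter(user_levels)
    total = len(user_levels)
    return [counts[level] / total if total else 0.0 for level in levels]

# Ages of the users associated with one descriptor (made-up data).
ages = [17, 22, 24, 30, 41, 23, 35, 19]
age_ratios = level_ratios([age_level(a) for a in ages], AGE_LEVELS)
```

The resulting list can be concatenated with the ratios for the other dimensions to form the second feature data vector of the descriptor.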

Optionally, in order to reduce the complexity of data processing and at the same time to further improve the accuracy of descriptor classification, in the present embodiment, when determining the second feature data of each descriptor for characterizing the features of the user associated with the descriptor, only the second feature data for characterizing the features of the key user associated with the descriptor may be generally determined.

Wherein, for any descriptor, the key user of the descriptor can be determined by the following modes:

step B1: acquiring all users associated with the descriptor, namely all following users of the descriptor, such as users who follow or collect the descriptor and/or the commodity objects under the descriptor;

step B2: acquiring the historical operation behavior data of each following user that relates to the descriptor;

for any following user, the historical operation behavior data relating to the descriptor may be user behavior data obtained by counting the specified behaviors (such as clicking, collecting, and purchasing) that the user performed on the commodity objects under the descriptor within a set time period (such as half a month, three months, six months, or one year), for example any one or more of the number of clicks, the number of collections, and the number of purchases;

step B3: according to the historical operation behavior data of each following user that relates to the descriptor and the weight corresponding to each kind of historical operation behavior data (the weights can be set flexibly according to actual conditions), computing a weighted sum of the historical operation behavior data of each following user to obtain that user's preference score for the descriptor, namely the user's heat degree with respect to the descriptor;

step B4: according to the heat degree of each following user with respect to the descriptor, selecting the users whose heat degree is not less than a set heat degree threshold (which can be set flexibly according to actual conditions) as the key users of the descriptor.

That is, the key user of each descriptor refers to a user whose heat degree related to the descriptor is not less than a set heat degree threshold (which can be flexibly set), and details thereof are not repeated.
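Steps B1 to B4 can be illustrated with a weighted sum followed by a threshold test. The weights, the threshold, and the behavior names below are assumed values chosen for the example; the text leaves all of them to be set according to actual conditions.

```python
# Assumed weights per behavior type and an assumed heat degree threshold.
WEIGHTS = {"click": 1.0, "collect": 3.0, "purchase": 5.0}
HEAT_THRESHOLD = 10.0

def heat_degree(behavior_counts, weights=WEIGHTS):
    """Step B3: weighted sum of a user's behavior counts for one descriptor."""
    return sum(weights[behavior] * n for behavior, n in behavior_counts.items())

def key_users(history, threshold=HEAT_THRESHOLD):
    """Step B4: users whose heat degree is not less than the threshold."""
    return [user for user, counts in history.items()
            if heat_degree(counts) >= threshold]

# Step B2: per-user behavior counts for one descriptor (made-up data).
history = {
    "user_a": {"click": 4, "collect": 1, "purchase": 1},  # 4 + 3 + 5 = 12
    "user_b": {"click": 6, "collect": 0, "purchase": 0},  # 6
    "user_c": {"click": 0, "collect": 0, "purchase": 2},  # 10
}
selected = key_users(history)
```

Because the condition is "not less than", a user whose heat degree equals the threshold exactly (user_c here) is still selected.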

Further, after step 101 and before step 102, or at the same time as or before step 101, a descriptor sample set may first be determined (a descriptor sample being a descriptor whose category is already known), and a classification model may be obtained from the feature data of each descriptor sample in the descriptor sample set and the category corresponding to each descriptor sample. For example, the classification model may be trained on the feature data and corresponding categories of the descriptor samples using an algorithm such as an SVM (Support Vector Machine), which is not limited herein.

As with the feature data of the descriptors to be classified, the feature data of each descriptor sample may include first feature data characterizing attributes of the descriptor sample itself and second feature data characterizing features of the users (such as key users) associated with the descriptor sample, which is not described in detail herein.

Optionally, the descriptor sample in the descriptor sample set may include a descriptor positive sample, and a descriptor negative sample, where:

the descriptor positive sample refers to a descriptor sample of which the comprehensive evaluation index is not lower than a set first index threshold (can be flexibly set);

the descriptor negative sample refers to a descriptor sample whose comprehensive evaluation index is not higher than a set second index threshold (which can be set flexibly); the second index threshold is not higher than the first index threshold; the comprehensive evaluation index of each descriptor is a parameter characterizing the performance of the descriptor and is determined from the feature data of the descriptor (for example, it may be obtained as a weighted sum of the feature data of the descriptor using the weight corresponding to each item of feature data, which is not limited herein).
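The weighted-sum form of the comprehensive evaluation index and the two-threshold labeling can be sketched as below. The weights, thresholds, and feature vectors are illustrative assumptions, and the function names are hypothetical.

```python
def comprehensive_index(features, weights):
    """Weighted sum of a descriptor's feature values (one option named in the text)."""
    return sum(w * f for w, f in zip(weights, features))

def label_sample(index, first_threshold, second_threshold):
    """Return 'positive', 'negative', or None for a descriptor that is neither."""
    if second_threshold > first_threshold:
        raise ValueError("second threshold must not exceed first threshold")
    if index >= first_threshold:
        return "positive"
    if index <= second_threshold:
        return "negative"
    return None

weights = [0.5, 0.3, 0.2]               # assumed per-feature weights
descriptors = [
    [0.9, 0.8, 0.7],                    # index about 0.83
    [0.2, 0.1, 0.3],                    # index about 0.19
    [0.5, 0.5, 0.5],                    # index 0.5, between the thresholds
]
labels = [label_sample(comprehensive_index(f, weights), 0.7, 0.3)
          for f in descriptors]
```

Descriptors whose index falls between the two thresholds are left unlabeled in this sketch rather than forced into either sample group.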

It should be noted that, in this embodiment, to make the classification model, and therefore the classification result, more accurate, descriptor samples such as descriptor positive samples may be selected manually by an operator according to the feature data of the descriptors. For example, descriptors with a high user number ratio at the high purchasing power level, the high deal discount rate level, the high activity level, the high buyback rate level, or the high return visit rate level may be manually screened out as descriptor positive samples; descriptors with low tonality, or with a high user number ratio at the low buyback rate level (that is, a buyback rate at the bottom of the industry), may be manually screened out as descriptor negative samples.

Of course, the descriptor sample can also be automatically selected by the system according to the feature data of the descriptor without manual participation and without limitation.

In addition, in this embodiment, in order to ensure balanced sample selection, the number of descriptor negative samples in the descriptor sample set may be not less than the number of descriptor positive samples; for example, it may be equal to or 3 times the number of descriptor positive samples, and details are not repeated here.

Furthermore, it should be noted that the category of a descriptor is not necessarily just high-quality or low-quality. Taking brands as an example, high-quality brands may be further subdivided into international brands, fashion brands, national famous brands, original brands, and the like. Therefore, when selecting descriptor samples, in addition to distinguishing positive from negative samples, the category to which each descriptor sample belongs (for example, the corresponding category identifier) needs to be determined explicitly; details are not repeated here.

Further, after a classification model is obtained, its classification accuracy is usually tested to verify that it meets the set requirement. Therefore, in this embodiment, in addition to the descriptor sample set used for training the classification model (the descriptor training sample set for short), a descriptor sample set for testing the classification accuracy of the model (the descriptor test sample set for short) may also be obtained, and the classification accuracy of the classification model is tested on it.

The descriptor test sample set is selected in a similar way to the descriptor training sample set, either manually or automatically by the system; it may contain positive samples or negative samples, and details are not repeated here.

In addition, it should be noted that the number of samples in the descriptor test sample set may generally be smaller than the number of samples in the descriptor training sample set; for example, the ratio of the two may be 3:7, which is not described in detail herein.
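A 3:7 test-to-training split and an accuracy check might look like the sketch below. Random shuffling with a fixed seed is an assumption (the text allows manual or automatic selection), and the threshold rule used as the classifier is a placeholder for whatever model has been trained.

```python
import random

def split_samples(samples, test_fraction=0.3, seed=0):
    """Shuffle and split labeled samples into (training set, test set)."""
    shuffled = samples[:]
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

def accuracy(predict, test_samples):
    """Fraction of test samples whose predicted category matches the known one."""
    if not test_samples:
        return 0.0
    correct = sum(1 for features, category in test_samples
                  if predict(features) == category)
    return correct / len(test_samples)

# Made-up one-feature samples: values of 5 and above are "positive".
samples = [([i], "positive" if i >= 5 else "negative") for i in range(10)]
train_set, test_set = split_samples(samples)

# Placeholder classifier standing in for the trained model.
def toy_predict(features):
    return "positive" if features[0] >= 5 else "negative"

test_accuracy = accuracy(toy_predict, test_set)
```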

Further, after the corresponding classification model is obtained through training, the operations of steps 102 to 105 may be performed.

Optionally, before the descriptors meeting the following condition are screened out of the descriptor set to be classified based on the prediction result of step 102 (the condition being that the predicted category of the descriptor is consistent with the category of the descriptor sample that has the shortest distance to the descriptor), the method may further include:

determining a comprehensive evaluation index of each descriptor to be classified in the descriptor set to be classified;

according to the comprehensive evaluation index of each descriptor to be classified, screening out a first set number (flexibly settable) of descriptors to be classified, wherein the comprehensive evaluation index is not less than a set third index threshold (flexibly settable) to form a first positive candidate sample set, and screening out a second set number (flexibly settable) of descriptors to be classified, wherein the comprehensive evaluation index is not more than a set fourth index threshold (flexibly settable) to form a first negative candidate sample set, wherein the fourth index threshold is not higher than the third index threshold. In addition, in order to ensure the balance of the selection of the positive and negative samples, the first set number may be equal to the second set number, and of course, may not be equal to each other, which is not described herein again.
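The preliminary screening into first positive and first negative candidate sets can be sketched as follows. Taking the highest-index descriptors first for the positive set and the lowest-index first for the negative set is an assumption; the text only requires a set number of descriptors on the correct side of each threshold. All names and values are hypothetical.

```python
def first_candidate_sets(indices, third_threshold, fourth_threshold,
                         first_number, second_number):
    """indices maps each descriptor to its comprehensive evaluation index."""
    if fourth_threshold > third_threshold:
        raise ValueError("fourth threshold must not exceed third threshold")
    ranked = sorted(indices, key=indices.get, reverse=True)
    positive = [d for d in ranked
                if indices[d] >= third_threshold][:first_number]
    negative = [d for d in reversed(ranked)
                if indices[d] <= fourth_threshold][:second_number]
    return positive, negative

indices = {"brand_a": 0.9, "brand_b": 0.8, "brand_c": 0.75,
           "brand_d": 0.5, "brand_e": 0.2, "brand_f": 0.1}
first_positive, first_negative = first_candidate_sets(
    indices, third_threshold=0.7, fourth_threshold=0.3,
    first_number=2, second_number=2)
```

Here brand_c passes the positive threshold but is cut by the set number, and brand_d falls between the thresholds and enters neither set.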

Correspondingly, screening out from the descriptor set to be classified, based on the prediction result in step 103, the descriptors meeting the condition that the predicted category of the descriptor is consistent with the category of the descriptor sample that has the shortest distance to the descriptor may specifically include:

screening descriptors that meet the following condition from the first positive candidate sample set to form a second positive candidate sample set: the predicted category of the descriptor is consistent with the category of the descriptor sample that has the shortest distance to the descriptor;

screening descriptors that meet the following condition from the first negative candidate sample set to form a second negative candidate sample set: the predicted category of the descriptor is consistent with the category of the descriptor sample that has the shortest distance to the descriptor.

The distance between descriptors may be calculated with any measure capable of calculating the distance between words, such as Euclidean distance or cosine distance, which is not described in detail herein.
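The consistency check against the nearest descriptor sample can be illustrated with the two distance measures named above. Computing the distance over feature vectors is an assumption of this sketch, as are the brand names, categories, and helper functions.

```python
import math

def euclidean_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_distance(a, b):
    """1 minus cosine similarity; assumes neither vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norms

def nearest_sample(features, samples, distance=euclidean_distance):
    """Return (shortest distance, category) over all labeled samples."""
    return min(((distance(features, f), c) for f, c in samples),
               key=lambda pair: pair[0])

def second_candidate_set(candidates, samples, distance=euclidean_distance):
    """Keep candidates whose predicted category matches the nearest sample's."""
    kept = []
    for name, features, predicted in candidates:
        shortest, category = nearest_sample(features, samples, distance)
        if category == predicted:
            kept.append((name, predicted, shortest))
    return kept

samples = [([1.0, 0.0], "international"), ([0.0, 1.0], "original")]
candidates = [("brand_x", [0.9, 0.1], "international"),
              ("brand_y", [0.2, 0.8], "international")]
kept = second_candidate_set(candidates, samples)
```

brand_y is dropped because its nearest sample belongs to a different category than the one predicted for it. The shortest distance is returned alongside each kept descriptor so that a later distance-threshold screen can reuse it.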

That is to say, before descriptors that can serve as descriptor samples are screened from the descriptor set to be classified according to the prediction result, descriptors likely to be high-quality or low-quality samples are preliminarily screened from that set based on the comprehensive evaluation index of each descriptor to be classified. The descriptors that can serve as descriptor samples are then determined from this preliminary selection, which reduces the complexity of subsequent data processing and improves the efficiency of descriptor classification.

Further optionally, after the descriptors meeting the above condition (the predicted category of the descriptor is consistent with the category of the descriptor sample that has the shortest distance to the descriptor) are screened out of the descriptor set to be classified, and before step 104 adds the screened descriptors to the descriptor sample set with the predicted category as the corresponding category to obtain the updated descriptor sample set, the method may further include:

according to the shortest distance between each descriptor and the descriptor samples, screening from the second positive candidate sample set a third set number (flexibly settable) of descriptors whose shortest distance is not greater than a set first distance threshold (which can be set flexibly according to the third set number), and screening from the second negative candidate sample set a fourth set number (flexibly settable) of descriptors whose shortest distance is not greater than a set second distance threshold (which can be set flexibly according to the fourth set number), as the finally screened descriptors.

In order to ensure the balance of the selection of the positive and negative samples, the third set number may be equal to the fourth set number, and of course, may also be different from each other, which is not described in detail herein.

That is, when descriptors that can serve as descriptor samples are screened from the set of descriptors to be classified, it is considered not only whether the predicted category of a descriptor is consistent with the category of the descriptor sample nearest to it, but also whether the distance between the descriptor and that nearest descriptor sample meets a set distance requirement, which ensures the accuracy of the descriptors screened as descriptor samples.
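The two requirements (category consistency with the nearest descriptor sample, and a bound on the shortest distance) can be illustrated with the sketch below. It assumes Euclidean distance over plain feature tuples; the function names, the descriptor names and all numeric values are hypothetical.

```python
import math


def nearest_sample(features, samples):
    """Return (distance, category) of the descriptor sample closest to `features`."""
    return min(
        ((math.dist(features, s_features), s_category)
         for s_features, s_category in samples),
        key=lambda pair: pair[0],
    )


def screen_candidates(candidates, samples, max_count, distance_threshold):
    """candidates: list of (name, features, predicted_category)."""
    kept = []
    for name, features, predicted in candidates:
        distance, category = nearest_sample(features, samples)
        # Requirement 1: the prediction agrees with the nearest sample's category.
        # Requirement 2: the shortest distance does not exceed the threshold.
        if category == predicted and distance <= distance_threshold:
            kept.append((name, distance))
    kept.sort(key=lambda item: item[1])  # closest descriptors first
    return kept[:max_count]


samples = [((1.0, 1.0), 1), ((0.0, 0.0), 0)]
positive_candidates = [
    ("brand_a", (0.9, 1.0), 1),   # close to the positive sample: kept
    ("brand_b", (0.2, 0.1), 1),   # nearest sample is negative: dropped
    ("brand_c", (1.0, 0.4), 1),   # consistent but too far away: dropped
    ("brand_d", (1.0, 0.95), 1),  # closest of all: kept first
]
selected = screen_candidates(positive_candidates, samples,
                             max_count=2, distance_threshold=0.3)
```

The same function would be applied to the negative candidates with the fourth set number and the second distance threshold.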

Further optionally, the method may further comprise:

if the updated classification model meets the following condition, the classification model is no longer updated, and the result obtained by classifying each descriptor to be classified in the descriptor set to be classified based on the classification model meeting the condition is used as the final classification result:

the classification precision is not less than the set precision threshold, and/or the number of updates is not less than the set update-count threshold.

That is to say, the classification operation may be performed repeatedly on the set of descriptors to be classified until the classification precision of the updated classification model is determined to be not less than the set precision threshold, and/or the number of updates of the classification model is determined to be not less than the set update-count threshold; the result obtained by classifying each descriptor to be classified based on the updated classification model is then taken as the final classification result. In other words, on the basis of self-learning of the descriptor classification model, the learning process of the classification model is constrained by setting a classification precision threshold and/or an update-count threshold, which improves the classification precision of the classification model and the accuracy of the classification result.
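The stopping condition can be expressed as a small predicate. The sketch below is illustrative only: the default thresholds are hypothetical example values, and the `require_both` flag is a hypothetical way of covering both readings of "and/or".

```python
def should_stop(precision, update_count,
                precision_threshold=0.90, update_count_threshold=10,
                require_both=False):
    """Decide whether the classification model should no longer be updated."""
    precise_enough = precision >= precision_threshold
    updated_enough = update_count >= update_count_threshold
    if require_both:
        # "and" reading: both conditions must hold.
        return precise_enough and updated_enough
    # "or" reading: either condition is sufficient.
    return precise_enough or updated_enough
```

For example, `should_stop(0.92, 3)` stops on precision alone, while `should_stop(0.80, 10)` stops because the update budget is exhausted.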

As can be seen from the above, the method for classifying descriptors according to the embodiments of the present application can be specifically implemented by the following steps:

Step C1: classifying each descriptor to be classified in the descriptor set to be classified based on a set classification model, and determining the comprehensive evaluation index of each descriptor to be classified and the category to which it belongs (for example, a category identifier such as the corresponding class table number);

Optionally, the comprehensive evaluation index of a descriptor to be classified may be obtained by weighted summation of its feature data according to the weight corresponding to each item of feature data. The weight corresponding to each item of feature data may be obtained while training the classification model on the feature data of each descriptor sample in the descriptor sample set and the category corresponding to each descriptor sample, which is not described herein again.

Step C2: screening out a first set number of descriptors to be classified, of which the comprehensive evaluation index is not less than a set third index threshold value, according to the comprehensive evaluation index of each descriptor to be classified to form a first positive candidate sample set; screening out a second set number of descriptors to be classified, of which the comprehensive evaluation index is not greater than a set fourth index threshold value, so as to form a first negative candidate sample set;

it should be noted that the number of screened first positive candidate samples and the number of screened first negative candidate samples may be the same, so as to keep the positive and negative descriptor samples balanced. The first set number and the second set number can be flexibly set according to actual conditions: the larger they are, the greater the amount of computation in the screening of the subsequent steps; the smaller they are, the fewer candidate samples are obtained in a single round of screening. Details are not repeated here;

Step C3: respectively calculating the distance (such as the Euclidean distance or the cosine distance) between each descriptor to be classified screened out in step C2 and each descriptor sample, and screening out descriptors meeting the following condition from the first positive candidate sample set to form a second positive candidate sample set: the predicted class table number of the descriptor is consistent with the class table number of the descriptor sample at the shortest distance from the descriptor; and

screening out descriptors meeting the following condition from the first negative candidate sample set to form a second negative candidate sample set: the predicted class table number of the descriptor is consistent with the class table number of the descriptor sample at the shortest distance from the descriptor;

it should be noted that the descriptor sample at the shortest distance from a descriptor to be classified is the descriptor sample most similar to it; if the class numbers of the two are consistent, the correctness of the prediction (classification) result is further ensured;

Step C4: sorting the descriptors in the second positive candidate sample set in ascending order of their shortest distance to the descriptor samples, screening out the first third set number (which can be flexibly set) of descriptors, and adding the screened descriptors, together with information such as the class table number obtained for each of them by the current prediction, to the descriptor positive sample set; and

sorting the descriptors in the second negative candidate sample set in ascending order of their shortest distance to the descriptor samples, screening out the first fourth set number of descriptors, and adding the screened descriptors, together with information such as the class table number obtained for each of them by the current prediction, to the descriptor negative sample set;

Step C5: deleting the descriptors screened out in step C4 from the descriptor set to be classified, and updating the classification model based on the updated descriptor sample set;

Step C6: repeating steps C1 to C5 until the number of updates of the classification model reaches a set update-count threshold (which can be flexibly set, for example 5 or 10), and/or the classification precision of the updated classification model is determined to be not less than the set precision threshold (which can be flexibly set, for example 88% or 90%), so as to obtain a classification model whose robustness and classification precision meet the requirements; the classification result obtained by performing step C1 with the classification model that meets the classification precision and/or update-count condition is taken as the final descriptor classification result.

That is to say, on the basis of self-learning of the descriptor classification model, the self-learning condition of the classification model is constrained by setting the comprehensive evaluation index threshold of the descriptor and the spatial position relationship between the descriptor to be classified and the descriptor sample, and a better balance is obtained between the informativeness and the prediction accuracy of the sample, so that the robustness and the classification accuracy of the classification model are further improved.
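Steps C1 to C6 can be put together in the following end-to-end sketch. Several parts are stand-ins chosen only to keep the example self-contained and are not taken from the embodiment: a nearest-centroid classifier replaces the unspecified classification model, the comprehensive evaluation index is an equal-weight mean of the features rather than a sum with trained weights, positive and negative samples are kept in one list with category 1 or 0, and only the update-count condition of step C6 is implemented because a precision check would need held-out labelled data. All names and values are hypothetical.

```python
import math


def train(samples):
    """Stand-in model: map each category to the centroid of its feature vectors."""
    grouped = {}
    for features, category in samples:
        grouped.setdefault(category, []).append(features)
    return {
        category: tuple(sum(column) / len(vectors) for column in zip(*vectors))
        for category, vectors in grouped.items()
    }


def predict(model, features):
    return min(model, key=lambda category: math.dist(features, model[category]))


def evaluation_index(features):
    return sum(features) / len(features)  # assumption: equal weights


def nearest(features, samples):
    return min(((math.dist(features, f), c) for f, c in samples),
               key=lambda pair: pair[0])


def self_train(samples, pool, high=0.6, low=0.4,
               first_count=4, final_count=2, max_updates=5):
    samples, pool = list(samples), dict(pool)
    model = train(samples)
    for _ in range(max_updates):                       # C6: update-count bound
        # C1: predict a category and compute the index for every pooled descriptor.
        scored = [(n, f, predict(model, f), evaluation_index(f))
                  for n, f in pool.items()]
        # C2: first positive / negative candidate sets by index threshold.
        first_pos = sorted((s for s in scored if s[3] >= high),
                           key=lambda s: -s[3])[:first_count]
        first_neg = sorted((s for s in scored if s[3] <= low),
                           key=lambda s: s[3])[:first_count]
        chosen = []
        for first in (first_pos, first_neg):
            # C3: keep candidates whose prediction matches the nearest sample.
            second = []
            for name, f, predicted, _index in first:
                distance, category = nearest(f, samples)
                if category == predicted:
                    second.append((distance, name, f, predicted))
            # C4: ascending by shortest distance, keep the first few.
            chosen.extend(sorted(second, key=lambda item: item[0])[:final_count])
        if not chosen:
            break                                      # nothing left to promote
        # C4/C5: add to the sample set, remove from the pool, retrain.
        for _distance, name, f, predicted in chosen:
            samples.append((f, predicted))
            del pool[name]
        model = train(samples)
    return model, samples, {n: predict(model, f) for n, f in pool.items()}


seeds = [((0.9, 0.9), 1), ((0.1, 0.1), 0)]
pool = {"brand_a": (0.8, 0.85), "brand_b": (0.2, 0.15), "brand_c": (0.7, 0.6),
        "brand_d": (0.3, 0.35), "brand_e": (0.5, 0.5)}
model, grown_samples, final_result = self_train(seeds, pool)
```

In this toy run the four clearly positive or negative descriptors are promoted into the sample set in the first round, and the ambiguous `brand_e` is never promoted; it is only classified by the final model.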

It should be noted that the method provided in this embodiment can be implemented on an ODPS (distributed mass data processing) platform using programming languages such as HiveQL and Python, and details are not described herein again.

In summary, the descriptor classification method provided in this embodiment may select a descriptor with the most information content from a large number of unclassified descriptors in a loop iteration manner to automatically mark the descriptor, and update the descriptor to an existing descriptor sample set, so as to expand a training set of a classification model and improve robustness and classification accuracy of the classification model, thereby improving accuracy of a descriptor classification result on the basis of saving human resource consumption.

In addition, the self-learning of the classification model can be constrained by setting the comprehensive evaluation index threshold of the descriptors and the spatial position relationship between the descriptors to be classified and the descriptor samples, so that the robustness and the classification precision of the classification model and the accuracy of the classification result are further improved.

Example two:

Based on the same inventive concept, the second embodiment of the present application provides a descriptor classifying device. Fig. 2 is a schematic structural diagram of the device, which includes:

the data acquisition module 201 is configured to determine a set of descriptors to be classified and feature data of each descriptor to be classified in the set of descriptors to be classified;

the classification module 202 is configured to classify each descriptor to be classified in the descriptor set to be classified based on a set classification model, and predict a category of each descriptor to be classified; the set classification model is obtained by training according to the feature data of each descriptor sample in the descriptor sample set and the category corresponding to each descriptor sample; and

That is, the descriptor classifying device provided in this embodiment may first obtain a classification model based on feature data of each descriptor sample and class training corresponding to each descriptor sample; and then classifying the descriptors to be classified based on the classification model, updating the descriptor sample set according to the obtained classification result, updating the classification model based on the updated descriptor sample set, and classifying the descriptors to be classified in the descriptor set to be classified based on the updated classification model. That is to say, the descriptors with the most information content can be selected from a large number of unclassified descriptors in a loop iteration mode to be automatically marked and updated into the existing descriptor sample set, so that the training set of the classification model is expanded, the robustness and the classification precision of the classification model are improved, and the accuracy of the descriptor classification result can be improved on the basis of saving the human resource consumption.

Optionally, the classification module 202 is further configured to:

Optionally, the feature data of each descriptor comprises first feature data for characterizing the self-attribute of the descriptor and second feature data for characterizing the features of the user associated with the descriptor.

That is to say, in this embodiment, when the device classifies the descriptors, not only the characteristics of the descriptors themselves but also the characteristics of the user associated with the descriptors may be considered, so that the selection dimension of the feature data of the descriptors is expanded, and the accuracy of subsequent descriptor classification is improved.

Further, the first characteristic data of each descriptor comprises any one or more of the traffic path source proportion, the tonality characteristic, the quality characteristic, the qualification characteristic, the popularity characteristic and the price level characteristic of the descriptor;

the second characteristic data of each descriptor includes any one or more of a number ratio of users associated with the descriptor at each age level, a number ratio at each gender level, a number ratio at each purchasing power level, a number ratio at each discount rate level, a number ratio at each activity level, a number ratio at each buy-back rate level associated with the descriptor, and a number ratio at each return rate level associated with the descriptor.
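As an illustration of the "number ratio at each level" features, the sketch below computes the share of a descriptor's associated users that falls into each level of one attribute. The age levels and the user data are hypothetical; the same function would apply to gender, purchasing power, or any other levelled attribute.

```python
from collections import Counter


def level_ratios(user_levels, all_levels):
    """Share of associated users at each level of one attribute.

    user_levels: the level of each user associated with the descriptor.
    all_levels:  every level of the attribute, so absent levels get 0.0.
    """
    total = len(user_levels)
    if total == 0:
        return {level: 0.0 for level in all_levels}
    counts = Counter(user_levels)
    return {level: counts[level] / total for level in all_levels}


age_levels = ["18-24", "25-34", "35-44", "45+"]
users_of_descriptor = ["18-24", "25-34", "25-34", "35-44"]
age_ratios = level_ratios(users_of_descriptor, age_levels)
```

Each attribute contributes one ratio per level, and these ratios would be concatenated to form the second feature data of the descriptor.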

Further optionally, in order to reduce the complexity of data processing and further improve the accuracy of descriptor classification, in this embodiment, when determining the second feature data of each descriptor, the data acquisition module 201 may generally determine only the second feature data that characterizes the key users associated with the descriptor, where the key users of a descriptor are the users whose heat degree associated with the descriptor is not less than a set heat degree threshold (which can be flexibly set); details are not described herein again.
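The selection of key users can be sketched as follows, using the definition of the heat degree given in the claims (a weighted sum of a user's historical operation behavior data associated with the descriptor). The behavior types, their weights, and the threshold are hypothetical example values.

```python
def heat_degree(behavior_counts, weights):
    """Weighted sum of one user's historical operation behaviors for a descriptor."""
    return sum(weights[behavior] * count
               for behavior, count in behavior_counts.items())


def key_users(user_behaviors, weights, heat_threshold):
    """Users whose heat degree is not less than the set heat degree threshold."""
    return [user for user, counts in user_behaviors.items()
            if heat_degree(counts, weights) >= heat_threshold]


# Hypothetical behavior types and weights.
weights = {"browse": 1, "favorite": 3, "purchase": 5}
user_behaviors = {
    "u1": {"browse": 10, "favorite": 1, "purchase": 0},  # heat 13
    "u2": {"browse": 2, "favorite": 0, "purchase": 0},   # heat 2
    "u3": {"browse": 1, "favorite": 2, "purchase": 2},   # heat 17
}
selected_users = key_users(user_behaviors, weights, heat_threshold=10)
```

Only the users returned here would then be counted when the per-level ratios of the second feature data are computed.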

Optionally, the descriptor sample in the descriptor sample set includes a descriptor positive sample and a descriptor negative sample, where:

the descriptor negative sample refers to a descriptor sample whose comprehensive evaluation index is not higher than a set second index threshold (which can be flexibly set), wherein the second index threshold is not higher than the first index threshold; and the comprehensive evaluation index of each descriptor is a parameter, determined from the feature data of the descriptor, that characterizes the performance of the descriptor (for example, it may be obtained by weighted summation of the feature data of the descriptor according to the weight corresponding to each item of feature data, which is not limited herein).
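The weighted-sum index and the two index thresholds can be illustrated together. In this sketch the weights, the feature values and the thresholds are hypothetical, and giving the positive label precedence when both thresholds coincide is an assumption; the positive-sample rule (index not lower than the first index threshold) follows the definition given in the claims.

```python
def comprehensive_index(features, weights):
    """Weighted sum of a descriptor's feature data."""
    if len(features) != len(weights):
        raise ValueError("features and weights must have the same length")
    return sum(w * x for w, x in zip(weights, features))


def split_samples(descriptors, weights, first_threshold, second_threshold):
    """Label descriptors as positive or negative samples by their index."""
    if second_threshold > first_threshold:
        raise ValueError("second threshold must not exceed the first threshold")
    positives, negatives = [], []
    for name, features in descriptors.items():
        index = comprehensive_index(features, weights)
        if index >= first_threshold:       # not lower than the first threshold
            positives.append(name)
        elif index <= second_threshold:    # not higher than the second threshold
            negatives.append(name)
        # Descriptors between the two thresholds are left unlabelled.
    return positives, negatives


weights = [0.5, 0.3, 0.2]
descriptors = {
    "brand_a": [0.9, 0.8, 0.7],  # index about 0.83
    "brand_b": [0.5, 0.5, 0.5],  # index about 0.50
    "brand_c": [0.1, 0.2, 0.1],  # index about 0.13
}
positive_samples, negative_samples = split_samples(
    descriptors, weights, first_threshold=0.7, second_threshold=0.3)
```

Leaving a gap between the two thresholds keeps ambiguous descriptors such as `brand_b` out of the initial sample set.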

Optionally, the classification module 202 is further configured to determine the comprehensive evaluation index of each descriptor to be classified in the descriptor set to be classified before screening out, from the descriptor set to be classified, the descriptors whose predicted category is consistent with the category of the descriptor sample at the shortest distance from the descriptor; and

Accordingly, the classification module 202 may be specifically configured to screen, from the first positive candidate sample set, the descriptors that satisfy the following condition to form a second positive candidate sample set: the category of the descriptor is consistent with the category of the descriptor sample at the shortest distance from the descriptor; and

Optionally, the classification module 202 may be further configured to, after screening out from the set of descriptors to be classified the descriptors whose predicted category is consistent with the category of the descriptor sample at the shortest distance from the descriptor, and before adding the screened descriptors to the descriptor sample set to obtain an updated descriptor sample set, screen, according to the shortest distance between each descriptor and the descriptor samples, a third set number (which can be flexibly set) of descriptors whose shortest distance is not greater than a set first distance threshold (which can be flexibly set according to the third set number) from the second positive candidate sample set, and a fourth set number (which can be flexibly set) of descriptors whose shortest distance is not greater than a set second distance threshold (which can be flexibly set according to the fourth set number) from the second negative candidate sample set, as the finally screened descriptors.

In summary, the descriptor classifying device provided in this embodiment may select a descriptor with the most information content from a large number of unclassified descriptors in a loop iteration manner to automatically mark the descriptor, and update the descriptor to an existing descriptor sample set to expand a training set of a classification model and improve robustness and classification accuracy of the classification model, so that accuracy of a descriptor classification result may be improved on the basis of saving human resource consumption.

As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, apparatus (device), or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.

The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (devices) and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

While the preferred embodiments of the present application have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. Therefore, it is intended that the appended claims be interpreted as including preferred embodiments and all alterations and modifications as fall within the scope of the application.

It will be apparent to those skilled in the art that various changes and modifications may be made in the present application without departing from the spirit and scope of the application. Thus, if such modifications and variations of the present application fall within the scope of the claims of the present application and their equivalents, the present application is intended to include such modifications and variations as well.

Claims

1. A method of categorizing descriptors, the method comprising:

2. The method of claim 1, wherein the method further comprises:

3. The method of claim 1, wherein the feature data of each descriptor includes first feature data for characterizing an own attribute of the descriptor and second feature data for characterizing a feature of a user associated with the descriptor.

4. The method of claim 3, wherein the first characteristic data of each descriptor includes any one or more of traffic path source proportion, tonality characteristic, quality characteristic, qualification characteristic, popularity characteristic, and price level characteristic of the descriptor;

5. The method as claimed in claim 3, wherein, for each descriptor, the user associated with the descriptor is a user whose heat degree associated with the descriptor is not less than a set heat degree threshold, and the heat degree is obtained, for each concerned user of the descriptor, by weighted summation of that user's historical operation behavior data associated with the descriptor according to the weight corresponding to each item of historical operation behavior data.

6. The method of claim 1, wherein the descriptor samples in the set of descriptor samples comprise descriptor positive samples, and descriptor negative samples, wherein:

the descriptor positive sample is a descriptor sample of which the comprehensive evaluation index is not lower than a set first index threshold;

the descriptor negative sample is a descriptor sample of which the comprehensive evaluation index is not higher than a set second index threshold; wherein the second index threshold is not higher than the first index threshold; and the comprehensive evaluation index of each descriptor is a parameter which is determined according to the characteristic data of the descriptor and is used for characterizing the performance of the descriptor.

7. The method of claim 6, wherein before screening out, from the set of descriptors to be classified, the descriptors whose predicted category is consistent with the category of the descriptor sample at the shortest distance from the descriptor, the method further comprises:

screening out a first set number of descriptors to be classified, of which the comprehensive evaluation index is not less than a set third index threshold value, to form a first positive candidate sample set and a second set number of descriptors to be classified, of which the comprehensive evaluation index is not more than a set fourth index threshold value, to form a first negative candidate sample set according to the comprehensive evaluation index of each descriptor to be classified, wherein the fourth index threshold value is not higher than the third index threshold value;

and screening out, based on the prediction result, from the set of descriptors to be classified the descriptors whose predicted category is consistent with the category of the descriptor sample at the shortest distance from the descriptor specifically comprises:

8. The method of claim 7, wherein after screening out, from the set of descriptors to be classified, the descriptors whose predicted category is consistent with the category of the descriptor sample at the shortest distance from the descriptor, and before adding the screened descriptors to the descriptor sample set to obtain an updated descriptor sample set, the method further comprises:

and screening a third set number of descriptors of which the shortest distance is not more than a set first distance threshold value from the second positive candidate sample set according to the shortest distance between each descriptor and each descriptor sample, and screening a fourth set number of descriptors of which the shortest distance is not more than a set second distance threshold value from the second negative candidate sample set to serve as finally screened descriptors.

9. A descriptor classifying apparatus, characterized in that the apparatus comprises:

10. The apparatus of claim 9, wherein the classification module is further configured to:

11. The apparatus of claim 9, wherein the feature data of each descriptor includes first feature data for characterizing an own attribute of the descriptor and second feature data for characterizing a feature of a user associated with the descriptor.

12. The apparatus of claim 11, wherein the first characteristic data of each descriptor includes any one or more of traffic path source proportion, tonality characteristic, quality characteristic, qualification characteristic, popularity characteristic, and price level characteristic of the descriptor;

13. The apparatus as claimed in claim 11, wherein, for each descriptor, the user associated with the descriptor is a user whose heat degree associated with the descriptor is not less than a set heat degree threshold, and the heat degree is obtained, for each concerned user of the descriptor, by weighted summation of that user's historical operation behavior data associated with the descriptor according to the weight corresponding to each item of historical operation behavior data.

14. The apparatus of claim 9, wherein the descriptor samples in the set of descriptor samples comprise descriptor positive samples, and descriptor negative samples, wherein:

15. The apparatus of claim 14,

the classification module is further configured to determine the comprehensive evaluation index of each descriptor to be classified in the descriptor set to be classified before screening out, from the descriptor set to be classified, the descriptors whose predicted category is consistent with the category of the descriptor sample at the shortest distance from the descriptor; and

the classification module is specifically configured to screen, from the first positive candidate sample set, the descriptors that satisfy the following condition to form a second positive candidate sample set: the category of the descriptor is consistent with the category of the descriptor sample at the shortest distance from the descriptor; and

16. The apparatus of claim 15,

the classification module is further configured to, after screening out from the descriptor set to be classified the descriptors whose predicted category is consistent with the category of the descriptor sample at the shortest distance from the descriptor, and before adding the screened descriptors to the descriptor sample set to obtain an updated descriptor sample set, screen, according to the shortest distance between each descriptor and the descriptor samples, a third set number of descriptors whose shortest distance is not greater than a set first distance threshold from the second positive candidate sample set, and a fourth set number of descriptors whose shortest distance is not greater than a set second distance threshold from the second negative candidate sample set, as the finally screened descriptors.