CN112528159B

CN112528159B - Feature quality assessment method and device, electronic equipment and storage medium

Info

Publication number: CN112528159B
Application number: CN202011554160.2A
Authority: CN
Inventors: 李小聪; 魏龙; 王召玺; 王峰
Original assignee: Beijing Baidu Netcom Science and Technology Co Ltd
Current assignee: Beijing Baidu Netcom Science and Technology Co Ltd
Priority date: 2020-12-24
Filing date: 2020-12-24
Publication date: 2024-03-26
Anticipated expiration: 2040-12-24
Also published as: CN112528159A

Abstract

The application discloses a feature quality assessment method, a device, electronic equipment and a storage medium, and relates to the technical field of neural networks. The specific scheme is as follows: extracting at least one actually used field from a sample to be evaluated; converting the at least one actually used field to obtain a feature signature corresponding to the sample to be evaluated; inputting the feature signature corresponding to the sample to be evaluated into a pre-trained offline model, and outputting the weight value of each feature of the sample to be evaluated through the offline model; and calculating at least two evaluation indexes of each feature of the sample to be evaluated based on the weight value of each feature. According to the embodiments of the application, the feature quality of a model can be evaluated from multiple dimensions, and the evaluation indexes are more comprehensive and more readable.

Description

Feature quality assessment method and device, electronic equipment and storage medium

Technical Field

The application relates to the field of artificial intelligence, and further relates to the technical field of neural networks, in particular to a feature quality assessment method, a device, electronic equipment and a storage medium.

Background

In application scenarios in the big data or artificial intelligence field, a recommendation system uses machine learning to build, offline, an interest model relating users and items from user information, item information and the users' past interactions with items; online recommendation then combines this offline model with a recommendation algorithm to deliver accurate personalized recommendations. In terms of architecture, a recommendation system is mainly divided into two parts: online recommendation and offline model training. Offline model training joins and cleans the online recommendation logs and the user logs, generates the corresponding training data according to feature operators, and trains a model with a machine learning algorithm to produce the corresponding machine learning model. Typically, the output of offline training has two parts: one part is a discrete model stored in a remote storage medium, which contains the historical data of all feature values; the other part is a small model matched with the discrete model, which stores the most important network structure information of the new model produced by each training iteration, such as weights and biases.

Currently, the industry uses a generic feature quality evaluation index, obtained as follows: a feature signature (feasign) is extracted from the sample data obtained online by a feature operator; the feasign is used to look up the corresponding weight value (weight) in the discrete model; the weight is concatenated into the input of the model; and the online estimation service combines it with the network node weights and biases of the small model to compute an estimated Q value (that is, an estimated CTR value). However, this method handles multi-level nested combined features poorly and makes it difficult to evaluate systematically and comprehensively how data affects features and how features affect the model. In addition, because the evaluation index is too coarse, it is difficult to explain the specific causes of feature quality problems, and therefore difficult to provide a reliable and effective direction for feature-level model optimization or to help locate Q value fluctuations online.

Disclosure of Invention

The application provides a feature quality assessment method, a device, electronic equipment and a storage medium, which can evaluate the feature quality of a model from multiple dimensions, with evaluation indexes that are more comprehensive and more readable.

In a first aspect, the present application provides a feature quality assessment method, the method comprising:

extracting at least one field actually used in a sample to be evaluated;

converting the at least one field actually used to obtain a feature signature corresponding to the sample to be evaluated;

inputting the feature signature corresponding to the sample to be evaluated into a pre-trained offline model, and outputting the weight value of each feature of the sample to be evaluated through the offline model;

and calculating at least two evaluation indexes of each feature of the sample to be evaluated based on the weight value of each feature of the sample to be evaluated.

In a second aspect, the present application provides a feature quality assessment apparatus, the apparatus comprising: an extraction module, a conversion module, a first calculation module and a second calculation module; wherein,

the extraction module is used for extracting at least one field actually used in the sample to be evaluated;

the conversion module is used for converting the at least one field actually used to obtain a feature signature corresponding to the sample to be evaluated;

the first calculation module is used for inputting the feature signature corresponding to the sample to be evaluated into a pre-trained offline model, and outputting the weight value of each feature of the sample to be evaluated through the offline model;

the second calculation module is used for calculating at least two evaluation indexes of each feature of the sample to be evaluated based on the weight value of each feature of the sample to be evaluated.

In a third aspect, an embodiment of the present application provides an electronic device, including:

one or more processors;

a memory for storing one or more programs,

the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the feature quality assessment method described in any of the embodiments of the present application.

In a fourth aspect, embodiments of the present application provide a storage medium having stored thereon a computer program which, when executed by a processor, implements the feature quality assessment method described in any of the embodiments of the present application.

In a fifth aspect, embodiments of the present application provide a computer program product comprising a computer program which, when executed by a processor, implements the feature quality assessment method according to any of the embodiments of the present application.

It should be understood that the description in this section is not intended to identify key or critical features of the embodiments of the disclosure, nor is it intended to be used to limit the scope of the disclosure. Other features of the present disclosure will become apparent from the following specification.

Drawings

The drawings are for better understanding of the present solution and do not constitute a limitation of the present application. Wherein:

FIG. 1 is a schematic flow chart of a feature quality evaluation method according to an embodiment of the present disclosure;

FIG. 2 is a schematic structural diagram of a mapping relationship among a model, a feature and a sample according to an embodiment of the present application;

FIG. 3 is a second flow chart of a feature quality assessment method according to an embodiment of the present disclosure;

FIG. 4 is a schematic illustration of the coverage and null rate of samples provided by embodiments of the present application;

FIG. 5 is a schematic diagram of statistics of samples provided by embodiments of the present application;

FIG. 6 is a schematic diagram of the influence of sample fields and feature quality problems on a model provided in an embodiment of the present application;

FIG. 7 is a schematic structural diagram of a feature quality evaluation device provided in an embodiment of the present application;

FIG. 8 is a schematic diagram of a conversion module according to an embodiment of the present disclosure;

FIG. 9 is a block diagram of an electronic device for implementing a feature quality evaluation method of an embodiment of the present application.

Detailed Description

Exemplary embodiments of the present application are described below in conjunction with the accompanying drawings, which include various details of the embodiments of the present application to facilitate understanding, and should be considered as merely exemplary. Accordingly, one of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present application. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.

Example 1

Fig. 1 is a schematic flow chart of a feature quality evaluation method provided in an embodiment of the present application, where the method may be performed by a feature quality evaluation apparatus or an electronic device, and the apparatus or the electronic device may be implemented by software and/or hardware, and the apparatus or the electronic device may be integrated into any intelligent device with a network communication function. As shown in fig. 1, the feature quality evaluation method may include the steps of:

S101, extracting at least one actually used field from a sample to be evaluated.

In this step, the electronic device may extract at least one field of actual use in the sample to be evaluated. Specifically, the electronic device may first convert the sample to be evaluated into structured data, and then extract at least one field actually used in the structured data. The field actually used in the embodiment of the present application refers to a field whose content in the sample to be evaluated is not empty, or one or more fields that are predetermined. For example, assume that a certain sample under evaluation may include, but is not limited to, the following fields: commodity name, commodity type, commodity model, customer order time, mailing time, feedback/evaluation information, etc.; one or more of the above fields may be determined by the present application as a field that is actually used.
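Step S101 can be sketched as follows. This is a minimal illustration only, assuming the sample arrives as a JSON string; the function name `extract_used_fields` and the field names are hypothetical and do not come from the application.

```python
import json

def extract_used_fields(raw_sample, predetermined=None):
    """Parse a raw sample into structured data and keep the fields actually used."""
    record = json.loads(raw_sample)  # convert the sample into structured data
    if predetermined is not None:
        # Case 1: the used fields are fixed in advance.
        return {k: record[k] for k in predetermined if k in record}
    # Case 2: a field counts as used when its content is not empty.
    return {k: v for k, v in record.items() if v not in (None, "", [], {})}

# Hypothetical sample with two empty fields.
sample = ('{"commodity_name": "kettle", "commodity_type": "", '
          '"order_time": "2020-08-01 23:00:00", "feedback": null}')
used = extract_used_fields(sample)
```

Here `used` keeps only `commodity_name` and `order_time`, since the other two fields are empty.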

There is a well-known saying in machine learning: features and data determine the upper limit of machine learning, and models and algorithms only approximate this upper limit. In recent years, as computation and storage costs have kept falling, and in order to keep reaching new upper limits, the training data volume of machine learning models has grown from gigabyte to petabyte scale, and the number of features and the width and depth of feature combinations have all reached new highs. The aim is to capture the differences between individual users as fully as possible, so that the model learns users' needs more comprehensively and realistically and makes inferences that better match user expectations. For example, features related to user behavior used to consider click history over three days to one week at most, but this window has now been relaxed to one month or even three months; for matching features, ten or more types of interaction data on the user side and the resource side can be considered; quality-related combined features can be considered from dozens of dimensions, such as title and body text, novelty, clickbait, style, relevance, grammar, structure, attractiveness and information content. Taking the features of a DNN model at the rank layer of a recommendation system as an example, the model has fewer than 500 input features, but more than two thousand inputs once these are expanded according to the sample fields they depend on.

S102, converting at least one field actually used to obtain a characteristic signature corresponding to the sample to be evaluated.

In this step, the electronic device may convert the at least one actually used field to obtain a feature signature corresponding to the sample to be evaluated. Specifically, the electronic device may first splice and combine the at least one actually used field to obtain a plaintext corresponding to the at least one actually used field, and then apply a preset signature algorithm to this plaintext to obtain the feature signature corresponding to the sample to be evaluated.
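The splice-then-sign conversion can be sketched as follows. The application does not name the preset signature algorithm, so the use of MD5 truncated to 64 bits, the separator, and the function name `feature_signature` are all assumptions made for illustration.

```python
import hashlib

def feature_signature(fields, sep="\x01"):
    """Splice the used fields into a plaintext and hash it into a 64-bit signature."""
    # Sort by field name so the same fields always produce the same plaintext.
    plaintext = sep.join(f"{name}={fields[name]}" for name in sorted(fields))
    # Assumed signature algorithm: first 8 bytes of the MD5 digest.
    digest = hashlib.md5(plaintext.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")

sig_a = feature_signature({"user_id": "u1", "item_id": "i9"})
sig_b = feature_signature({"item_id": "i9", "user_id": "u1"})  # same fields, other order
sig_c = feature_signature({"user_id": "u2", "item_id": "i9"})
```

Sorting the field names before splicing is a design choice of this sketch: it makes the signature independent of the order in which fields were extracted.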

S103, inputting the feature signature corresponding to the sample to be evaluated into a pre-trained offline model, and outputting the weight value of each feature of the sample to be evaluated through the offline model.

In this step, the electronic device may input the feature signature corresponding to the sample to be evaluated into a pre-trained offline model, and output the weight value of each feature of the sample to be evaluated through the offline model. Preferably, before doing so, the electronic device may also input into the offline model a predetermined correspondence between at least one sample field and at least one online estimation model, and a correspondence between the at least one online estimation model and the features of each online estimation model; wherein the correspondence between the at least one sample field and the at least one online estimation model is a many-to-many relationship, and the correspondence between the features of each online estimation model and that online estimation model is a many-to-one relationship.

Because features are nested at multiple levels, they are divided into single features, combined features and intermediate features (which are not input into the model directly), and each model has hundreds or thousands of features (including single features and combined features). To explore the influence of features on the model effect, the mapping relationship among model, feature and sample must first be clarified. The quality of the sample fields affects the quality of the model features, which in turn affects the model effect; and although the number of sample fields is very large, only the sample fields that participate in the calculation of model features affect the model effect.

Fig. 2 is a schematic structural diagram of the mapping relationships among models, features and samples provided in an embodiment of the present application. As shown in fig. 2, the correspondence between sample fields and online estimation models is a many-to-many relationship; the correspondence between the features of an online estimation model and that online estimation model is a many-to-one relationship; and the correspondence between sample fields and the features of an online estimation model is a many-to-many relationship.
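The three relationships can be held in simple lookup tables. The sketch below is one possible representation, not the structure used in the application; the feature, model and field names in `FEATURES` are hypothetical.

```python
from collections import defaultdict

# Hypothetical registry: feature -> (owning model, sample fields it depends on).
# Each feature belongs to exactly one model (many-to-one).
FEATURES = {
    "f_user_click_7d": ("ctr_model", ["user_id", "click_history"]),
    "f_title_quality": ("ctr_model", ["title"]),
    "f_user_item": ("rank_model", ["user_id", "item_id"]),
}

def build_mappings(features):
    """Derive model->features, field->features and field->models lookups."""
    model_to_features = defaultdict(set)
    field_to_features = defaultdict(set)
    field_to_models = defaultdict(set)
    for feature, (model, fields) in features.items():
        model_to_features[model].add(feature)
        for field in fields:
            field_to_features[field].add(feature)  # many-to-many
            field_to_models[field].add(model)      # many-to-many
    return model_to_features, field_to_features, field_to_models

model_to_features, field_to_features, field_to_models = build_mappings(FEATURES)
```

With such tables, a quality problem in one sample field (for example `user_id`) can be traced to every feature and every model it feeds.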

S104, calculating at least two evaluation indexes of each feature of the sample to be evaluated based on the weight value of each feature of the sample to be evaluated.

In this step, the electronic device may calculate at least two evaluation indexes of each feature of the sample to be evaluated based on the weight value of each feature of the sample to be evaluated. Specifically, the at least two evaluation indexes include at least two of the following: sample consistency, sample coverage, sample null rate, sample density, sample statistics, sample frequency of use, feature signature ratio after de-duplication, and feature importance.

With the feature quality assessment method provided by the application, the evaluation indexes are more comprehensive: a complete evaluation index system is built from the samples the model depends on, the features, and the intermediate product feasign, covering both the whole and the details. From the model's offline AUC or online click-through rate (for a CTR model), one can drill down to feature importance, the feasign ratio and the feasign ratio after de-duplication, and then further down to the consistency, coverage and null rate, distribution/density, frequency of use and statistics (mean, maximum, minimum, variance, median, etc.) of the sample fields. The evaluation indexes are also more readable: they are linked layer by layer, so that a fluctuation in an overall index can be explained by the detail indexes, and an improvement in a detail index is reflected in the overall index. In particular, for multi-level nested combined features built from several features, when the Q value fluctuates, the fluctuations of each dimension's indexes can be compared across the features to quickly locate the abnormal sample field; when the model features are redundant, an effective optimization scheme can be given quickly by comparing the index data of each feature.

According to this technical scheme, the feature quality of the model can be evaluated from multiple dimensions, and the evaluation indexes are more comprehensive and more readable.

The feature quality assessment method provided by the embodiment of the application first extracts at least one actually used field from a sample to be evaluated; then converts the at least one actually used field to obtain a feature signature corresponding to the sample to be evaluated; then inputs the feature signature into a pre-trained offline model, which outputs the weight value of each feature of the sample to be evaluated; and finally calculates at least two evaluation indexes of each feature of the sample to be evaluated based on the weight value of each feature. That is, the present application can calculate at least two evaluation indexes of each feature of a sample to be evaluated based on the weight values of the features. In existing feature quality evaluation methods, by contrast, the feature quality can only be evaluated according to the estimated Q value. By calculating at least two evaluation indexes per feature from the feature weight values, the application overcomes the technical problems of the prior art, in which the feature quality can only be evaluated according to the estimated Q value, the evaluation index is too limited, and finer-grained comprehensive index evaluation is lacking for complex combined features. In addition, the technical scheme of the embodiment of the application is simple to implement, easy to popularize and widely applicable.

Example 2

Fig. 3 is a second flow chart of the feature quality evaluation method according to the embodiment of the present application. This embodiment further optimizes and extends the above technical solution, and can be combined with the various optional implementations described above. As shown in fig. 3, the feature quality evaluation method may include the steps of:

S301, extracting at least one actually used field from a sample to be evaluated.

S302, splicing and combining at least one field actually used to obtain a plaintext corresponding to the at least one field actually used.

S303, applying a preset signature algorithm to the plaintext corresponding to the at least one actually used field to obtain a feature signature corresponding to the sample to be evaluated.

S304, inputting the feature signature corresponding to the sample to be evaluated into a pre-trained offline model, and outputting the weight value of each feature of the sample to be evaluated through the offline model.

S305, inputting the weight value of each feature of the sample to be evaluated into a pre-trained index evaluation model, and outputting at least two evaluation indexes of each feature of the sample to be evaluated through the index evaluation model.

In this step, the electronic device may input the weight value of each feature of the sample to be evaluated into the pre-trained index evaluation model, and output at least two evaluation indexes of each feature of the sample to be evaluated through the index evaluation model; wherein the at least two evaluation indexes include at least two of: sample consistency, sample coverage, sample null rate, sample density, sample statistics, sample frequency of use, feature signature ratio after de-duplication, and feature importance. Specifically, the electronic device may first extract a node from the pre-trained index evaluation network as the current node and take the feature signature corresponding to the sample to be evaluated as the input vector of the current node; it then inputs this vector into the current node, which outputs the input vector of the next node, and repeats the operation of extracting the current node until the last node in the pre-trained index evaluation network outputs the at least two evaluation indexes of each feature of the sample to be evaluated.
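The node-by-node traversal described above amounts to a sequential forward pass. A minimal sketch follows, assuming each node can be modeled as a callable; the function name `run_index_evaluation` and the two placeholder nodes are hypothetical and stand in for the trained nodes, which the application does not specify.

```python
def run_index_evaluation(nodes, input_vector):
    """Pass a vector through the nodes in order and return the last node's output."""
    current = input_vector
    for node in nodes:           # extract the next node as the current node
        current = node(current)  # its output is the input vector of the next node
    return current

# Placeholder nodes, purely illustrative; a trained network would supply real ones.
def scale(vector):
    return [2.0 * x for x in vector]

def summarise(vector):
    return {"sum": sum(vector), "max": max(vector)}

result = run_index_evaluation([scale, summarise], [0.5, 1.5])
```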

The feature quality evaluation indexes of the model are described in detail below:

1) Consistency of samples: this refers to the consistency of samples and features between online and offline. Because the online and offline data sources and transmission paths in a recommendation system are independent of each other, inconsistencies are likely. Differences between online and offline data can leave some features without corresponding large-model data learned in offline training, so that the online estimate deviates from the true expected value.

2) Coverage of samples and null rate of samples: the coverage and the null rate of the samples determine, to a large extent, the quality of a sample field. Fig. 4 is a schematic diagram of the coverage and null rate of samples provided by embodiments of the present application. As shown in fig. 4, the higher the coverage and the lower the null rate, the better a sample field is generally considered to be; such a field helps characterize the features more accurately, and this conclusion has been verified in multiple online fluctuations.

3) Sample density: for non-numeric fields, the density of a sample field can be assessed by calculating the distribution and share of its top values; for example, a field in which one value accounts for 99% of the records and all other values account for the remaining 1% is not recommended for direct use.

4) Statistics of samples: for numerical sample fields, the differences in sample values can be evaluated with the help of indexes such as the maximum, minimum, mean, median and variance. Fig. 5 is a schematic diagram of statistics of samples provided in an embodiment of the present application. As shown in fig. 5, for 2020-08-01 23:00:00, the median is -1.0000, the mean is -0.8239, the standard deviation is 0.7283, the minimum is -1.0000, and the maximum is 3.0000.

5) Frequency of use of samples: the frequency of use of a sample field reflects, from another angle, how broadly the field influences the model effect.

6) Ratio of the feature signature: the feature signature (feasign) ratio depends to a certain extent on the coverage of the sample; the higher the coverage of the feature, the fewer values are missing. The ratio also indirectly reflects the storage cost of the feature.

7) Ratio of the feature signature after de-duplication: the feasign ratio shows the feature coverage, but it cannot reveal fixed values, null values or densely concentrated value distributions. The feasign ratio after de-duplication gives a more accurate sense of the richness and diversity of the feature values, and also shows the direction of feature optimization more intuitively.

8) Importance of features: this is a conventional evaluation index. The input of a given feature is randomly replaced, and the resulting output AUC is compared with the original AUC; the larger the change, the more important the feature. AUC (Area Under Curve) is defined as the area enclosed by the ROC curve and the coordinate axes, so its value cannot exceed 1. Further, since the ROC curve generally lies above the line y = x, the AUC generally ranges between 0.5 and 1.
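The feature importance index in item 8) corresponds to what is commonly called permutation importance. The sketch below is a minimal pure-Python illustration under stated assumptions: AUC is computed by pairwise comparison of positive and negative scores, the model is any callable `predict`, and the toy data and the names `auc` and `permutation_importance` are hypothetical.

```python
import random

def auc(labels, scores):
    """AUC as the probability that a random positive outscores a random negative."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def permutation_importance(predict, rows, labels, feature, seed=0):
    """Drop in AUC after shuffling one feature's values across the rows."""
    base = auc(labels, [predict(r) for r in rows])
    values = [r[feature] for r in rows]
    random.Random(seed).shuffle(values)  # randomly replace the feature's input
    shuffled = [{**r, feature: v} for r, v in zip(rows, values)]
    return base - auc(labels, [predict(r) for r in shuffled])

# Toy data: the stand-in model reads only "signal", so "noise" cannot matter.
labels = [0, 0, 0, 0, 1, 1, 1, 1]
signal = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]
rows = [{"signal": s, "noise": i % 3} for i, s in enumerate(signal)]

def predict(row):
    return row["signal"]

base_auc = auc(labels, [predict(r) for r in rows])
noise_importance = permutation_importance(predict, rows, labels, "noise")
signal_importance = permutation_importance(predict, rows, labels, "signal")
```

A feature the model ignores gets an importance of exactly zero, while a feature the model relies on can only lose AUC when shuffled in this toy setup, because the unshuffled AUC is already 1.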

The scheme currently adopted in the industry has a single evaluation index. For complex combined features it lacks comprehensive evaluation indexes at both fine and coarse granularity, and can only assess the influence of a single point from the global evaluation. The optimization scheme that follows from the evaluation result is equally limited, namely cutting the features with low contribution factors. It cannot trace the root cause of a problem from the result, cannot improve feature quality at the source, and cannot search for alternative high-quality sample fields to replace sample fields that cannot be optimized. Ideally, cutting should be adopted only when such optimization is not sufficient, removing inefficient features on the basis of cost and benefit so as to ensure the healthy iteration of the model features. The existing scheme is also not conducive to locating, analyzing and solving Q value fluctuations caused by feature quality. The present application can evaluate the feature quality of the model from multiple dimensions, for example the correctness (such as consistency and statistics), completeness (coverage and null rate) and variability (density, statistics and the like) of the sample fields, and the variability (ratio after de-duplication) and completeness (ratio) of the feasign. These indexes reflect data quality at different levels and are positively correlated with feature importance. The frequency of use of a sample field reflects the amplification factor of that field's quality problems in the model, and fields that affect features of higher importance should be the focus of data governance, data monitoring and the like. Quality problems in sample fields and features propagate to the model, affecting the model effect and also making the model bloated.
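Several of the sample-field indexes listed above can be computed directly from raw records. The application does not give formulas, so the definitions in this sketch are assumptions: coverage is taken as the share of samples that carry the field at all, null rate as the share of carried values that are empty, and the distinct ratio is only a rough field-level analogue of the feasign ratio after de-duplication. The names `field_metrics` and `numeric_stats` are hypothetical.

```python
import statistics
from collections import Counter

def field_metrics(rows, field):
    """Coverage, null rate, top-value share and distinct ratio of one sample field.

    Assumes `rows` is a non-empty list of dicts with hashable field values.
    """
    carried = [r[field] for r in rows if field in r]
    non_empty = [v for v in carried if v not in (None, "")]
    counts = Counter(non_empty)
    top_count = counts.most_common(1)[0][1] if counts else 0
    return {
        "coverage": len(carried) / len(rows),
        "null_rate": 1 - len(non_empty) / len(carried) if carried else 1.0,
        # Density check: a share near 1.0 means one value dominates the field.
        "top_value_share": top_count / len(non_empty) if non_empty else 0.0,
        # Diversity check: distinct values over non-empty values.
        "distinct_ratio": len(counts) / len(non_empty) if non_empty else 0.0,
    }

def numeric_stats(values):
    """Basic statistics for a numerical sample field."""
    return {
        "min": min(values),
        "max": max(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "variance": statistics.pvariance(values),
    }

rows = [{"city": "bj"}, {"city": "bj"}, {"city": "sh"}, {"city": ""}, {}]
city = field_metrics(rows, "city")
stats = numeric_stats([-1.0, -1.0, 3.0])
```

In this toy data the field is carried by four of five samples and one carried value is empty, so the assumed coverage is 0.8 and the assumed null rate is 0.25.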

Fig. 6 is a schematic diagram of the influence of the sample field and the feature quality problem on the model according to the embodiment of the present application. As shown in fig. 6, problems that may occur with the sample field include: data duplication, data loss, calculation errors, definition errors; problems that may occur with feature quality include: inconsistent features, incomplete features, non-compliance of features, uncontrollable features and redundant features; problems that may occur with the model include: inaccurate and bloated models.

Example 3

Fig. 7 is a schematic structural diagram of a feature quality evaluation device provided in an embodiment of the present application. As shown in fig. 7, the apparatus 700 includes: an extraction module 701, a conversion module 702, a first calculation module 703 and a second calculation module 704; wherein,

the extracting module 701 is configured to extract at least one field actually used in the sample to be evaluated;

the conversion module 702 is configured to perform conversion processing on the at least one field that is actually used to obtain a feature signature corresponding to the sample to be evaluated;

the first calculation module 703 is configured to input a feature signature corresponding to the sample to be evaluated into a pre-trained offline model, and output a weight value of each feature of the sample to be evaluated through the offline model;

the second calculating module 704 is configured to calculate at least two evaluation indexes of each feature of the sample to be evaluated based on the weight values of each feature of the sample to be evaluated.

Further, the at least two evaluation indexes include at least two of: sample consistency, sample coverage, sample null rate, sample density, sample statistics, sample frequency of use, feature signature ratio after de-duplication, and feature importance.

Fig. 8 is a schematic structural diagram of a conversion module according to an embodiment of the present application. As shown in fig. 8, the conversion module 702 includes: a splice combining sub-module 7021 and a calculating sub-module 7022; wherein,

the splicing and combining submodule 7021 is configured to splice and combine the at least one field that is actually used to obtain a plaintext corresponding to the at least one field that is actually used;

the computing submodule 7022 is configured to apply a preset signature algorithm to the plaintext corresponding to the at least one field that is actually used, to obtain a feature signature corresponding to the sample to be evaluated.

Further, the first calculation module 703 is further configured to input a predetermined correspondence between at least one sample field and at least one online pre-estimation model, and a correspondence between the at least one online pre-estimation model and a feature of each online pre-estimation model into the offline model; wherein the correspondence between the at least one sample field and the at least one online pre-estimation model is a many-to-many relationship; the corresponding relation between the characteristics of each online pre-estimated model and each online pre-estimated model is a many-to-one relation.

Further, the second calculating module 704 is specifically configured to input the weight value of each feature of the sample to be evaluated into a pre-trained index evaluation model, and output at least two evaluation indexes of each feature of the sample to be evaluated through the index evaluation model.

Further, the second calculating module 704 is specifically configured to extract a node from the pre-trained index evaluation network as a current node; taking the characteristic signature corresponding to the sample to be evaluated as an input vector corresponding to the current node; and inputting the input vector corresponding to the current node into the current node, outputting the input vector corresponding to the next node through the current node, and repeatedly executing the operation of extracting the current node until at least two evaluation indexes of each characteristic of the sample to be evaluated output by the last node in the pre-trained index evaluation network are obtained.

The feature quality assessment device can execute the method provided by any embodiment of the present application and has the functional modules and beneficial effects corresponding to that method. For technical details not described in this embodiment, refer to the feature quality assessment method provided in any embodiment of the present application.

Example IV

According to embodiments of the present disclosure, the present disclosure also provides an electronic device, a readable storage medium and a computer program product.

Fig. 9 shows a schematic block diagram of an example electronic device 900 that may be used to implement embodiments of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital assistants, cellular telephones, smartphones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions are meant to be exemplary only and are not meant to limit implementations of the disclosure described and/or claimed herein.

As shown in fig. 9, the device 900 includes a computing unit 901 that can perform various appropriate actions and processes according to a computer program stored in a Read Only Memory (ROM) 902 or a computer program loaded from a storage unit 908 into a Random Access Memory (RAM) 903. The RAM 903 can also store various programs and data required for the operation of the device 900. The computing unit 901, the ROM 902, and the RAM 903 are connected to each other by a bus 904. An input/output (I/O) interface 905 is also connected to the bus 904.

Various components in device 900 are connected to I/O interface 905, including: an input unit 906 such as a keyboard, a mouse, or the like; an output unit 907 such as various types of displays, speakers, and the like; a storage unit 908 such as a magnetic disk, an optical disk, or the like; and a communication unit 909 such as a network card, modem, wireless communication transceiver, or the like. The communication unit 909 allows the device 900 to exchange information/data with other devices through a computer network such as the internet and/or various telecommunications networks.

The computing unit 901 may be a variety of general and/or special purpose processing components having processing and computing capabilities. Some examples of computing unit 901 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various specialized Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, etc. The computing unit 901 performs the respective methods and processes described above, such as the feature quality evaluation method. For example, in some embodiments, the feature quality assessment method may be implemented as a computer software program tangibly embodied on a machine-readable medium, such as the storage unit 908. In some embodiments, part or all of the computer program may be loaded and/or installed onto the device 900 via the ROM 902 and/or the communication unit 909. When the computer program is loaded into the RAM 903 and executed by the computing unit 901, one or more steps of the feature quality evaluation method described above may be performed. Alternatively, in other embodiments, the computing unit 901 may be configured to perform the feature quality assessment method by any other suitable means (e.g. by means of firmware).

Various implementations of the systems and techniques described above may be realized in digital electronic circuitry, integrated circuit systems, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), Systems On Chip (SOCs), Complex Programmable Logic Devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various implementations may include implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be a special-purpose or general-purpose programmable processor that can receive data and instructions from, and transmit data and instructions to, a storage system, at least one input device, and at least one output device.

Program code for carrying out the methods of the present disclosure may be written in any combination of one or more programming languages. The program code may be provided to a processor or controller of a general-purpose computer, special-purpose computer, or other programmable data processing apparatus such that the program code, when executed by the processor or controller, causes the functions/operations specified in the flowchart and/or block diagram to be implemented. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package partly on the machine and partly on a remote machine, or entirely on the remote machine or server.

In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. The machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a Read-Only Memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.

To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and pointing device (e.g., a mouse or trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user may be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic input, speech input, or tactile input.

The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include Local Area Networks (LANs), Wide Area Networks (WANs), blockchain networks, and the internet.

The computer system may include a client and a server. The client and server are typically remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server may be a cloud server, also called a cloud computing server or a cloud host, which is a host product in a cloud computing service system that overcomes the drawbacks of difficult management and weak service scalability found in traditional physical hosts and VPS services. The server may also be a server of a distributed system or a server that incorporates a blockchain.

It should be appreciated that steps may be reordered, added, or deleted using the various forms of flow shown above. For example, the steps recited in the present disclosure may be performed in parallel, sequentially, or in a different order, provided that the desired results of the technical solutions of the present disclosure are achieved; no limitation is imposed herein.

The above detailed description should not be taken as limiting the scope of the present disclosure. It will be apparent to those skilled in the art that various modifications, combinations, sub-combinations and alternatives are possible, depending on design requirements and other factors. Any modifications, equivalent substitutions and improvements made within the spirit and principles of the present disclosure are intended to be included within the scope of the present disclosure.

Claims

1. A method of feature quality assessment, the method comprising:

extracting at least one field actually used in a sample to be evaluated;

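converting the at least one field actually used to obtain a feature signature corresponding to the sample to be evaluated;

inputting the feature signature corresponding to the sample to be evaluated into a pre-trained offline model, and outputting a weight value of each feature of the sample to be evaluated through the offline model;

calculating at least two evaluation indexes of each feature of the sample to be evaluated based on the weight value of each feature of the sample to be evaluated;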

wherein the at least two evaluation indexes include at least two of: sample consistency, sample coverage, sample empty rate, sample consistency, sample statistics, sample use frequency, proportion of feature signatures after de-duplication, and feature importance;

wherein the converting the at least one field actually used to obtain a feature signature corresponding to the sample to be evaluated includes:

splicing and combining the at least one field which is actually used to obtain a plaintext corresponding to the at least one field which is actually used;

applying a preset signature algorithm to the plaintext corresponding to the at least one field actually used to obtain the feature signature corresponding to the sample to be evaluated;

wherein the calculating at least two evaluation indexes of each feature of the sample to be evaluated based on the weight value of each feature of the sample to be evaluated includes:

inputting the weight value of each feature of the sample to be evaluated into a pre-trained index evaluation model, and outputting at least two evaluation indexes of each feature of the sample to be evaluated through the index evaluation model.

2. The method of claim 1, further comprising, prior to said inputting the feature signature corresponding to the sample to be evaluated into a pre-trained offline model:

inputting, into the offline model, a predetermined correspondence between at least one sample field and at least one online pre-estimation model, and a correspondence between the at least one online pre-estimation model and the features of each online pre-estimation model; wherein the correspondence between the at least one sample field and the at least one online pre-estimation model is a many-to-many relationship; and the correspondence between the features of each online pre-estimation model and that online pre-estimation model is a many-to-one relationship.

3. The method of claim 1, wherein the inputting the weight value of each feature of the sample to be evaluated into a pre-trained index evaluation model, and outputting at least two evaluation indexes of each feature of the sample to be evaluated through the index evaluation model, comprises:

extracting a node from the pre-trained index evaluation network as a current node; taking the feature signature corresponding to the sample to be evaluated as the input vector corresponding to the current node; inputting the input vector corresponding to the current node into the current node, and outputting the input vector corresponding to the next node through the current node; and repeating the operation of extracting the current node until the at least two evaluation indexes of each feature of the sample to be evaluated, output by the last node in the pre-trained index evaluation network, are obtained.

4. A feature quality assessment apparatus, the apparatus comprising: an extraction module, a conversion module, a first calculation module and a second calculation module; wherein,

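the extraction module is used for extracting at least one field actually used in a sample to be evaluated;

the conversion module is used for converting the at least one field actually used to obtain a feature signature corresponding to the sample to be evaluated;

the first calculation module is used for inputting the feature signature corresponding to the sample to be evaluated into a pre-trained offline model, and outputting a weight value of each feature of the sample to be evaluated through the offline model;

the second calculation module is used for calculating at least two evaluation indexes of each feature of the sample to be evaluated based on the weight value of each feature of the sample to be evaluated;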

wherein, the conversion module includes: splicing and combining sub-modules and calculating sub-modules; wherein,

the splicing and combining sub-module is used for splicing and combining the at least one field which is actually used to obtain a plaintext corresponding to the at least one field which is actually used;

the computing submodule is used for obtaining the feature signature corresponding to the sample to be evaluated by applying a preset signature algorithm to the plaintext corresponding to the at least one field which is actually used;

the second calculation module is specifically configured to input a weight value of each feature of the sample to be evaluated into a pre-trained index evaluation model, and output at least two evaluation indexes of each feature of the sample to be evaluated through the index evaluation model.

5. The apparatus of claim 4, wherein the first calculation module is further configured to input, into the offline model, a predetermined correspondence between at least one sample field and at least one online pre-estimation model, and a correspondence between the at least one online pre-estimation model and the features of each online pre-estimation model; wherein the correspondence between the at least one sample field and the at least one online pre-estimation model is a many-to-many relationship; and the correspondence between the features of each online pre-estimation model and that online pre-estimation model is a many-to-one relationship.

6. The apparatus of claim 4, wherein the second calculation module is specifically configured to extract a node from the pre-trained index evaluation network as a current node; take the feature signature corresponding to the sample to be evaluated as the input vector corresponding to the current node; input the input vector corresponding to the current node into the current node, and output the input vector corresponding to the next node through the current node; and repeat the operation of extracting the current node until the at least two evaluation indexes of each feature of the sample to be evaluated, output by the last node in the pre-trained index evaluation network, are obtained.

7. An electronic device, comprising:

at least one processor; and

a memory communicatively coupled to the at least one processor; wherein,

the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-3.

8. A non-transitory computer-readable storage medium storing computer instructions for causing a computer to perform the method of any one of claims 1-3.

9. A computer program product comprising a computer program which, when executed by a processor, implements the method of any of claims 1-3.