CN112528159A - Characteristic quality evaluation method and device, electronic equipment and storage medium - Google Patents

Characteristic quality evaluation method and device, electronic equipment and storage medium

Info

Publication number
CN112528159A
Authority
CN
China
Prior art keywords
sample
evaluated
feature
model
characteristic
Prior art date
Legal status
Granted
Application number
CN202011554160.2A
Other languages
Chinese (zh)
Other versions
CN112528159B (en)
Inventor
李小聪
魏龙
王召玺
王峰
Current Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd filed Critical Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN202011554160.2A
Publication of CN112528159A
Application granted
Publication of CN112528159B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/953 Querying, e.g. by the use of web search engines
    • G06F16/9535 Search customisation based on user profiles and personalisation
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/213 Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G06N3/08 Learning methods
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02P CLIMATE CHANGE MITIGATION TECHNOLOGIES IN THE PRODUCTION OR PROCESSING OF GOODS
    • Y02P90/00 Enabling technologies with a potential contribution to greenhouse gas [GHG] emissions mitigation
    • Y02P90/30 Computing systems specially adapted for manufacturing

Abstract

The application discloses a feature quality evaluation method and device, electronic equipment and a storage medium, and relates to the technical field of neural networks. The specific scheme is as follows: extracting at least one actually used field from a sample to be evaluated; converting the at least one actually used field to obtain a characteristic signature corresponding to the sample to be evaluated; inputting the feature signature corresponding to the sample to be evaluated into a pre-trained offline model, and outputting the weight value of each feature of the sample to be evaluated through the offline model; and calculating at least two evaluation indexes of each feature of the sample to be evaluated based on the weight values of the features of the sample to be evaluated. According to the embodiments of the application, the feature quality of a model can be evaluated from multiple dimensions, and the evaluation indexes are more comprehensive and more readable.

Description

Characteristic quality evaluation method and device, electronic equipment and storage medium
Technical Field
The application relates to the field of artificial intelligence, further relates to the technical field of neural networks, and particularly relates to a feature quality assessment method and device, electronic equipment and a storage medium.
Background
In application scenarios in the big data or artificial intelligence field, a recommendation system builds an interest model between users and items offline, by machine learning, from user information, item information and the users' historical interaction behaviors on the items, and then combines the offline machine learning model with a recommendation algorithm online to achieve accurate, personalized recommendation. In terms of implementation architecture, a recommendation system is mainly divided into two parts: online recommendation and offline model training. Offline model training splices and cleans the online recommendation logs and the user logs, generates the corresponding training data according to the feature operators, and then performs model training with a machine learning algorithm to produce the corresponding machine learning model. Typically, the offline training output consists of two parts: a discrete model stored in a remote storage medium, which contains the historical data of all feature values; and a small model matched with the discrete model, which stores the most important network structure information of the newly trained model produced by each model iteration, such as weights and biases.
Currently, the industry uses a general feature quality evaluation index, and the specific process is as follows: a feature signature (feasign) is extracted from the sample data collected online through a feature operator, the corresponding weight values (weights) are queried from the discrete model by the feasign and spliced into the input of the model, and the online prediction service completes the calculation by combining these weights with the weights and biases of the small model's network nodes to obtain a predicted Q value (i.e., a predicted CTR value). However, this method is unfriendly to multi-level nested combined features, makes it difficult to systematically and comprehensively evaluate the chain of influence from data to features to model, and, because its evaluation index is too coarse, can hardly explain the specific reasons why feature quality is high or low; it therefore neither provides a reliable and effective direction for optimizing the feature aspect of the model nor helps locate online Q-value fluctuation problems.
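For readability only, the following is a minimal sketch of this conventional evaluation flow, assuming a hash-based feasign, a dictionary-backed discrete model and a single-layer "small model"; all names and structures here are illustrative assumptions rather than the implementation of this application or of any specific prior-art system.

```python
import hashlib
import math

def feasign(plaintext: str) -> int:
    # Hash a feature plaintext into a 64-bit feature signature (feasign); MD5 is an assumption.
    return int(hashlib.md5(plaintext.encode("utf-8")).hexdigest()[:16], 16)

def predict_q(sample_fields: dict, discrete_model: dict, small_model: dict) -> float:
    # 1. Extract feature signatures from the online sample via a (trivial) feature operator.
    signs = [feasign(f"{name}={value}") for name, value in sample_fields.items()]
    # 2. Query the corresponding weight values (weights) from the discrete model.
    weights = [discrete_model.get(s, 0.0) for s in signs]
    # 3. Splice the weights into the input of the small model and combine them with the
    #    small model's own weights and bias at the network node.
    z = sum(w * x for w, x in zip(small_model["w"], weights)) + small_model["b"]
    # 4. The predicted Q value (predicted CTR) is obtained with a sigmoid.
    return 1.0 / (1.0 + math.exp(-z))

# Toy usage with hypothetical values:
discrete_model = {feasign("user_tag=sports"): 0.8, feasign("item_cat=news"): -0.2}
small_model = {"w": [1.0, 1.0], "b": 0.1}
q = predict_q({"user_tag": "sports", "item_cat": "news"}, discrete_model, small_model)
```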
Disclosure of Invention
The application provides a feature quality evaluation method and device, electronic equipment and a storage medium, with which the feature quality of a model can be evaluated from multiple dimensions, using evaluation indexes that are more comprehensive and more readable.
In a first aspect, the present application provides a feature quality assessment method, including:
extracting at least one actually used field from a sample to be evaluated;
converting the at least one actually used field to obtain a characteristic signature corresponding to the sample to be evaluated;
inputting the feature signature corresponding to the sample to be evaluated into a pre-trained offline model, and outputting the weight value of each feature of the sample to be evaluated through the offline model;
and calculating at least two evaluation indexes of each feature of the sample to be evaluated based on the weight value of each feature of the sample to be evaluated.
In a second aspect, the present application provides a feature quality assessment apparatus, the apparatus comprising: an extraction module, a conversion module, a first calculation module and a second calculation module; wherein,
the extraction module is used for extracting at least one actually used field from a sample to be evaluated;
the conversion module is used for converting the at least one actually used field to obtain a characteristic signature corresponding to the sample to be evaluated;
the first calculation module is used for inputting the feature signature corresponding to the sample to be evaluated into a pre-trained offline model, and outputting the weight value of each feature of the sample to be evaluated through the offline model;
the second calculating module is configured to calculate at least two evaluation indexes of each feature of the sample to be evaluated based on the weight value of each feature of the sample to be evaluated.
In a third aspect, an embodiment of the present application provides an electronic device, including:
one or more processors;
a memory for storing one or more programs,
wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the feature quality assessment method of any embodiment of the present application.
In a fourth aspect, the present application provides a storage medium, on which a computer program is stored, where the computer program is executed by a processor to implement the feature quality assessment method according to any embodiment of the present application.
In a fifth aspect, the present application provides a computer program product, including a computer program, which when executed by a processor implements the feature quality assessment method according to any embodiment of the present application.
It should be understood that the statements in this section do not necessarily identify key or critical features of the embodiments of the present disclosure, nor do they limit the scope of the present disclosure. Other features of the present disclosure will become apparent from the following description.
Drawings
The drawings are included to provide a better understanding of the present solution and are not intended to limit the present application. Wherein:
fig. 1 is a first flowchart of a feature quality assessment method according to an embodiment of the present disclosure;
FIG. 2 is a schematic structural diagram of a mapping relationship among models, features, and samples provided in an embodiment of the present application;
FIG. 3 is a second flowchart of a feature quality assessment method provided by an embodiment of the present application;
FIG. 4 is a schematic diagram of coverage and null rates of a sample provided by an embodiment of the present application;
FIG. 5 is a graph illustrating sample statistics provided by an embodiment of the present application;
FIG. 6 is a schematic diagram of the influence of sample fields and feature quality problems on a model provided by an embodiment of the present application;
fig. 7 is a schematic structural diagram of a feature quality evaluation apparatus provided in an embodiment of the present application;
FIG. 8 is a schematic structural diagram of a conversion module provided in an embodiment of the present application;
fig. 9 is a block diagram of an electronic device for implementing the feature quality assessment method of the embodiment of the present application.
Detailed Description
The following description of exemplary embodiments of the present application, taken in conjunction with the accompanying drawings, includes various details of the embodiments of the application to aid understanding, and these details are to be considered exemplary only. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present application. Also, descriptions of well-known functions and constructions are omitted from the following description for clarity and conciseness.
Example one
Fig. 1 is a first flowchart of a feature quality assessment method provided in an embodiment of the present application, where the method may be performed by a feature quality assessment apparatus or an electronic device, where the apparatus or the electronic device may be implemented by software and/or hardware, and the apparatus or the electronic device may be integrated in any intelligent device with a network communication function. As shown in fig. 1, the feature quality evaluation method may include the steps of:
s101, extracting at least one actually used field from a sample to be evaluated.
In this step, the electronic device may extract at least one actually used field from the sample to be evaluated. Specifically, the electronic device may first convert the sample to be evaluated into structured data and then extract at least one actually used field in the structured data. The actually used field in the embodiment of the present application refers to a field in the sample to be evaluated, the content of which is not empty, or one or more predetermined fields. For example, assume that a sample to be evaluated may include, but is not limited to, the following fields: commodity name, commodity type, commodity model, client ordering time, mailing time, feedback/evaluation information and the like; the present application may determine one or more of the above fields as actually used fields.
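As a minimal sketch of this step, assuming the sample has already been converted into a dict-like piece of structured data (the field names and the notion of "non-empty" are illustrative assumptions):

```python
from typing import Optional, Set

def extract_used_fields(sample: dict, predetermined: Optional[Set[str]] = None) -> dict:
    # Keep only the predetermined fields if such a set is given;
    # otherwise keep every field whose content is not empty.
    if predetermined is not None:
        return {k: v for k, v in sample.items() if k in predetermined}
    return {k: v for k, v in sample.items() if v not in (None, "", [], {})}

# Hypothetical sample to be evaluated, represented as structured data:
sample = {"commodity name": "phone", "commodity type": "electronics",
          "client ordering time": "2020-12-01 10:00:00", "feedback": ""}
used_fields = extract_used_fields(sample)   # the empty "feedback" field is dropped
```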
As a common saying in machine learning goes, features and data determine the upper limit of machine learning, while models and algorithms only approximate this upper limit. With computation and storage costs falling continuously in recent years, and in order to keep reaching new upper limits, the training data volume of machine learning models has jumped from the GB level to the PB level, and the number of features and the width and depth of feature combinations keep challenging new upper limits, so as to depict the differences among vast numbers of users as finely as possible, let the machine learning model learn users' more comprehensive and more real needs, and make inferences that better satisfy users. For example, user behavior features used to consider at most the historical click behaviors within three days or a week, but now cover the historical click behaviors within one month or even three months; in feature matching, ten kinds of interaction data on the user side and the resource side may be considered; quality-related combined features may be considered from the text of the title and the body, over dozens of dimensions such as novelty, clickbait, style, relevance, grammar, structure, attractiveness and amount of information. Taking the features of the DNN model of the rank layer in a recommendation system as an example, fewer than 500 features are input into the model, but about two thousand features are generated after expansion according to the sample fields they depend on.
And S102, converting at least one actually used field to obtain a characteristic signature corresponding to the sample to be evaluated.
In this step, the electronic device may perform conversion processing on the at least one actually used field to obtain the feature signature corresponding to the sample to be evaluated. Specifically, the electronic device may first splice and combine the at least one actually used field to obtain a plaintext corresponding to the at least one actually used field, and then pass the plaintext corresponding to the at least one actually used field through a preset signature algorithm to obtain the feature signature corresponding to the sample to be evaluated.
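A minimal sketch of the splicing and signing in this step follows; the separator, slot names and the truncated-MD5 signature are assumptions standing in for the "preset signature algorithm":

```python
import hashlib
from typing import Dict, List

def splice_plaintext(used_fields: Dict[str, str], slots: List[str]) -> str:
    # Splice and combine the actually used fields that one feature depends on into a plaintext.
    return "|".join(f"{slot}={used_fields.get(slot, '')}" for slot in slots)

def sign(plaintext: str) -> int:
    # Preset signature algorithm (assumed here: truncated MD5 -> 64-bit integer).
    return int(hashlib.md5(plaintext.encode("utf-8")).hexdigest()[:16], 16)

# A combined feature built from two sample fields (slot names are hypothetical):
used_fields = {"commodity type": "electronics", "client ordering time": "2020-12-01 10:00:00"}
feature_signature = sign(splice_plaintext(used_fields, ["commodity type", "client ordering time"]))
```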
S103, inputting the feature signature corresponding to the sample to be evaluated into a pre-trained offline model, and outputting the weight value of each feature of the sample to be evaluated through the offline model.
In this step, the electronic device may input the feature signature corresponding to the sample to be evaluated into a pre-trained offline model, and output the weight value of each feature of the sample to be evaluated through the offline model. Preferably, before inputting the feature signature corresponding to the sample to be evaluated into the pre-trained offline model, the electronic device may further input, into the offline model, a predetermined correspondence between at least one sample field and at least one online prediction model, and a correspondence between the at least one online prediction model and the features of each online prediction model; the correspondence between the at least one sample field and the at least one online prediction model is a many-to-many relation, and the correspondence between the features of each online prediction model and that online prediction model is a many-to-one relation.
Because features are nested in multiple levels, single features, combined features and intermediate features (which are not directly input into the model) have to be distinguished, and each model has hundreds of features (including single features and combined features); therefore, in order to explore the influence of features on the model effect, the mapping relations between models, features and samples need to be sorted out first. The quality of the sample fields affects the quality of the model features, and the quality of the features in turn affects the model effect; since the number of sample fields is very large, only the sample fields that participate in the calculation of the model features affect the model effect.
Fig. 2 is a schematic structural diagram of the mapping relationship among models, features and samples provided in the embodiment of the present application. As shown in fig. 2, the correspondence between sample fields and online prediction models is a many-to-many relation; the correspondence between the features of an online prediction model and that online prediction model is a many-to-one relation; and the correspondence between sample fields and the features of the online prediction models is a many-to-many relation.
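The mapping of fig. 2 can be held in two small tables, sketched below with hypothetical feature and field names; following field -> features -> model shows which models a given sample field can influence:

```python
# Many-to-one: each feature belongs to exactly one online prediction model.
feature_to_model = {
    "f_user_click_3d": "rank_dnn",
    "f_title_quality": "rank_dnn",
    "f_user_click_recall": "recall_lr",
}

# Many-to-many: a feature may depend on several sample fields,
# and one sample field may feed several features.
feature_to_fields = {
    "f_user_click_3d": ["user_id", "click_history"],
    "f_title_quality": ["title", "body_text"],
    "f_user_click_recall": ["user_id", "click_history"],
}

def models_influenced_by(field: str) -> set:
    # Sample field -> features -> models.
    return {feature_to_model[f] for f, deps in feature_to_fields.items() if field in deps}

print(models_influenced_by("click_history"))   # {'rank_dnn', 'recall_lr'}
```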
And S104, calculating at least two evaluation indexes of each characteristic of the sample to be evaluated based on the weight value of each characteristic of the sample to be evaluated.
In this step, the electronic device may calculate at least two evaluation indexes of each feature of the sample to be evaluated based on the weight value of each feature of the sample to be evaluated. Specifically, the at least two evaluation indicators include at least two of: the consistency of the sample, the coverage rate of the sample, the vacancy rate of the sample, the density of the sample, the statistic value of the sample, the use frequency of the sample, the proportion of the characteristic signature after the duplication removal and the importance of the characteristic.
By adopting the feature quality evaluation method provided by the application, the evaluation indexes are more comprehensive: a complete evaluation index system, with both overall and detailed views, is established around the samples and features that the model depends on and the intermediate output feasign. Starting from the offline AUC of the model or the online click-to-impression ratio (for CTR models), one can drill down into feature importance, the deduplicated feasign proportion, the consistency of sample fields, their coverage and null rates, their distribution/density, their usage frequency and their statistics (mean, maximum, minimum, variance, median, etc.). The evaluation indexes are also more readable: the indexes are correlated layer by layer, fluctuations in the overall indexes can be explained by the detailed indexes, and optimization of the detailed indexes shows up in the optimization of the overall indexes. In particular, for multi-level nested and multi-feature combined features, when the Q value fluctuates, the abnormal sample field can be quickly located by comparing the fluctuation of all the dimension indexes of the features; when the model features are redundant, an effective optimization scheme can be quickly given by comparing the index data of each feature.
The above technical solution solves the technical problems in the prior art that, with the general feature quality evaluation method, feature quality can only be evaluated according to the predicted Q value, the evaluation index is too singular, and finer-grained, comprehensive index evaluation is lacking for complex combined features.
According to the feature quality evaluation method provided by the embodiment of the application, at least one actually used field is first extracted from a sample to be evaluated; the at least one actually used field is then converted to obtain the feature signature corresponding to the sample to be evaluated; the feature signature corresponding to the sample to be evaluated is input into a pre-trained offline model, and the weight value of each feature of the sample to be evaluated is output through the offline model; and at least two evaluation indexes of each feature of the sample to be evaluated are calculated based on the weight values of the features of the sample to be evaluated. That is to say, the present application can calculate at least two evaluation indexes of each feature of the sample to be evaluated based on the weight values of those features, whereas the existing feature quality evaluation method can evaluate feature quality only according to the predicted Q value. By adopting the technical means of calculating at least two evaluation indexes of each feature based on the weight values of the features, the technical problems in the prior art that feature quality can only be evaluated according to the predicted Q value, that the evaluation index is too singular, and that finer-grained comprehensive index evaluation is lacking for complex combined features are solved; moreover, the technical solution of the embodiment of the application is simple and convenient to implement, easy to popularize, and applicable to a wide range of scenarios.
Example two
Fig. 3 is a second flowchart of the feature quality assessment method according to the embodiment of the present application. This embodiment further optimizes and expands the above technical solution, and can be combined with each of the optional implementations described above. As shown in fig. 3, the feature quality evaluation method may include the following steps:
s301, extracting at least one actually used field from the sample to be evaluated.
S302, splicing and combining at least one actually used field to obtain a plaintext corresponding to the at least one actually used field.
S303, obtaining a characteristic signature corresponding to the sample to be evaluated by passing the plaintext corresponding to the at least one actually used field through a preset signature algorithm.
S304, inputting the feature signature corresponding to the sample to be evaluated into a pre-trained offline model, and outputting the weight value of each feature of the sample to be evaluated through the offline model.
S305, inputting the weight values of the characteristics of the sample to be evaluated into a pre-trained index evaluation model, and outputting at least two evaluation indexes of the characteristics of the sample to be evaluated through the index evaluation model.
In this step, the electronic device may input the weight values of the features of the sample to be evaluated into a pre-trained index evaluation model, and output at least two evaluation indexes of the features of the sample to be evaluated through the index evaluation model; wherein the at least two evaluation indicators include at least two of: the consistency of the sample, the coverage rate of the sample, the vacancy rate of the sample, the density of the sample, the statistic value of the sample, the use frequency of the sample, the proportion of the characteristic signature after the duplication removal and the importance of the characteristic. Specifically, the electronic device may extract a node as a current node from a pre-trained index evaluation network; taking the characteristic signature corresponding to the sample to be evaluated as an input vector corresponding to the current node; inputting an input vector corresponding to the current node into the current node, outputting an input vector corresponding to the next node through the current node, and repeatedly executing the operation of extracting the current node until at least two evaluation indexes of each characteristic of the sample to be evaluated output by the last node in the pre-trained index evaluation network are obtained.
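A minimal sketch of this node-by-node traversal follows, where each node is represented by a plain callable; the node functions themselves are placeholders, not the trained index evaluation network:

```python
from typing import Callable, List, Sequence

def evaluate_through_network(nodes: List[Callable], input_vector: Sequence[float]) -> Sequence[float]:
    # Repeatedly extract the current node: the output vector of the current node
    # becomes the input vector of the next node, until the last node outputs
    # the (at least two) evaluation indexes of each feature.
    vector = input_vector
    for node in nodes:
        vector = node(vector)
    return vector
```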
Each model feature quality evaluation index is described in detail below; a brief computational sketch of several of these indexes is given after the list:
1) Consistency of the sample: this refers to the online-offline consistency of samples and features. In a recommendation system, the online and offline data sources and transmission paths are independent of each other and are therefore liable to be inconsistent; the difference between online and offline data can cause some features to fail to learn, in offline training, the large-model data corresponding to them, so that the online prediction result deviates from the true expected value.
2) Coverage rate of the sample and vacancy (null) rate of the sample: these two indexes largely determine the quality of a sample field. Fig. 4 is a schematic diagram of the coverage rate and null rate of a sample provided in an embodiment of the present application. As shown in fig. 4, a sample field with a higher coverage rate and a lower null rate is generally regarded as a good-quality field, which helps characterize users and items more accurately; this conclusion has also been verified in multiple online fluctuation cases.
3) Density of the sample: for non-numerical fields, the density and distribution of the sample field values can be obtained by calculating the distribution and proportion of the top values; for example, a field in which one value accounts for 99% of occurrences and all remaining values account for the other 1% is not recommended for direct use.
4) Statistics of the sample: for a numerical sample field, the differences among sample values can be evaluated in an auxiliary manner through indexes such as the maximum, minimum, mean, median and variance. Fig. 5 is a schematic diagram of the statistics of a sample provided in an embodiment of the present application. As shown in fig. 5, for the statistics corresponding to 2020-08-01 23:00:00, the median is -1.0000, the mean is -0.8239, the standard deviation is 0.7283, the minimum is -1.0000 and the maximum is 3.0000.
5) Usage frequency of the sample: the usage frequency of a sample field reflects, from another angle, the extent of the field's influence on the model effect.
6) Proportion of the feature signature: the proportion of the feature signature (feasign) depends to a certain extent on the coverage rate of the sample; the higher the coverage rate of the feature, the fewer values are lost. This index also reflects, from the side, the storage cost of the feature.
7) Proportion of the feature signature after deduplication: the feasign proportion can show the coverage rate of a feature, but it cannot reflect situations such as fixed values, null values and densely concentrated value distributions; the deduplicated feasign proportion gives a more accurate sense of the richness and diversity of a feature's values and also makes the direction of feature optimization more visible.
8) Importance of the feature: this is a conventional evaluation index. The input of the specified feature is randomly shuffled, the AUC is recomputed, and the result is compared with the original AUC; the larger the difference, the more important the feature. AUC (Area Under Curve) is defined as the area enclosed under the ROC curve and the coordinate axes; obviously, the value of this area is not larger than 1. Since the ROC curve is generally located above the line y = x, the AUC ranges between 0.5 and 1.
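A plain-Python sketch of how indexes 2), 3), 4), 7) and 8) above could be computed is given below; the model_auc callback and all names are assumptions for illustration, not the index evaluation model described in this application.

```python
import random
import statistics
from collections import Counter
from typing import Callable, List, Optional, Sequence, Tuple

def coverage_and_null_rate(values: Sequence[Optional[str]]) -> Tuple[float, float]:
    # 2) Coverage rate and null (vacancy) rate of a sample field.
    n = len(values)
    non_null = sum(1 for v in values if v not in (None, ""))
    return non_null / n, (n - non_null) / n

def top_value_ratio(values: Sequence[str], k: int = 3) -> List[Tuple[str, float]]:
    # 3) Density: distribution and proportion of the top-k values of a non-numerical field.
    counts = Counter(v for v in values if v not in (None, ""))
    total = sum(counts.values()) or 1
    return [(v, c / total) for v, c in counts.most_common(k)]

def numeric_stats(values: Sequence[float]) -> dict:
    # 4) Statistics of a numerical field: maximum, minimum, mean, median, variance.
    return {"max": max(values), "min": min(values), "mean": statistics.mean(values),
            "median": statistics.median(values), "variance": statistics.pvariance(values)}

def deduplicated_feasign_ratio(feasigns: Sequence[int]) -> float:
    # 7) Proportion of the feature signature after deduplication (value richness).
    return len(set(feasigns)) / len(feasigns) if feasigns else 0.0

def permutation_importance(model_auc: Callable[[List[list], list], float],
                           features: List[list], labels: list, col: int) -> float:
    # 8) Importance: randomly shuffle one feature column, recompute the AUC and
    #    compare with the original AUC; a larger drop means a more important feature.
    base = model_auc(features, labels)
    shuffled = [row[:] for row in features]
    column = [row[col] for row in shuffled]
    random.shuffle(column)
    for row, v in zip(shuffled, column):
        row[col] = v
    return base - model_auc(shuffled, labels)
```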
The scheme currently adopted in the industry has a single evaluation index. For complex combined features, it lacks comprehensive fine-grained and coarse-grained evaluation indexes: only a global evaluation can be made, and the influence of a single point cannot be assessed. Meanwhile, the optimization scheme derived from such an evaluation result is also single, namely pruning the features with low contribution factors; the root cause of a problem cannot be traced back from the result, so it is impossible to improve feature quality at the source, to search for alternative high-quality sample fields for the fields that cannot be optimized, or, if none of these optimizations is satisfactory, to adopt a pruning scheme that, based on a balance of cost and benefit, prunes inefficient features and guarantees healthy iteration of the model features; it is also unhelpful for locating, analyzing and solving the Q-value fluctuation problems caused by feature quality. According to the present application, the feature quality of a model can be evaluated from multiple dimensions. For example, the correctness (e.g., consistency and statistics), completeness (coverage rate and null rate) and differentiation (density, statistics, etc.) of the sample fields, together with the differentiation (deduplicated proportion) and completeness (proportion) of the feasign, show data quality from different layers and are positively correlated with feature importance; the usage frequency of a sample field shows the amplification factor of the field's quality problems in the model, and fields with higher usage frequency, which influence feature importance more, also need to be the focus of data governance, data monitoring and the like. Sample field and feature quality problems are transmitted to the model, affecting the model effect and bloating the model.
Fig. 6 is a schematic diagram of the influence of sample field and feature quality problems on the model according to an embodiment of the present application. As shown in fig. 6, problems that may occur with a sample field include: data duplication, data loss, calculation errors and definition errors; problems that may occur with feature quality include: inconsistent, incomplete, non-conforming, uncontrollable and redundant features; problems that may occur with the model include: inaccurate models and bloated models.
According to the feature quality evaluation method provided by the embodiment of the application, at least one actually used field is first extracted from a sample to be evaluated; the at least one actually used field is then converted to obtain the feature signature corresponding to the sample to be evaluated; the feature signature corresponding to the sample to be evaluated is input into a pre-trained offline model, and the weight value of each feature of the sample to be evaluated is output through the offline model; and at least two evaluation indexes of each feature of the sample to be evaluated are calculated based on the weight values of the features of the sample to be evaluated. That is to say, the present application can calculate at least two evaluation indexes of each feature of the sample to be evaluated based on the weight values of those features, whereas the existing feature quality evaluation method can evaluate feature quality only according to the predicted Q value. By adopting the technical means of calculating at least two evaluation indexes of each feature based on the weight values of the features, the technical problems in the prior art that feature quality can only be evaluated according to the predicted Q value, that the evaluation index is too singular, and that finer-grained comprehensive index evaluation is lacking for complex combined features are solved; moreover, the technical solution of the embodiment of the application is simple and convenient to implement, easy to popularize, and applicable to a wide range of scenarios.
EXAMPLE III
Fig. 7 is a schematic structural diagram of a feature quality evaluation apparatus according to an embodiment of the present application. As shown in fig. 7, the apparatus 700 includes: an extraction module 701, a conversion module 702, a first calculation module 703 and a second calculation module 704; wherein,
the extraction module 701 is configured to extract at least one actually used field from a sample to be evaluated;
the conversion module 702 is configured to perform conversion processing on the at least one actually used field to obtain a feature signature corresponding to the sample to be evaluated;
the first calculating module 703 is configured to input the feature signature corresponding to the sample to be evaluated into a pre-trained offline model, and output a weight value of each feature of the sample to be evaluated through the offline model;
the second calculating module 704 is configured to calculate at least two evaluation indexes of each feature of the sample to be evaluated based on the weight value of each feature of the sample to be evaluated.
Further, the at least two evaluation indicators include at least two of: the consistency of the sample, the coverage rate of the sample, the vacancy rate of the sample, the density of the sample, the statistic value of the sample, the use frequency of the sample, the proportion of the characteristic signature after the duplication removal and the importance of the characteristic.
FIG. 8 is a schematic structural diagram of a conversion module provided in an embodiment of the present application. As shown in fig. 8, the conversion module 702 includes: a splicing combination sub-module 7021 and a calculation sub-module 7022; wherein,
the splicing and combining sub-module 7021 is configured to splice and combine the at least one actually used field to obtain a plaintext corresponding to the at least one actually used field;
the calculating submodule 7022 is configured to obtain a feature signature corresponding to the sample to be evaluated by using a preset signature algorithm on the plaintext corresponding to the at least one actually used field.
Further, the first calculating module 703 is further configured to input, into the offline model, a predetermined correspondence between at least one sample field and at least one online prediction model, and a correspondence between the at least one online prediction model and the features of each online prediction model; the correspondence between the at least one sample field and the at least one online prediction model is a many-to-many relation; the correspondence between the features of each online prediction model and that online prediction model is a many-to-one relation.
Further, the second calculating module 704 is specifically configured to input a weight value of each feature of the sample to be evaluated into a pre-trained index evaluation model, and output at least two evaluation indexes of each feature of the sample to be evaluated through the index evaluation model.
Further, the second calculating module 704 is specifically configured to extract a node from the pre-trained index evaluation network as a current node; taking the feature signature corresponding to the sample to be evaluated as an input vector corresponding to the current node; inputting the input vector corresponding to the current node into the current node, outputting the input vector corresponding to the next node through the current node, and repeatedly executing the operation of extracting the current node until at least two evaluation indexes of each characteristic of the sample to be evaluated, which are output by the last node in the pre-trained index evaluation network, are obtained.
The feature quality evaluation apparatus described above can execute the method provided by any embodiment of the present application, and has the functional modules and beneficial effects corresponding to the executed method. For technical details that are not described in detail in this embodiment, reference may be made to the feature quality assessment method provided in any embodiment of the present application.
Example four
The present disclosure also provides an electronic device, a readable storage medium, and a computer program product according to embodiments of the present disclosure.
FIG. 9 illustrates a schematic block diagram of an example electronic device 900 that can be used to implement embodiments of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital assistants, cellular phones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be examples only, and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in fig. 9, the device 900 includes a computing unit 901, which can perform various appropriate actions and processes in accordance with a computer program stored in a Read-Only Memory (ROM) 902 or a computer program loaded from a storage unit 908 into a Random Access Memory (RAM) 903. In the RAM 903, various programs and data required for the operation of the device 900 can also be stored. The computing unit 901, the ROM 902 and the RAM 903 are connected to each other via a bus 904. An input/output (I/O) interface 905 is also connected to the bus 904.
A number of components in the device 900 are connected to the I/O interface 905, including: an input unit 906 such as a keyboard, a mouse, and the like; an output unit 907 such as various types of displays, speakers, and the like; a storage unit 908 such as a magnetic disk, optical disk, or the like; and a communication unit 909 such as a network card, a modem, a wireless communication transceiver, and the like. The communication unit 909 allows the device 900 to exchange information/data with other devices through a computer network such as the internet and/or various telecommunication networks.
The computing unit 901 may be a variety of general and/or special purpose processing components having processing and computing capabilities. Some examples of the computing unit 901 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various dedicated Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, and so forth. The computing unit 901 performs the respective methods and processes described above, such as the feature quality evaluation method. For example, in some embodiments, the feature quality assessment method may be implemented as a computer software program tangibly embodied in a machine-readable medium, such as the storage unit 908. In some embodiments, part or all of the computer program may be loaded and/or installed onto the device 900 via the ROM 902 and/or the communication unit 909. When the computer program is loaded into the RAM 903 and executed by the computing unit 901, one or more steps of the feature quality assessment method described above may be performed. Alternatively, in other embodiments, the computing unit 901 may be configured to perform the feature quality assessment method by any other suitable means (e.g., by means of firmware).
Various implementations of the systems and techniques described above may be implemented in digital electronic circuitry, integrated circuitry, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), systems on a chip (SOCs), Complex Programmable Logic Devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: being implemented in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, receiving data and instructions from, and transmitting data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for implementing the methods of the present disclosure may be written in any combination of one or more programming languages. These program codes may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the program codes, when executed by the processor or controller, cause the functions/operations specified in the flowchart and/or block diagram to be performed. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), Wide Area Networks (WANs), blockchain networks, and the internet.
The computer system may include clients and servers. A client and a server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server can be a cloud server, also called a cloud computing server or a cloud host, which is a host product in a cloud computing service system and overcomes the defects of high management difficulty and weak service scalability found in traditional physical hosts and VPS (Virtual Private Server) services. The server may also be a server of a distributed system, or a server incorporating a blockchain.
It should be understood that various forms of the flows shown above may be used, with steps reordered, added, or deleted. For example, the steps described in the present disclosure may be executed in parallel, sequentially, or in different orders, as long as the desired results of the technical solutions disclosed in the present disclosure can be achieved, and the present disclosure is not limited herein.
The above detailed description should not be construed as limiting the scope of the disclosure. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made in accordance with design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present disclosure should be included in the scope of protection of the present disclosure.

Claims (15)

1. A method of feature quality assessment, the method comprising:
extracting at least one actually used field from a sample to be evaluated;
converting the at least one actually used field to obtain a characteristic signature corresponding to the sample to be evaluated;
inputting the feature signature corresponding to the sample to be evaluated into a pre-trained offline model, and outputting the weight value of each feature of the sample to be evaluated through the offline model;
and calculating at least two evaluation indexes of each feature of the sample to be evaluated based on the weight value of each feature of the sample to be evaluated.
2. The method of claim 1, wherein the at least two assessment indicators comprise at least two of: the consistency of the sample, the coverage rate of the sample, the vacancy rate of the sample, the density of the sample, the statistic value of the sample, the use frequency of the sample, the proportion of the characteristic signature after the duplication removal and the importance of the characteristic.
3. The method according to claim 1, wherein the converting the at least one actually used field to obtain the feature signature corresponding to the sample to be evaluated comprises:
splicing and combining the at least one actually used field to obtain a plaintext corresponding to the at least one actually used field;
and obtaining a characteristic signature corresponding to the sample to be evaluated by passing the plaintext corresponding to the at least one actually used field through a preset signature algorithm.
4. The method according to claim 3, before the inputting the feature signature corresponding to the sample to be evaluated into the pre-trained offline model, the method further comprising:
inputting the corresponding relation between at least one predetermined sample field and at least one online prediction model and the corresponding relation between the at least one online prediction model and the characteristics of each online prediction model into the offline model; the corresponding relation between the at least one sample field and the at least one online prediction model is a many-to-many relation; the corresponding relation between the characteristics of each online prediction model and that online prediction model is a many-to-one relation.
5. The method according to claim 1, wherein the calculating at least two evaluation indicators of each feature of the sample to be evaluated based on the weight value of each feature of the sample to be evaluated comprises:
and inputting the weight value of each characteristic of the sample to be evaluated into a pre-trained index evaluation model, and outputting at least two evaluation indexes of each characteristic of the sample to be evaluated through the index evaluation model.
6. The method according to claim 5, wherein the inputting the weight values of the respective features of the sample to be evaluated into a pre-trained index evaluation model, and outputting at least two evaluation indexes of the respective features of the sample to be evaluated through the index evaluation model comprises:
extracting a node from the pre-trained index evaluation network as a current node; taking the feature signature corresponding to the sample to be evaluated as an input vector corresponding to the current node; inputting the input vector corresponding to the current node into the current node, outputting the input vector corresponding to the next node through the current node, and repeatedly executing the operation of extracting the current node until at least two evaluation indexes of each characteristic of the sample to be evaluated, which are output by the last node in the pre-trained index evaluation network, are obtained.
7. A feature quality assessment apparatus, the apparatus comprising: an extraction module, a conversion module, a first calculation module and a second calculation module; wherein,
the extraction module is used for extracting at least one actually used field from a sample to be evaluated;
the conversion module is used for converting the at least one actually used field to obtain a characteristic signature corresponding to the sample to be evaluated;
the first calculation module is used for inputting the feature signature corresponding to the sample to be evaluated into a pre-trained offline model, and outputting the weight value of each feature of the sample to be evaluated through the offline model;
the second calculating module is configured to calculate at least two evaluation indexes of each feature of the sample to be evaluated based on the weight value of each feature of the sample to be evaluated.
8. The apparatus of claim 7, the at least two assessment indicators comprising at least two of: the consistency of the sample, the coverage rate of the sample, the vacancy rate of the sample, the density of the sample, the statistic value of the sample, the use frequency of the sample, the proportion of the characteristic signature after the duplication removal and the importance of the characteristic.
9. The apparatus of claim 7, the conversion module comprising: a splicing combination submodule and a calculation submodule; wherein,
the splicing combination submodule is used for splicing and combining the at least one actually used field to obtain a plaintext corresponding to the at least one actually used field;
and the calculation submodule is used for passing the plaintext corresponding to the at least one actually used field through a preset signature algorithm to obtain a characteristic signature corresponding to the sample to be evaluated.
10. The apparatus of claim 9, the first computing module further configured to input a predetermined correspondence between at least one sample field and at least one online prediction model, and a correspondence between the at least one online prediction model and characteristics of each online prediction model into the offline model; the corresponding relation between the at least one sample field and the at least one online prediction model is a many-to-many relation; the corresponding relation between the characteristics of each online prediction model and that online prediction model is a many-to-one relation.
11. The apparatus according to claim 7, wherein the second calculating module is specifically configured to input a weight value of each feature of the sample to be evaluated into a pre-trained index evaluation model, and output at least two evaluation indexes of each feature of the sample to be evaluated through the index evaluation model.
12. The apparatus according to claim 11, wherein the second computing module is specifically configured to extract a node from the pre-trained metric evaluation network as a current node; taking the feature signature corresponding to the sample to be evaluated as an input vector corresponding to the current node; inputting the input vector corresponding to the current node into the current node, outputting the input vector corresponding to the next node through the current node, and repeatedly executing the operation of extracting the current node until at least two evaluation indexes of each characteristic of the sample to be evaluated, which are output by the last node in the pre-trained index evaluation network, are obtained.
13. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-6.
14. A non-transitory computer readable storage medium having stored thereon computer instructions for causing the computer to perform the method of any one of claims 1-6.
15. A computer program product comprising a computer program which, when executed by a processor, implements the method of any one of claims 1-6.
CN202011554160.2A 2020-12-24 2020-12-24 Feature quality assessment method and device, electronic equipment and storage medium Active CN112528159B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011554160.2A CN112528159B (en) 2020-12-24 2020-12-24 Feature quality assessment method and device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011554160.2A CN112528159B (en) 2020-12-24 2020-12-24 Feature quality assessment method and device, electronic equipment and storage medium

Publications (2)

Publication Number Publication Date
CN112528159A true CN112528159A (en) 2021-03-19
CN112528159B CN112528159B (en) 2024-03-26

Family

ID=74976247

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011554160.2A Active CN112528159B (en) 2020-12-24 2020-12-24 Feature quality assessment method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN112528159B (en)


Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20150039540A1 (en) * 2013-07-31 2015-02-05 International Business Machines Corporation Method and apparatus for evaluating predictive model
US20190286541A1 (en) * 2018-03-19 2019-09-19 International Business Machines Corporation Automatically determining accuracy of a predictive model
WO2020107509A1 (en) * 2018-11-27 2020-06-04 Beijing Didi Infinity Technology And Development Co., Ltd. Systems and methods for processing data from an online on-demand service platform
US20200193095A1 (en) * 2018-12-12 2020-06-18 Baidu Online Network Technology (Beijing) Co., Ltd. Method, apparatus, device and storage medium for evaluating quality of answer
CN110211119A (en) * 2019-06-04 2019-09-06 厦门美图之家科技有限公司 Image quality measure method, apparatus, electronic equipment and readable storage medium storing program for executing
CN111461306A (en) * 2020-03-31 2020-07-28 北京百度网讯科技有限公司 Feature evaluation method and device
CN111667010A (en) * 2020-06-08 2020-09-15 平安科技(深圳)有限公司 Sample evaluation method, device and equipment based on artificial intelligence and storage medium
CN111754126A (en) * 2020-06-29 2020-10-09 支付宝(杭州)信息技术有限公司 Method and system for evaluating applications

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113536921A (en) * 2021-06-11 2021-10-22 五邑大学 Arc welding quality evaluation method and device and storage medium
CN113536921B (en) * 2021-06-11 2023-10-20 五邑大学 Arc welding quality assessment method, device and storage medium
CN116187034A (en) * 2023-01-12 2023-05-30 中国航空发动机研究院 Uncertainty quantification-based compressor simulation credibility assessment method
CN116187034B (en) * 2023-01-12 2024-03-12 中国航空发动机研究院 Uncertainty quantification-based compressor simulation credibility assessment method

Also Published As

Publication number Publication date
CN112528159B (en) 2024-03-26


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant