WO2020259325A1 - Feature processing method applicable to machine learning, and device - Google Patents

Feature processing method applicable to machine learning, and device

Info

Publication number
WO2020259325A1
WO2020259325A1 PCT/CN2020/095934 CN2020095934W
Authority
WO
WIPO (PCT)
Prior art keywords
feature
relationship
processing
dependency
dependent
Prior art date
Application number
PCT/CN2020/095934
Other languages
French (fr)
Chinese (zh)
Inventor
兰冲
Original Assignee
深圳前海微众银行股份有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 深圳前海微众银行股份有限公司
Publication of WO2020259325A1 publication Critical patent/WO2020259325A1/en

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 - Information retrieval of structured data, e.g. relational data
    • G06F16/22 - Indexing; Data structures therefor; Storage structures
    • G06F16/2282 - Tablespace storage structures; Management thereof
    • G06F16/28 - Databases characterised by their database models, e.g. relational or object models
    • G06F16/284 - Relational databases

Definitions

  • the present invention relates to the technical field of financial technology (Fintech), and in particular to a feature processing method and device suitable for machine learning.
  • in the prior art, features are usually stored in a Hive data warehouse, which provides the ability to process features with SQL and to store them.
  • the embodiments of the present invention provide a feature processing method and device suitable for machine learning, which at least solve the prior-art problem that features and feature processing logic are not managed in a unified way, which reduces the accuracy of supervision data analysis.
  • an embodiment of the present invention provides a feature processing method suitable for machine learning, including:
  • the feature table at least comprises a feature list, an associated feature library, a dependent feature table, an associated business, and feature processing logic.
  • the feature list includes at least one feature, the dependent feature table is used to record the other feature tables that have a dependency relationship with each feature table, and the feature processing request includes the feature that needs to be processed;
  • parallel feature processing is performed on the feature tables that currently have no dependency relationship, so as to obtain training data from the features after parallel processing.
  • the data features in the database are saved in the form of a feature table.
  • the feature table includes multiple features and the processing logic of those features, and to facilitate feature processing, each feature table records the feature tables it depends on; in this way, the dependency relationships between feature tables can be maintained and reused, reducing the cost of repeatedly computing those dependency relationships.
  • the features are managed through the feature table, which clearly expresses the dependencies between features, brings convenience to feature addition, deletion, and maintenance, and makes subsequent training data more accurate, thereby improving the accuracy of regulatory data analysis.
  • the determining the feature dependency relationship according to the feature to be processed and each feature in the feature pool includes:
  • the feature to be processed is taken as the root node, and the feature table that has a direct dependency relationship or an indirect dependency relationship with the root node is taken as an upper node to construct a feature dependency tree.
  • the dependency relationships between features can be better organized in the form of a feature dependency tree, which is convenient for feature processing and feature management.
  • the determining, based on the feature dependency relationship, the feature tables that currently have no dependency, and the adding of those feature tables to the feature processing path as a parallel subtask, includes:
  • the method further includes:
  • the processed features are passed through multiple consecutive processing steps to obtain machine features.
  • an embodiment of the present invention provides a feature processing device suitable for machine learning, including:
  • the obtaining unit is used to obtain the feature processing request corresponding to the data, and construct a feature pool according to each feature in each feature table.
  • the feature table is at least composed of a feature list, an associated feature library, a dependent feature table, an associated business, and feature processing logic,
  • the feature list includes at least one feature
  • the dependent feature table is used to record other feature tables that have a dependency relationship with each feature table
  • the feature processing request includes the feature to be processed;
  • the feature processing path determination unit is used to determine the feature dependency relationship according to the feature to be processed and each feature in the feature pool, determine from the feature dependency relationship the feature tables that currently have no dependency, and add those feature tables to the feature processing path as a parallel subtask;
  • the feature processing unit is configured to perform parallel feature processing on the feature tables with no dependency relationship according to the feature processing path, so as to obtain training data based on the parallel-processed features.
  • the feature processing path determining unit is specifically configured to:
  • the feature to be processed is taken as the root node, and the feature table that has a direct dependency relationship or an indirect dependency relationship with the root node is taken as an upper node to construct a feature dependency tree.
  • the feature processing path determining unit is specifically configured to:
  • the feature processing unit is further configured to:
  • the processed features are passed through multiple consecutive processing steps to obtain machine features.
  • an embodiment of the present invention provides a computer device, including a memory, a processor, and a computer program stored in the memory and runnable on the processor; when the processor executes the computer program, the steps of the above feature processing method suitable for machine learning are implemented.
  • an embodiment of the present invention provides a computer-readable storage medium that stores a computer program executable by a computer device; when the program runs on the computer device, the computer device executes the steps of the above feature processing method suitable for machine learning.
  • FIG. 1 is a schematic flowchart of a feature processing method suitable for machine learning according to an embodiment of the present invention
  • FIG. 2 is a schematic diagram of a feature management structure provided by an embodiment of the present invention.
  • FIG. 3 is a schematic diagram of service level management of a feature table provided by an embodiment of the present invention.
  • Figure 4 is a schematic diagram of a feature dependency tree provided by an embodiment of the present invention.
  • FIG. 5 is a schematic flowchart of a feature processing pipeline provided by an embodiment of the present invention.
  • FIG. 6 is a schematic flowchart of a feature processing method suitable for machine learning according to an embodiment of the present invention.
  • FIG. 7 is a schematic structural diagram of a feature processing device suitable for machine learning according to an embodiment of the present invention.
  • FIG. 8 is a schematic structural diagram of a computer device provided by an embodiment of the present invention.
  • Feature engineering: the process of obtaining, sorting, and processing, from data, features that can be understood and easily handled by computer programs. Its main purpose is to provide input data for machine learning training, evaluation, and prediction.
  • Machine learning: the process by which a computer program automatically analyzes data to obtain rules and uses those rules to predict unknown data.
  • Missing value processing: the handling applied when feature data is missing, such as filling with 0.
  • Machine features: features processed into the form required by machine learning algorithms.
  • One-hot code: a code that maps the multiple values of a feature into multiple bits; the bit corresponding to the feature's value is 1, and the other bits are 0.
  • Topological sorting: a sorting algorithm that ranks the elements with no dependencies first.
  • In-degree: for a node in a directed graph, the number of edges pointing to the node.
  • Out-degree: for a node in a directed graph, the number of edges pointing from the node to other nodes.
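The definitions above (topological sorting, in-degree, out-degree) can be illustrated with a minimal Python sketch of Kahn's algorithm; the table names and edge format here are hypothetical, not prescribed by the patent:

```python
from collections import deque

def topological_sort(edges, nodes):
    """Kahn's algorithm: repeatedly output nodes whose in-degree is zero.

    edges is a list of (u, v) pairs meaning "v depends on u" (an edge u -> v);
    the result lists every dependency before its dependents."""
    in_degree = {n: 0 for n in nodes}
    out_edges = {n: [] for n in nodes}
    for u, v in edges:
        out_edges[u].append(v)
        in_degree[v] += 1
    queue = deque(n for n in nodes if in_degree[n] == 0)
    order = []
    while queue:
        n = queue.popleft()
        order.append(n)
        for m in out_edges[n]:
            in_degree[m] -= 1  # delete the in-degree contributed by n
            if in_degree[m] == 0:
                queue.append(m)
    if len(order) != len(nodes):
        raise ValueError("dependency cycle detected")
    return order

# e.g. feature table 1 depends on feature table 4, table 2 on table 5:
order = topological_sort([("t4", "t1"), ("t5", "t2")], ["t1", "t2", "t4", "t5"])
```

Elements with no dependencies (in-degree zero) are emitted first, matching the glossary definition of topological sorting.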
  • Machine learning in the prior art usually requires some training data, which is determined through feature engineering.
  • in the prior art, the Hive data warehouse is usually used to store features, providing the ability to process features with SQL and to store them.
  • an embodiment of the present invention provides a feature processing method suitable for machine learning, as shown in FIG. 1 specifically, including the following steps:
  • Step S101: in the embodiment of the present invention, if a feature processing request corresponding to the data is obtained, feature processing is performed through a feature pool composed of the features related to the feature processing request.
  • These features are defined in the form of a feature table, and the feature table is at least composed of a feature list, a feature library, a dependent feature table, a business, and feature processing logic.
  • the feature list includes at least one feature, the dependent feature table is used to record other feature tables that have a dependency relationship with each feature table, and the feature processing request includes the feature to be processed.
  • the feature processing request corresponding to the data may be a request for extracting and processing some features and related data, for example, a residential area suitable for the elderly, a residential area suitable for office workers, and so on.
  • the original data is obtained.
  • the original data can include information provided by BD maps, SG maps, etc.
  • the features and attributes can be community facility information, traffic convenience, etc.
  • these features are then modeled by technologies such as statistical models or machine learning models. The feature processing process can be divided into two stages: in the first stage, the original data is processed into natural features.
  • natural features focus on the meaning of the feature itself, such as a customer's age, occupation, and annual income, or the size of a company's staff and its office location. Some natural features can be obtained directly from the original data, while others require complex processing logic. In the second stage, natural features are processed into machine features.
  • the processing of machine features depends on the input requirements of the machine learning algorithm, and different algorithms require different processing. For example, deep learning algorithms often need categorical attributes processed into one-hot codes, while decision tree algorithms can process categorical attributes directly.
  • the features are stored in the database through the feature table.
  • the feature table t includes multiple natural features f.
  • the feature library in the embodiment of the present invention may or may not correspond to a library in the data warehouse.
  • likewise, the feature table in the embodiment of the present invention may or may not correspond to a table in the data warehouse; there is no logical dependency between them.
  • the multiple features included in a feature table are defined in the form of a feature list; that is, each feature table includes part of the overall feature list, that part includes at least one feature, and the representation of a feature in the feature list can be as shown in Table 1.
  • Table 1 is only one way of identifying features.
  • elements of the feature representation can also be deleted or added.
  • the feature table in addition to the feature list, also includes a dependent feature table.
  • for example, the feature table includes feature A, and feature A has a dependency relationship with feature B.
  • feature B belongs to feature table B, so feature table B is recorded in the dependent feature table.
  • the feature table also records the associated feature library, the associated business, and the feature processing logic.
  • the associated feature library indicates which library the feature table belongs to
  • the associated business indicates which business the features in the feature table belong to.
  • the embodiment of the present invention provides a general service level division method, although other division methods are also possible.
  • the service level can be regarded as a mark of the feature table, and the same feature table can carry multiple service marks of the same level, such as feature table 3: application 1, application 2.
  • marks of different levels on one feature table, such as feature table 1: model 1, application 2, are not allowed.
  • the processing logic includes a processing program and a program configuration; the processing program can be an SQL statement or another program that can run in a specific environment, and the program configuration must be completed before running.
  • the processing program is only responsible for processing the features and does not care how the target feature data is saved. For example, if the processing program is SQL, it will not contain logic like insert into [target table] or insert overwrite [target table]; instead, the saving behavior of features is controlled and tracked by the runtime system.
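As an illustrative sketch only (all field values, table names, and the SQL text are hypothetical, and the patent does not prescribe any concrete data structure), the five parts of a feature table described above could be modeled as:

```python
from dataclasses import dataclass
from typing import List

@dataclass
class FeatureTable:
    name: str
    feature_list: List[str]      # at least one feature
    feature_library: str         # the library this table belongs to
    dependent_tables: List[str]  # other feature tables this table depends on
    business: List[str]          # business marks of the same level
    processing_sql: str          # transformation only; saving is the runtime's job

# hypothetical example: feature table 1 depends on feature table 4
table1 = FeatureTable(
    name="feature_table_1",
    feature_list=["near_tertiary_hospital"],
    feature_library="community_features",
    dependent_tables=["feature_table_4"],
    business=["application 1", "application 2"],
    processing_sql="SELECT id, near_hospital FROM feature_table_4 WHERE grade = 3",
)
# per the description, the processing program contains no INSERT logic
assert "INSERT" not in table1.processing_sql.upper()
```

The assertion mirrors the point above: the processing program transforms features, while persisting the result is controlled and tracked by the runtime system.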
  • Step S102: determine the feature dependency relationship according to the feature to be processed and each feature in the feature pool, determine from the feature dependency relationship the feature tables that currently have no dependency, and add those feature tables to the feature processing path as a parallel subtask.
  • the feature tables that currently have no dependency can be determined from the dependency relationships, and those feature tables are added to the feature processing path as a subtask of parallel processing. In this way, the feature processing path can be determined, thereby improving feature processing efficiency and facilitating feature management.
  • the feature dependency relationship can be determined layer by layer. For example, the feature to be processed is feature A, and feature A is stored in feature table 1, feature table 2, and feature table 3; feature table 1 has a dependency relationship with feature table 4, feature table 2 with feature table 5, and feature table 3 with feature table 6. Therefore, when processing feature A, feature table 4, feature table 5, and feature table 6 must be processed first, and then feature table 1, feature table 2, and feature table 3.
  • the foregoing may be determined by a simple topological sorting method; that is, the processing path is obtained by dependency-sorting the features to be processed.
  • the feature that needs to be processed is used as the root node, and every feature table that has a direct or indirect dependency relationship with the root node is used as an upper node to build a feature dependency tree. In other words, take the feature to be processed as the root node and gradually add nodes upward to form a tree of dependencies.
  • for example, the feature to be processed is feature A, and feature A is stored in feature table 1, feature table 2, and feature table 3; feature table 1 has a dependency relationship with feature table 4, feature table 2 with feature table 5, and feature table 3 with feature table 6; feature table 4, feature table 5, and feature table 6 are in turn obtained by processing the features in original table 1, original table 2, and original table 3. This forms the dependency tree shown in Figure 4.
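The dependency tree of this example can be sketched as a nested mapping; the exact edges below are a simplified, assumed reading of the Figure 4 example (one original table per feature table), not the patent's authoritative structure:

```python
def build_dependency_tree(root, deps):
    """Return a nested dict: each key is a table the parent directly depends
    on, mapped to that table's own dependency subtree (empty dict at leaves)."""
    return {d: build_dependency_tree(d, deps) for d in deps.get(root, [])}

# simplified edges assumed from the Figure 4 example
deps = {
    "feature_A": ["feature_table_1", "feature_table_2", "feature_table_3"],
    "feature_table_1": ["feature_table_4"],
    "feature_table_2": ["feature_table_5"],
    "feature_table_3": ["feature_table_6"],
    "feature_table_4": ["original_table_1"],
    "feature_table_5": ["original_table_2"],
    "feature_table_6": ["original_table_3"],
}
tree = {"feature_A": build_dependency_tree("feature_A", deps)}
```

Here feature A is the root, its upper nodes are the tables it directly depends on, and each level above records that table's own dependencies, matching the layer-by-layer construction described above.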
  • the processing sequence between features can be quickly determined through topological sorting.
  • one batch of feature tables can be processed in parallel to improve the efficiency of feature processing.
  • original table 1, original table 2, and original table 3 can form one batch, since there is no dependency relationship among them.
  • the information provided by the BD map is stored in original table 1, and the information provided by the BD map is relatively complete; the information provided by the SG map is stored in original table 2, in which the school information is relatively complete; original table 3 stores the information provided by the WW map, in which the supermarket information is relatively complete.
  • original table 1, original table 2, and original table 3 can be processed at the same time to obtain feature table 4, i.e. the specific information of each hospital (public or private, whether it is a tertiary hospital, etc.); feature table 5, i.e. the specific information of each school (public or private; whether it is a university, middle school, or elementary school; whether it is a national key school; etc.); and feature table 6, i.e. the specific information of each supermarket (whether it focuses on fresh food or daily necessities, its star ratings for service quality and product quality, and so on).
  • feature table 4, feature table 5, and feature table 6 can then be processed at the same time to obtain feature table 1, i.e. information on the neighborhoods near tertiary hospitals; feature table 2, i.e. information on the neighborhoods near 211 universities; and feature table 3, i.e. information on the neighborhoods near supermarkets with five-star service quality and product quality. In this way, the feature A that needs to be processed, i.e. residential areas suitable for the elderly, can be determined.
  • a feature processing path generation algorithm which specifically includes:
  • while the set C is not empty, the current traversal is S1 and C1; for example, S1 is feature table 4 and C1 is original table 1, and original table 1 is a dependency table of feature table 4, so one in-degree of feature table 4 is deleted.
  • continuing the traversal, S2 is feature table 6 and C2 is original table 2; these two tables have no dependency relationship, so nothing is deleted.
  • continuing, S3 is feature table 5 and C3 is original table 2; original table 2 is a dependency table of feature table 5, so one in-degree of feature table 5 is deleted, and so on, until the in-degree of feature table 6 is also deleted.
  • the C table is then updated to hold the feature tables whose in-degrees have been fully deleted; those feature tables form a parallel subtask, the C table is cleared, and the above steps are repeated until the in-degrees of all feature tables have been deleted, forming multiple parallel subtasks.
  • each step of the algorithm groups the tables that currently have no dependencies into a parallel subtask, and parallel operation speeds up overall execution.
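A hedged Python sketch of this batching idea, equivalent to a layered topological sort that deletes in-degrees batch by batch (table names are illustrative shorthand, not the patent's identifiers):

```python
def parallel_batches(deps):
    """deps maps each table to the list of tables it depends on. Returns a
    list of batches; every table in a batch has no remaining dependencies,
    so each batch can be run as a single parallel subtask."""
    remaining = {t: set(d) for t, d in deps.items()}
    batches = []
    while remaining:
        ready = sorted(t for t, d in remaining.items() if not d)
        if not ready:
            raise ValueError("dependency cycle detected")
        batches.append(ready)
        for t in ready:
            del remaining[t]
        for d in remaining.values():
            d.difference_update(ready)  # delete the in-degrees these tables contributed
    return batches

# illustrative names following the Figure 4 example: originals o1-o3,
# feature tables t4-t6 built from them, feature tables t1-t3 built from those
deps = {
    "o1": [], "o2": [], "o3": [],
    "t4": ["o1"], "t5": ["o2"], "t6": ["o3"],
    "t1": ["t4"], "t2": ["t5"], "t3": ["t6"],
}
batches = parallel_batches(deps)
# → [["o1", "o2", "o3"], ["t4", "t5", "t6"], ["t1", "t2", "t3"]]
```

Each returned batch is one parallel subtask: its tables have no remaining dependencies on unprocessed tables, so they can run concurrently.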
  • Step S103: perform parallel feature processing on the feature tables with no dependency relationship according to the feature processing path, so as to obtain training data based on the parallel-processed features.
  • the required feature can be processed through the determined feature processing path. This completes the first stage, in which original data is processed into natural features; the natural features then need to be processed into machine features.
  • machine features can be obtained through multiple consecutive processing steps, and after each processing step the result can be saved to facilitate subsequent feature reuse. For example, in the embodiment of the present invention, consider the categorical attribute of whether the customer smokes: yes
  • the process of obtaining machine features through multiple consecutive processing steps may also be referred to as a machine feature processing pipeline.
  • the processing from natural features to machine features is performed in the dimension of a single feature.
  • multiple features can also share a pipeline.
  • multiple processing steps form a processing pipeline: each step receives the output of the previous step and, after processing, outputs to the next step. Each step may or may not output a step state.
  • each step in the pipeline needs to support processing one or more features, because there may be only one feature at the pipeline's input, but an intermediate step may turn one feature into multiple features.
  • the one-hot code adds a feature for each value of the original feature. For example, the feature of whether a customer smokes is treated as two features: the customer smokes, and the customer does not smoke.
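A minimal sketch of this one-hot treatment, using the two-value smoking example from the description:

```python
def one_hot(value, categories):
    """Map one categorical value into len(categories) bits: the bit for the
    matching category is 1 and every other bit is 0, so a single feature
    becomes multiple machine features."""
    return [1 if value == category else 0 for category in categories]

# the "customer smokes" feature becomes two machine features
assert one_hot("yes", ["yes", "no"]) == [1, 0]  # customer smokes
assert one_hot("no", ["yes", "no"]) == [0, 1]   # customer does not smoke
```

This is also why a pipeline step must support multiple features: after this step, one input feature has become two.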
  • the intermediate state of feature processing can be saved, and intermediate steps can be configured in a custom way.
  • for example, if the normalization step in the above example is configured with a mean and a variance, the state of the normalization step can be saved for feature reuse.
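A minimal sketch of a pipeline step that saves its intermediate state, here a normalization step keeping mean and variance (class and method names are hypothetical; the patent does not specify an API):

```python
class NormalizeStep:
    """A pipeline step that standardizes values and saves its state (mean and
    variance) so the transformation can be reused on later data."""
    def __init__(self):
        self.state = None

    def run(self, values):
        mean = sum(values) / len(values)
        variance = sum((v - mean) ** 2 for v in values) / len(values)
        self.state = {"mean": mean, "variance": variance}
        std = variance ** 0.5 or 1.0  # avoid dividing by zero for constant input
        return [(v - mean) / std for v in values]

def run_pipeline(steps, values):
    """Each step receives the previous step's output; every step's saved
    state is collected so intermediate results can be reused later."""
    states = []
    for step in steps:
        values = step.run(values)
        states.append(step.state)
    return values, states

normalized, states = run_pipeline([NormalizeStep()], [1.0, 3.0])
# → normalized == [-1.0, 1.0], states[0] == {"mean": 2.0, "variance": 1.0}
```

The collected states are the saved intermediate results described above: applying the same saved mean and variance to new data reuses the step without recomputing it.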
  • the feature processing method suitable for machine learning provided by the embodiments of the present application will be described below in conjunction with a specific implementation scenario.
  • the method is used to extract feature S, which is located in feature table 1.
  • Feature table 1 has an associated relationship with feature table 2, feature table 3, and feature table 4, and feature table 2 has an associated relationship with feature table 5 and feature table 6, as shown in Figure 6:
  • Step S601 Obtain a characteristic processing request corresponding to the data
  • Step S602 construct a feature pool with features in feature table 1, feature table 2, feature table 3, feature table 4, feature table 5, and feature table 6;
  • Step S603 Construct a dependency tree for the features in the feature pool.
  • the dependency tree can be embodied as: feature S is the root node; the upper node of the root node is feature table 1; the upper nodes of feature table 1 are feature table 2, feature table 3, and feature table 4; and the upper nodes of feature table 2 are feature table 5 and feature table 6;
  • Step S604: find the set of tables that currently have no dependencies, and delete from the dependency graph the associations involving the tables in that set, so as to generate the next batch of non-dependent tables, until all tables have been added to the processing sequence, obtaining the processing sequence:
  • feature table 5, feature table 6 > feature table 2, feature table 3, feature table 4 > feature table 1;
  • Step S605 performing feature processing according to the processing sequence to obtain feature S;
  • Step S606: pass feature S through multiple steps to obtain machine feature T, and save the feature results of the multiple steps.
  • the device 700 includes:
  • the obtaining unit 701 is configured to obtain a feature processing request corresponding to the data, and construct a feature pool according to each feature in each feature table.
  • the feature table is at least composed of a feature list, an associated feature library, a dependent feature table, an associated business, and feature processing logic.
  • the feature list includes at least one feature
  • the dependent feature table is used to record other feature tables that have a dependency relationship with each feature table
  • the feature processing request includes features that need to be processed
  • the feature processing path determination unit 702 is configured to determine the feature dependency relationship according to the feature to be processed and each feature in the feature pool, determine from the feature dependency relationship the feature tables that currently have no dependency, and add those feature tables to the feature processing path as a parallel subtask;
  • the feature processing unit 703 is configured to perform parallel feature processing on the feature tables with no dependency relationship according to the feature processing path, so as to obtain training data based on the parallel-processed features.
  • the characteristic processing path determining unit 702 is specifically configured to:
  • the feature to be processed is taken as the root node, and the feature table that has a direct dependency relationship or an indirect dependency relationship with the root node is taken as an upper node to construct a feature dependency tree.
  • the characteristic processing path determining unit 702 is specifically configured to:
  • the feature processing unit 703 is further configured to:
  • the processed features are passed through multiple consecutive processing steps to obtain machine features.
  • an embodiment of the present application provides a computer device, as shown in FIG. 8, including at least one processor 801 and a memory 802 connected to the at least one processor.
  • the embodiment of the present application does not limit the specific connection between the processor 801 and the memory 802; the connection between the processor 801 and the memory 802 shown in FIG. 8 is taken as an example.
  • the bus can be divided into address bus, data bus, control bus, etc.
  • the memory 802 stores instructions executable by the at least one processor 801; by executing the instructions stored in the memory 802, the at least one processor 801 can execute the steps of the aforementioned feature processing method suitable for machine learning.
  • the processor 801 is the control center of the computer device; it can use various interfaces and lines to connect the various parts of the terminal device, and obtain the client address by running or executing the instructions stored in the memory 802 and calling the data stored in the memory 802.
  • the processor 801 may include one or more processing units, and the processor 801 may integrate an application processor and a modem processor.
  • the application processor mainly processes the operating system, user interface, and application programs.
  • the modem processor mainly handles wireless communication. It can be understood that the foregoing modem processor may also not be integrated into the processor 801.
  • the processor 801 and the memory 802 may be implemented on the same chip, and in some embodiments, they may also be implemented on separate chips.
  • the processor 801 may be a general-purpose processor, such as a central processing unit (CPU), a digital signal processor, an application-specific integrated circuit (ASIC), a field-programmable gate array or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component, and can implement or execute the methods, steps, and logic block diagrams disclosed in the embodiments of the present application.
  • the general-purpose processor may be a microprocessor or any conventional processor. The steps of the method disclosed in the embodiments of the present application may be directly embodied as being executed and completed by a hardware processor, or executed and completed by a combination of hardware and software modules in the processor.
  • the memory 802 can be used to store non-volatile software programs, non-volatile computer-executable programs, and modules.
  • the memory 802 may include at least one type of storage medium, for example, may include flash memory, hard disk, multimedia card, card-type memory, random access memory (Random Access Memory, RAM), static random access memory (Static Random Access Memory, SRAM), Programmable Read Only Memory (PROM), Read Only Memory (ROM), Electrically Erasable Programmable Read-Only Memory (EEPROM), magnetic memory, disk , CD, etc.
  • the memory 802 may also be any other medium that can be used to carry or store the desired program code in the form of instructions or data structures and that can be accessed by a computer, but is not limited thereto.
  • the memory 802 in the embodiment of the present application may also be a circuit or any other device capable of realizing a storage function for storing program instructions and/or data.
  • the embodiments of the present application provide a computer-readable storage medium that stores a computer program executable by a computer device; when the program runs on the computer device, the computer device can execute the steps of the above feature processing method suitable for machine learning.
  • a person of ordinary skill in the art can understand that all or part of the steps in the above method embodiments can be implemented by a program instructing relevant hardware.
  • the foregoing program can be stored in a computer-readable storage medium; when the program is executed, the steps of the foregoing method embodiments are performed; and the foregoing storage medium includes: removable storage devices, read-only memory (ROM, Read-Only Memory), random access memory (RAM, Random Access Memory), magnetic disks, optical disks, and other media that can store program code.
  • when the above-mentioned integrated unit of this application is implemented in the form of a software function module and sold or used as an independent product, it can also be stored in a computer-readable storage medium.
  • the computer software product is stored in a storage medium and includes several instructions for A computer device (which may be a personal computer, a server, or a network device, etc.) executes all or part of the methods described in the various embodiments of the present application.
  • the aforementioned storage media include media that can store program code, such as removable storage devices, ROMs, RAMs, magnetic disks, or optical discs.

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Stored Programmes (AREA)

Abstract

A feature processing method applicable to machine learning, and a device. The method comprises: upon acquiring a feature processing request corresponding to data, constructing a feature pool according to the features in each feature table, wherein each feature table at least consists of a feature list, a feature library to which the table belongs, a dependency feature table, a business to which the table belongs, and feature processing logic; the feature list comprises one or more features; the dependency feature table is used to record other feature tables having a dependency relationship with the feature table; and the feature processing request comprises the features to be processed (S101); determining feature dependency relationships according to the features to be processed and the features in the feature pool, and determining a feature processing path according to the feature dependency relationships (S102); and performing feature processing according to the feature processing path, so as to obtain training data from the features that have undergone concurrent processing (S103).

Description

Feature processing method and device suitable for machine learning
Cross-reference to related applications
This application claims priority to the Chinese patent application No. 201910562484.1, filed with the Chinese Patent Office on June 26, 2019 and entitled "Feature processing method and device suitable for machine learning", which is incorporated herein by reference in its entirety.
Technical field
The present invention relates to the field of financial technology (Fintech), and in particular to a feature processing method and device suitable for machine learning.
Background
With the development of computer technology, more and more technologies are being applied in the financial field, and the traditional financial industry is gradually transforming toward financial technology (Fintech). Feature processing technology is no exception; however, the security and real-time requirements of the financial industry also impose higher requirements on the technology.
In current supervision systems, a large amount of data usually needs to be analyzed. Generally, the analysis is performed through the feature engineering of a machine learning algorithm model. Feature engineering is the process of transforming raw data into features that better describe the underlying problem to the predictive model, thereby improving the model's accuracy on unseen data. In the prior art, a Hive data warehouse is usually used to store features, and the data warehouse can provide the capabilities of processing features with SQL and of storing features. However, the prior art has no unified management of features and feature processing logic, and the dependency relationships between features cannot be clearly expressed, which brings inconvenience to feature addition, deletion, and maintenance. As a result, the training data in the algorithm model is inaccurate, which correspondingly reduces the accuracy of the analysis of supervision data.
Summary of the invention
In view of this, the embodiments of the present invention provide a feature processing method and device suitable for machine learning, which at least solve the prior-art problem that the lack of unified management of features and feature processing logic reduces the accuracy of supervision data analysis.
In one aspect, an embodiment of the present invention provides a feature processing method suitable for machine learning, including:
after a feature processing request corresponding to data is obtained, constructing a feature pool according to the features in each feature table, where each feature table is at least composed of a feature list, a feature library to which the table belongs, a dependency feature table, a business to which the table belongs, and feature processing logic; the feature list includes at least one feature; the dependency feature table is used to record other feature tables that have a dependency relationship with the feature table; and the feature processing request includes the features to be processed;
determining feature dependency relationships according to the features to be processed and the features in the feature pool, determining, according to the feature dependency relationships, the feature tables that currently have no dependency, and adding the feature tables that currently have no dependency to a feature processing path as parallel subtasks;
performing parallel feature processing on the feature tables that currently have no dependency according to the feature processing path, so as to obtain training data according to the features after the parallel processing.
In the embodiments of the present invention, the data features in the database are saved in the form of feature tables. A feature table includes multiple features and the processing logic of these features and, to facilitate feature processing, also records the feature tables with which it has a dependency relationship. In this way, the dependency relationships between feature tables can be maintained and reused, reducing the cost of repeatedly computing them. When a feature processing task is performed, the features in all feature tables used in the task are built into feature dependency relationships; the feature tables that currently have no dependency are determined from these relationships and added to the feature processing path as parallel subtasks, and parallel feature processing is performed on them according to the feature processing path. Multiple feature tables are thus processed in parallel, which speeds up feature processing. Moreover, because features are managed through feature tables, the dependency relationships between features can be clearly expressed, which brings convenience to feature addition, deletion, and maintenance, makes the subsequent training data more accurate, and thereby improves the accuracy of supervision data analysis.
Optionally, the determining feature dependency relationships according to the features to be processed and the features in the feature pool includes:
taking the features to be processed as the root node, and taking the feature tables that have a direct or indirect dependency relationship with the root node as upper-layer nodes, to construct a feature dependency tree.
In the embodiments of the present invention, the dependency relationships between features can be better sorted out in the form of a feature dependency tree, which facilitates feature processing.
Optionally, the determining, according to the feature dependency relationships, the feature tables that currently have no dependency, and adding the feature tables that currently have no dependency to the feature processing path as parallel subtasks includes:
determining the feature tables in the feature dependency tree that currently have no dependency, adding them as parallel subtasks to a first processing path in a feature processing path table, deleting the associations between these feature tables and the other feature tables in the feature dependency tree, and returning to the step of determining the feature tables in the feature dependency tree that currently have no dependency, so as to add the feature tables that then have no dependency as parallel subtasks to a second processing path in the feature processing path table, until all feature tables in the dependency tree have been added to the feature processing path table.
In the embodiments of the present invention, by progressively determining the processing sequence within the feature dependency tree, multiple feature tables can be processed at the same time and the processing order between features can be sorted out, which improves the efficiency of feature processing.
Optionally, after the parallel feature processing is performed on the feature tables that currently have no dependency according to the feature processing path, the method further includes:
passing the processed features through multiple consecutive processing steps to obtain machine features.
In the embodiments of the present invention, the multiple consecutive processing steps allow multiple intermediate states to exist in the feature processing project; any step can be modified through configuration without modifying the other steps, and the feature processing results of the intermediate states can be used flexibly.
In one aspect, an embodiment of the present invention provides a feature processing device suitable for machine learning, including:
an obtaining unit, configured to: after a feature processing request corresponding to data is obtained, construct a feature pool according to the features in each feature table, where each feature table is at least composed of a feature list, a feature library to which the table belongs, a dependency feature table, a business to which the table belongs, and feature processing logic; the feature list includes at least one feature; the dependency feature table is used to record other feature tables that have a dependency relationship with the feature table; and the feature processing request includes the features to be processed;
a feature processing path determining unit, configured to determine feature dependency relationships according to the features to be processed and the features in the feature pool, determine, according to the feature dependency relationships, the feature tables that currently have no dependency, and add the feature tables that currently have no dependency to a feature processing path as parallel subtasks;
a feature processing unit, configured to perform parallel feature processing on the feature tables that currently have no dependency according to the feature processing path, so as to obtain training data based on the features after the parallel processing.
Optionally, the feature processing path determining unit is specifically configured to:
take the features to be processed as the root node, and take the feature tables that have a direct or indirect dependency relationship with the root node as upper-layer nodes, to construct a feature dependency tree.
Optionally, the feature processing path determining unit is specifically configured to:
determine the feature tables in the feature dependency tree that currently have no dependency, add them as parallel subtasks to a first processing path in a feature processing path table, delete the associations between these feature tables and the other feature tables in the feature dependency tree, and return to the step of determining the feature tables in the feature dependency tree that currently have no dependency, so as to add the feature tables that then have no dependency as parallel subtasks to a second processing path in the feature processing path table, until all feature tables in the dependency tree have been added to the feature processing path table.
Optionally, the feature processing unit is further configured to:
pass the processed features through multiple consecutive processing steps to obtain machine features.
In one aspect, an embodiment of the present invention provides a computer device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor, when executing the computer program, implements the steps of the feature processing method suitable for machine learning.
In one aspect, an embodiment of the present invention provides a computer-readable storage medium storing a computer program executable by a computer device, where, when the program runs on the computer device, the computer device is caused to perform the steps of the feature processing method suitable for machine learning.
Brief description of the drawings
FIG. 1 is a schematic flowchart of a feature processing method suitable for machine learning according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of a feature management structure according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of business-level management of feature tables according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of a feature dependency tree according to an embodiment of the present invention;
FIG. 5 is a schematic flowchart of a feature processing pipeline according to an embodiment of the present invention;
FIG. 6 is a schematic flowchart of a feature processing method suitable for machine learning according to an embodiment of the present invention;
FIG. 7 is a schematic structural diagram of a feature processing device suitable for machine learning according to an embodiment of the present invention;
FIG. 8 is a schematic structural diagram of a computer device according to an embodiment of the present invention.
Detailed description of the embodiments
To make the purpose, technical solutions, and beneficial effects of this application clearer, this application is further described in detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described here are only used to explain this application, not to limit it.
To facilitate the understanding of the embodiments in this specification, some terms are explained first.
Feature engineering: the process of obtaining, organizing, and processing, from data, features that a computer program can understand and conveniently handle; its main purpose is to provide input data for machine learning training, evaluation, and prediction.
Machine learning: the process by which a computer program automatically analyzes data to obtain patterns and uses the patterns to make predictions on unknown data.
Normalization: the process of mapping values to the interval [0, 1].
Missing value processing: how feature data is handled when it is missing, e.g. filling with 0.
Natural features: features that humans can understand.
Machine features: features processed by machine learning algorithms.
One-hot code: one-hot encoding maps the multiple values of a feature to multiple bits; the bit corresponding to the feature value is 1, and all other bits are 0.
Topological sorting: a sorting algorithm that places elements with no dependencies first.
In-degree: for a node in a directed graph, the number of edges pointing to that node.
Out-degree: for a node in a directed graph, the number of edges from that node to other nodes.
Machine learning in the prior art usually requires training data, which is determined through feature engineering. In the prior art, a Hive data warehouse is usually used to store features, and the data warehouse can provide the capabilities of processing features with SQL and of storing features. However, the prior art has no unified management of features and feature processing logic, and the dependency relationships between features cannot be clearly expressed, which brings inconvenience to feature addition, deletion, and maintenance.
Based on the problems in the prior art, an embodiment of the present invention provides a feature processing method suitable for machine learning. As shown in FIG. 1, the method includes the following steps.
Step S101: in the embodiments of the present invention, after the feature processing request corresponding to the data is obtained, feature processing is performed through a feature pool composed of the features related to the feature processing request. These features are defined in the form of feature tables, where each feature table is at least composed of a feature list, a feature library to which the table belongs, a dependency feature table, a business to which the table belongs, and feature processing logic; the feature list includes at least one feature; the dependency feature table is used to record other feature tables that have a dependency relationship with the feature table; and the feature processing request includes the features to be processed.
Specifically, in the embodiments of the present invention, the feature processing request corresponding to the data may be a request to extract and process certain features and related data, for example, residential areas suitable for the elderly, residential areas suitable for office workers, and so on. Generally, raw data is first obtained; in the above example, the raw data may include information provided by the BD map, the SG map, and the like. Data processing technologies are then used to obtain, process, and extract meaningful features and attributes from the data; in the above example, the features and attributes may be community facility information, whether transportation is convenient, and so on. Finally, these features are usually modeled with technologies such as statistical models or machine learning models. The feature processing process can be divided into two stages. In the first stage, the raw data is processed into natural features, which focus on the meaning of the feature itself, such as a customer's age, occupation, and annual income, or a company's number of employees and office location. Some natural features can be obtained directly from the raw data, while others require complex processing logic. In the second stage, the natural features are processed into machine features; how machine features are processed depends on the input requirements of the machine learning algorithm, and different algorithms require different processing. For example, deep learning algorithms often require categorical attributes to be processed into one-hot codes, whereas decision tree algorithms can handle categorical attributes directly.
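The two stages can be sketched minimally as follows (Python for illustration only; the field names, occupation list, and normalization are hypothetical simplifications, not the patent's actual processing logic):

```python
# Stage 1: raw data -> natural features (human-understandable).
def to_natural(raw, current_year=2020):
    return {
        "age": current_year - raw["birth_year"],
        "occupation": raw["occupation"],
    }

# Stage 2: natural features -> machine features. The encoding depends on the
# target algorithm: deep learning typically needs one-hot codes, while a
# decision tree can consume the categorical value directly.
OCCUPATIONS = ["teacher", "engineer", "doctor"]  # hypothetical category list

def to_machine(natural, algo="deep"):
    if algo == "deep":
        onehot = [1 if natural["occupation"] == o else 0 for o in OCCUPATIONS]
        return [natural["age"] / 100.0] + onehot   # crude normalization to [0, 1]
    return [natural["age"], natural["occupation"]]  # tree: raw category is fine
```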
In the embodiments of the present invention, features are stored in the database through feature tables. As shown in FIG. 2, there are multiple feature tables t in the database K, and each feature table t includes multiple natural features f.
It should be noted that a feature library in the embodiments of the present invention may or may not correspond to a library in the data warehouse; similarly, a feature table in the embodiments of the present invention may or may not correspond to a table in the data warehouse. There is no logical dependency between them.
In the embodiments of the present invention, the multiple features included in a feature table are defined in the form of a feature list; that is, each feature table includes a feature list part, and the feature list part contains at least one feature. A feature may be represented in the feature list as shown in Table 1.
Table 1
Feature ID | Chinese name | English name | Data type | Description | Attributes
Of course, Table 1 is only one way of identifying a feature; representation elements may also be deleted from or added to Table 1.
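A feature definition along the lines of Table 1, together with the constituents of a feature table named in step S101, might be represented as follows (a sketch only; the field names are illustrative and not the patent's actual schema):

```python
from dataclasses import dataclass, field

@dataclass
class Feature:                       # one row of Table 1
    feature_id: str
    name_cn: str
    name_en: str
    data_type: str
    description: str = ""
    attributes: str = ""

@dataclass
class FeatureTable:
    name: str
    library: str                     # feature library the table belongs to
    business: list                   # business labels (same level only)
    depends_on: list                 # names of feature tables this table depends on
    features: list = field(default_factory=list)  # the feature list
    processing_logic: str = ""       # e.g. a SQL statement producing the features
```

For example, a table that depends on feature table 4 would carry `depends_on=["feature_table_4"]` in its dependency feature table part.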
In the embodiments of the present invention, in addition to the feature list, a feature table also includes a dependency feature table. For example, if a feature table includes feature A, feature A has a dependency relationship with feature B, and feature B belongs to feature table B, then the dependency feature table includes feature table B.
In the embodiments of the invention, a feature table also includes the feature library to which it belongs, the business to which it belongs, and the feature processing logic. The feature library indicates which library the feature table belongs to, and the business indicates which kind of business the features in the feature table belong to.
In the embodiments of the present invention, three kinds of business can be defined: the first is operational data layer business, which can be understood as the user's input information; the second is common dimensional model business, that is, features obtained by model processing or judgment of the user's input information; and the third is application data layer business, that is, features applied directly in certain applications. Exemplarily, as shown in FIG. 3, the embodiment of the present invention provides a general business-level division; of course, other divisions are also possible. A business level can be regarded as a label of a feature table, and the same feature table can have multiple business labels of the same level, for example, feature table 3: application 1, application 2. However, the same feature table cannot carry business-level labels across levels; for example, feature table 1: model 1, application 2 is not allowed.
In the embodiments of the present invention, the processing logic includes a processing program and a program configuration; the processing program may be a SQL statement or another program that can run in a specific environment, and the program configuration must be completed before running. It should be noted that the processing program is only responsible for producing the features and does not care how the target feature data is saved; for example, if the processing program is SQL, it does not contain logic such as insert into [target table] or insert overwrite [target table]. Instead, the saving of features is controlled and tracked by the system at runtime.
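The split between the processing program and persistence could look like this (a sketch assuming a generic `execute` callable supplied by the runtime; the table and column names are hypothetical):

```python
# The processing program is a plain SELECT: it produces the features but
# contains no INSERT INTO / INSERT OVERWRITE. Where the result is written is
# decided and tracked by the runtime, not by the program itself.
job = {
    "program": "SELECT hospital_id, is_grade3a FROM raw_hospital_info",
    "config": {"engine": "hive"},   # program configuration, completed before running
}

def run_job(job, target_table, execute):
    assert "insert" not in job["program"].lower()  # save logic lives in the runtime
    rows = execute(job["program"])                 # run the processing program
    # ... here the runtime would write `rows` into target_table and record the save ...
    return rows
```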
Step S102: determine feature dependency relationships according to the features to be processed and the features in the feature pool, determine, according to the feature dependency relationships, the feature tables that currently have no dependency, and add the feature tables that currently have no dependency to the feature processing path as parallel subtasks.
In the embodiments of the present invention, based on the dependency relationships between the features in the feature pool and the features to be processed, the feature tables that currently have no dependency can be determined and added to the feature processing path as parallel subtasks. In this way, the feature processing path can be determined, which improves feature processing efficiency and also facilitates feature management.
In the embodiments of the present invention, the feature dependency relationships can be determined layer by layer. For example, suppose the feature to be processed is feature A, and feature A is stored in feature table 1, feature table 2, and feature table 3; feature table 1 has a dependency relationship with feature table 4, feature table 2 with feature table 5, and feature table 3 with feature table 6. Therefore, when processing feature A, feature table 4, feature table 5, and feature table 6 need to be processed first, and then feature table 1, feature table 2, and feature table 3.
Optionally, in the embodiments of the present invention, the above can be determined with a simple topological sorting method; that is, the features to be processed are sorted by dependency to finally obtain the processing path.
Optionally, in the embodiments of the present invention, in order to display the feature processing path clearly and facilitate rapid feature processing, the features to be processed are taken as the root node, and the feature tables that have a direct or indirect dependency relationship with the root node are taken as upper-layer nodes, to construct a feature dependency tree. That is, the features to be processed are taken as the root node, and nodes are then built up layer by layer to form a tree-shaped dependency relationship.
Exemplarily, suppose the feature to be processed is feature A, which is stored in feature table 1, feature table 2, and feature table 3; feature table 1 has a dependency relationship with feature table 4, feature table 2 with feature table 5, and feature table 3 with feature table 6; and feature table 4, feature table 5, and feature table 6 are obtained by processing the features in original table 1, original table 2, and original table 3. The resulting dependency tree is shown in FIG. 4.
After the dependency tree is determined, the processing sequence between features can be quickly determined through topological sorting.
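For the dependency tree of FIG. 4, the topological sort can be done layer by layer, so that each layer is a batch of tables with no unmet dependencies (a minimal sketch in Python; the table names mirror the example above and are illustrative):

```python
def processing_batches(deps):
    """deps maps each table to the set of tables it depends on.
    Returns batches of tables; every table in a batch has no unmet
    dependencies, so each batch can be processed in parallel."""
    remaining = {t: set(d) for t, d in deps.items()}
    batches = []
    while remaining:
        ready = sorted(t for t, d in remaining.items() if not d)
        if not ready:
            raise ValueError("cyclic dependency between feature tables")
        batches.append(ready)
        for t in ready:
            del remaining[t]
        for d in remaining.values():
            d.difference_update(ready)   # drop dependencies that are now satisfied
    return batches

# The graph of FIG. 4: original tables 1-3 feed feature tables 4-6,
# which in turn feed feature tables 1-3.
fig4 = {
    "original_1": set(), "original_2": set(), "original_3": set(),
    "feature_4": {"original_1"}, "feature_5": {"original_2"}, "feature_6": {"original_3"},
    "feature_1": {"feature_4"}, "feature_2": {"feature_5"}, "feature_3": {"feature_6"},
}
```

Here `processing_batches(fig4)` yields three batches: the original tables, then feature tables 4 to 6, then feature tables 1 to 3.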
Optionally, in this embodiment of the present invention, a batch of features may be processed in parallel to improve the efficiency of feature processing. For example, in the example above, raw table 1, raw table 2, and raw table 3 can form one batch, since there is no dependency relationship among them. Raw table 1 stores information provided by the BD map, in which hospital information is relatively complete; raw table 2 stores information provided by the SG map, in which school information is relatively complete; and raw table 3 stores information provided by the WW map, in which supermarket information is relatively complete. In this way, raw tables 1, 2, and 3 can undergo feature processing at the same time to obtain feature table 4 (specific information about each hospital: whether it is public or private, whether it is a Grade-A tertiary hospital, and so on), feature table 5 (specific information about each school: whether it is public or private, whether it is a university, middle school, or primary school, whether it is a national key school, and so on), and feature table 6 (specific information about each supermarket: whether it focuses on fresh food or daily necessities, its star ratings for service quality and product quality, and so on).
Further, feature tables 4, 5, and 6 can then be processed at the same time to obtain feature table 1 (information about neighborhoods near Grade-A tertiary hospitals), feature table 2 (information about neighborhoods near Project 211 universities), and feature table 3 (information about neighborhoods near supermarkets with five-star service and product quality). In this way, feature A, the feature that needs to be processed, can be determined, namely neighborhoods suitable for elderly residents.
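As a rough illustration of this kind of batch-level parallelism, the tables of one batch could be handed to a thread pool together; `process_table` below is a hypothetical stand-in for the actual per-table feature-processing logic, not a function defined by the embodiment:

```python
from concurrent.futures import ThreadPoolExecutor

def process_table(table_name):
    # Placeholder for the real feature processing applied to one table.
    return f"features({table_name})"

# Tables in one batch have no dependency relationships among themselves,
# so the whole batch can be submitted as a single parallel subtask.
batch = ["raw_table_1", "raw_table_2", "raw_table_3"]
with ThreadPoolExecutor() as pool:
    # Executor.map preserves the order of its inputs.
    results = list(pool.map(process_table, batch))
```

Only after the whole batch has completed would the next batch in the processing sequence be submitted, since its tables depend on these results.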
Optionally, an embodiment of the present invention proposes a feature processing path generation algorithm, which specifically includes:
(1) Initialize the processing sequence R; the initial sequence R is empty. Initialize the set S as the set of all feature tables; in the previous example, it contains feature tables 1, 2, 3, 4, 5, and 6. Initialize the temporary set C as the set of all raw data tables; in the previous example, it contains raw tables 1, 2, and 3.
(2) While the set C is non-empty, that is, while there are still raw data tables, traverse all feature tables in the set S, marking the table currently visited as Si.
(3) Traverse all tables in the set C (these may be raw tables or feature tables), marking the table currently visited as Cj.
(4) If Cj is determined to be a dependency table of Si, the dependency graph contains an edge from Cj to Si. As shown in Figure 4, the edge from raw table 1 to feature table 4, the edge from raw table 2 to feature table 5, and the edge from raw table 3 to feature table 6 are deleted; this can be understood as removing the out-edges from Cj to Si.
(5) Loop over step (3), then execute step (2).
(6) Take the non-raw tables (C1, C2, ...) out of the current set C, combine them into one parallel subtask task(C1|C2|...), and append it to the end of the processing sequence R; then clear the set C to an empty set.
(7) Traverse all feature tables in the set S, find all tables whose in-degree is 0, delete them from S, and add them to the set C.
(8) Return to step (2).
(9) If the set S is determined to be non-empty at this point, report that a circular dependency exists and exit the path computation.
(10) When the procedure ends, all subtasks in the sequence R constitute the processing path of the extraction task.
To better understand this method, the feature dependency tree in Figure 4 is taken as an example. First, the processing sequence R is initialized to empty; then the set S is initialized to all feature tables, and the temporary set C is initialized to all raw data tables.
In the first iteration, the set C is not empty. The tables currently traversed are S1 and C1; for example, S1 is feature table 4 and C1 is raw table 1. Since raw table 1 is a dependency table of feature table 4, the corresponding in-degree of feature table 4 is deleted. The traversal then continues: S2 is feature table 6 and C2 is raw table 2, and these two tables have no dependency relationship. Continuing, S3 is feature table 5 and C3 is raw table 2; raw table 2 is a dependency table of feature table 5, so the in-degree of feature table 5 is deleted, and the process continues until the in-degree of feature table 6 is deleted as well. The set C is then updated, at which point it contains feature tables; the procedure above continues to delete the in-degrees of the feature tables, and the feature tables whose in-degrees have been deleted are combined into one parallel subtask. The set C is then cleared, and the steps above are repeated until the in-degrees of all feature tables have been deleted, yielding multiple parallel subtasks.
That is to say, in this embodiment of the present invention, the set of tables that currently have no dependencies must be found, and the edges linking those tables to the rest of the dependency graph must be deleted, thereby generating the next batch of dependency-free tables, until all tables have been added to the processing sequence. The difference from standard topological sorting is that, at each step, the algorithm combines the tables that currently have no dependencies into one parallel subtask; running them in parallel speeds up overall execution.
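Assuming each table's dependency list is known in advance, the batched topological sort described in steps (1) to (10) can be sketched in Python. The function name `build_processing_path` and the data layout are illustrative choices, not part of the embodiment:

```python
def build_processing_path(raw_tables, feature_tables, depends_on):
    """Group tables into parallel batches by level-wise topological sorting.

    raw_tables:     names of the raw data tables (they have no dependencies)
    feature_tables: names of the feature tables
    depends_on:     maps each feature table to the tables it depends on
    """
    raw = set(raw_tables)
    remaining = set(feature_tables)              # set S: tables not yet scheduled
    in_degree = {t: len(depends_on.get(t, ())) for t in remaining}
    path = []                                    # sequence R of parallel batches
    current = set(raw)                           # set C: tables ready to process

    while current:
        # Delete the edges from the current batch into still-pending tables.
        for done in current:
            for table in remaining:
                if done in depends_on.get(table, ()):
                    in_degree[table] -= 1
        # The non-raw tables of the batch form one parallel subtask.
        batch = sorted(t for t in current if t not in raw)
        if batch:
            path.append(batch)
        # Next batch: tables whose dependencies are now all satisfied.
        current = {t for t in remaining if in_degree[t] == 0}
        remaining -= current

    if remaining:                                # leftover tables imply a cycle
        raise ValueError(f"circular dependency detected: {sorted(remaining)}")
    return path
```

On a graph shaped like the earlier example (raw tables feeding feature tables 4 to 6, which in turn feed feature tables 1 to 3, with hypothetical edges), this returns two batches, each runnable as one parallel subtask; a cyclic graph raises an error, matching step (9).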
Step S103: Perform parallel feature processing on the feature tables that currently have no dependency relationships according to the feature processing path, so as to obtain training data from the features after parallel processing.
In this embodiment of the present invention, the required features can be processed according to the determined feature processing path. Once this processing is complete, the first stage, in which raw features are processed into natural features, is finished; the natural features then need to be processed into machine features.
In this embodiment of the present invention, machine features can be obtained through multiple consecutive processing steps, and the result of each processing step can be saved to facilitate later feature reuse. For example, when numericizing the categorical attribute "does the customer smoke: yes|no", the correspondence between category and value needs to be recorded, for example, smokes -> 1, does not smoke -> 0. Other transformations, such as mean-variance normalization, need to record the mean and variance of the feature. Therefore, in this embodiment, the machine feature, namely 1 or 0, can be obtained after the mean step and the variance step.
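A minimal sketch of recording such step states might look as follows; the function names `fit_categorical` and `fit_standardize`, and the sample values, are illustrative assumptions rather than names used by the embodiment:

```python
import statistics

def fit_categorical(values):
    """Record the category -> number mapping so it can be replayed later."""
    return {category: index for index, category in enumerate(sorted(set(values)))}

def fit_standardize(values):
    """Record the mean and (population) standard deviation of a feature."""
    return {"mean": statistics.fmean(values), "std": statistics.pstdev(values)}

# Categorical step state, e.g. smokes -> 1, does not smoke -> 0.
mapping = fit_categorical(["smokes", "does not smoke", "smokes"])

# Normalization step state: the saved mean/variance let new data be
# transformed exactly the same way when the feature is reused.
state = fit_standardize([160.0, 170.0, 180.0])
normalized = [(v - state["mean"]) / state["std"] for v in [160.0, 170.0, 180.0]]
```

The point is that each fitted state is a small, serializable record; applying the transformation to later data only replays the saved state instead of refitting it.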
In this embodiment of the present invention, the process of obtaining machine features through multiple consecutive processing steps may also be called a machine-feature processing pipeline. As shown in Figure 5, the processing from natural features to machine features is performed at the granularity of a single feature, although multiple features may also share one pipeline. Multiple processing steps form a processing pipeline: each step on the pipeline receives the output of the previous step and, after processing, passes its output to the next step. Each step may or may not output a step state.
Each step in the pipeline needs to support processing one or more features, because there may be only one feature entering the pipeline, yet an intermediate step may turn one feature into several. For example, one-hot encoding adds a new feature for each possible value of the original feature: the feature "does the customer smoke" is processed into two features, "customer smokes" and "customer does not smoke".
That is to say, through the processing pipeline, the intermediate states of feature processing can be saved, and the intermediate steps can be configured in a customized way. For example, splitting the normalization process in the example above into a mean step and a variance step makes it possible to save the features produced during normalization, so that they can be reused.
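The pipeline idea can be sketched as follows, assuming hypothetical `Pipeline` and `OneHotStep` classes that are not part of the embodiment; note how the one-hot step saves its observed categories as step state and expands one feature into several:

```python
class OneHotStep:
    """Pipeline step that expands one categorical feature into one
    0/1 feature per observed value, keeping the categories as step state."""
    def __init__(self):
        self.categories = {}  # per-feature state, saved so the step can be replayed

    def run(self, features):
        out = {}
        for name, values in features.items():
            cats = self.categories.setdefault(name, sorted(set(values)))
            for cat in cats:
                out[f"{name}={cat}"] = [1 if v == cat else 0 for v in values]
        return out

class Pipeline:
    """Each step receives the previous step's output and passes its own on."""
    def __init__(self, steps):
        self.steps = steps

    def run(self, features):  # features: {feature name: list of values}
        for step in self.steps:
            features = step.run(features)
        return features

machine = Pipeline([OneHotStep()]).run({"smokes": ["yes", "no", "yes"]})
```

Here one input feature yields the two machine features "smokes=no" and "smokes=yes", and the saved `categories` state allows the same expansion to be applied to new data later.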
To better explain the embodiments of the present application, a feature processing method suitable for machine learning provided by an embodiment of the present application is described below in conjunction with a specific implementation scenario. The method is used to extract feature S, which is located in feature table 1; feature table 1 is associated with feature tables 2, 3, and 4, and feature table 2 is associated with feature tables 5 and 6, as shown in Figure 6:
Step S601: Obtain the feature processing request corresponding to the data.
Step S602: Build a feature pool from the features in feature tables 1, 2, 3, 4, 5, and 6.
Step S603: Build a dependency tree from the features in the feature pool. The dependency tree can be represented as follows: feature S is the root node, the node above the root is feature table 1, the nodes above feature table 1 are feature tables 2, 3, and 4, and the nodes above feature table 2 are feature tables 5 and 6.
Step S604: Find the set of tables that currently have no dependencies and delete their associations with the rest of the dependency graph, thereby generating the next batch of dependency-free tables, until all tables have been added to the processing sequence. The resulting processing sequence is: feature tables 5 and 6 > feature tables 2, 3, and 4 > feature table 1.
Step S605: Perform feature processing according to the processing sequence to obtain feature S.
Step S606: Pass feature S through multiple steps to obtain machine feature T, and save the feature results of those steps.
Based on the same technical concept, an embodiment of the present application provides a feature processing device suitable for machine learning. As shown in FIG. 7, the device 700 includes:
an obtaining unit 701, configured to, after obtaining the feature processing request corresponding to the data, build a feature pool from the features in each feature table, where each feature table is composed of at least a feature list, the feature library it belongs to, a dependency feature table, the business it belongs to, and feature processing logic; the feature list includes at least one feature; the dependency feature table is used to record the other feature tables that have a dependency relationship with each feature table; and the feature processing request includes the features that need to be processed;
a feature processing path determination unit 702, configured to determine feature dependency relationships according to the features to be processed and the features in the feature pool, determine the feature tables that currently have no dependency relationships according to the feature dependency relationships, and add the feature tables that currently have no dependency relationships to the feature processing path as parallel subtasks; and
a feature processing unit 703, configured to perform parallel feature processing on the feature tables that currently have no dependency relationships according to the feature processing path, so as to obtain training data from the features after parallel processing.
Optionally, the feature processing path determination unit 702 is specifically configured to:
take the feature to be processed as the root node, and take the feature tables that have a direct or indirect dependency relationship with the root node as upper-layer nodes, to build a feature dependency tree.
Optionally, the feature processing path determination unit 702 is specifically configured to:
determine the feature tables in the feature dependency tree that currently have no dependency relationships, add them as a parallel subtask to the first processing path in the feature processing path table, delete the associations between those feature tables and the other feature tables in the feature dependency tree, and return to the step of determining the feature tables in the feature dependency tree that currently have no dependency relationships, adding the feature tables that currently have no dependency relationships as a parallel subtask to the second processing path in the feature processing path table, until all feature tables in the dependency tree have been added to the feature processing path table.
Optionally, the feature processing unit 703 is further configured to:
pass the processed features through multiple consecutive processing steps to obtain machine features.
Based on the same technical concept, an embodiment of the present application provides a computer device. As shown in FIG. 8, the device includes at least one processor 801 and a memory 802 connected to the at least one processor. The embodiment of the present application does not limit the specific connection medium between the processor 801 and the memory 802; in FIG. 8, a bus connection between the processor 801 and the memory 802 is taken as an example. A bus may be divided into an address bus, a data bus, a control bus, and so on.
In this embodiment of the present application, the memory 802 stores instructions executable by the at least one processor 801, and by executing the instructions stored in the memory 802, the at least one processor 801 can perform the steps included in the aforementioned feature processing method suitable for machine learning.
The processor 801 is the control center of the computer device and can use various interfaces and lines to connect the various parts of the terminal device, obtaining the client address by running or executing the instructions stored in the memory 802 and invoking the data stored in the memory 802. Optionally, the processor 801 may include one or more processing units, and may integrate an application processor and a modem processor, where the application processor mainly handles the operating system, user interface, applications, and the like, while the modem processor mainly handles wireless communication. It can be understood that the modem processor may also not be integrated into the processor 801. In some embodiments, the processor 801 and the memory 802 may be implemented on the same chip; in other embodiments, they may be implemented on separate chips.
The processor 801 may be a general-purpose processor, such as a central processing unit (CPU), a digital signal processor, an application-specific integrated circuit (ASIC), a field-programmable gate array or another programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component, and can implement or execute the methods, steps, and logical block diagrams disclosed in the embodiments of the present application. The general-purpose processor may be a microprocessor or any conventional processor. The steps of the methods disclosed in the embodiments of the present application may be executed and completed directly by a hardware processor, or by a combination of hardware and software modules in the processor.
As a non-volatile computer-readable storage medium, the memory 802 can be used to store non-volatile software programs, non-volatile computer-executable programs, and modules. The memory 802 may include at least one type of storage medium, for example, flash memory, a hard disk, a multimedia card, a card-type memory, random access memory (RAM), static random access memory (SRAM), programmable read-only memory (PROM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), magnetic memory, a magnetic disk, an optical disc, and so on. The memory 802 may be any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer, but is not limited thereto. The memory 802 in this embodiment of the present application may also be a circuit or any other device capable of realizing a storage function, used to store program instructions and/or data.
Based on the same technical concept, an embodiment of the present application provides a computer-readable storage medium storing a computer program executable by a computer device; when the program runs on the computer device, it causes the computer device to execute the steps of the feature processing method suitable for machine learning.
A person of ordinary skill in the art can understand that all or some of the steps of the above method embodiments may be implemented by hardware instructed by a program. The aforementioned program may be stored in a computer-readable storage medium; when executed, the program performs the steps of the above method embodiments. The aforementioned storage medium includes various media that can store program code, such as a removable storage device, read-only memory (ROM), random access memory (RAM), a magnetic disk, or an optical disc.
Alternatively, if the above integrated unit of the present application is implemented in the form of a software functional module and sold or used as an independent product, it may also be stored in a computer-readable storage medium. Based on this understanding, the technical solutions of the embodiments of the present application, in essence or in the part contributing to the prior art, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions to cause a computer device (which may be a personal computer, a server, a network device, or the like) to execute all or part of the methods described in the various embodiments of the present application. The aforementioned storage medium includes various media that can store program code, such as a removable storage device, ROM, RAM, a magnetic disk, or an optical disc.
The above are only specific implementations of the present application, but the protection scope of the present application is not limited thereto. Any person skilled in the art can easily conceive of changes or substitutions within the technical scope disclosed in the present application, and such changes or substitutions shall be covered by the protection scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (10)

  1. A feature processing method suitable for machine learning, wherein the method comprises:
    after obtaining a feature processing request corresponding to data, building a feature pool from the features in each feature table, wherein each feature table is composed of at least a feature list, the feature library it belongs to, a dependency feature table, the business it belongs to, and feature processing logic; the feature list comprises at least one feature; the dependency feature table is used to record the other feature tables that have a dependency relationship with each feature table; and the feature processing request comprises the features that need to be processed;
    determining feature dependency relationships according to the features to be processed and the features in the feature pool, determining the feature tables that currently have no dependency relationships according to the feature dependency relationships, and adding the feature tables that currently have no dependency relationships to a feature processing path as parallel subtasks; and
    performing parallel feature processing on the feature tables that currently have no dependency relationships according to the feature processing path, so as to obtain training data from the features after parallel processing.
  2. The method according to claim 1, wherein the determining feature dependency relationships according to the features to be processed and the features in the feature pool comprises:
    taking the feature to be processed as a root node, and taking the feature tables that have a direct or indirect dependency relationship with the root node as upper-layer nodes, to build a feature dependency tree.
  3. The method according to claim 2, wherein the determining the feature tables that currently have no dependency relationships according to the feature dependency relationships, and adding the feature tables that currently have no dependency relationships to the feature processing path as parallel subtasks, comprises:
    determining the feature tables in the feature dependency tree that currently have no dependency relationships, adding them as a parallel subtask to a first processing path in a feature processing path table, deleting the associations between those feature tables and the other feature tables in the feature dependency tree, and returning to the step of determining the feature tables in the feature dependency tree that currently have no dependency relationships, adding the feature tables that currently have no dependency relationships as a parallel subtask to a second processing path in the feature processing path table, until all feature tables in the dependency tree have been added to the feature processing path table.
  4. The method according to claim 1, wherein after performing parallel feature processing on the feature tables that currently have no dependency relationships according to the feature processing path, the method further comprises:
    passing the processed features through multiple consecutive processing steps to obtain machine features.
  5. A feature processing device suitable for machine learning, wherein the device comprises:
    an obtaining unit, configured to, after obtaining a feature processing request corresponding to data, build a feature pool from the features in each feature table, wherein each feature table is composed of at least a feature list, the feature library it belongs to, a dependency feature table, the business it belongs to, and feature processing logic; the feature list comprises at least one feature; the dependency feature table is used to record the other feature tables that have a dependency relationship with each feature table; and the feature processing request comprises the features that need to be processed;
    a feature processing path determination unit, configured to determine feature dependency relationships according to the features to be processed and the features in the feature pool, determine the feature tables that currently have no dependency relationships according to the feature dependency relationships, and add the feature tables that currently have no dependency relationships to a feature processing path as parallel subtasks; and
    a feature processing unit, configured to perform parallel feature processing on the feature tables that currently have no dependency relationships according to the feature processing path, so as to obtain training data from the features after parallel processing.
  6. The device according to claim 5, wherein the feature processing path determination unit is specifically configured to:
    take the feature to be processed as a root node, and take the feature tables that have a direct or indirect dependency relationship with the root node as upper-layer nodes, to build a feature dependency tree.
  7. The device according to claim 6, wherein the feature processing path determination unit is specifically configured to:
    determine the feature tables in the feature dependency tree that currently have no dependency relationships, add them as a parallel subtask to a first processing path in a feature processing path table, delete the associations between those feature tables and the other feature tables in the feature dependency tree, and return to the step of determining the feature tables in the feature dependency tree that currently have no dependency relationships, adding the feature tables that currently have no dependency relationships as a parallel subtask to a second processing path in the feature processing path table, until all feature tables in the dependency tree have been added to the feature processing path table.
  8. The device according to claim 5, wherein the feature processing unit is further configured to:
    pass the processed features through multiple consecutive processing steps to obtain machine features.
  9. A computer device, comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein when the processor executes the computer program, the steps of the method according to any one of claims 1 to 4 are implemented.
  10. A computer-readable storage medium, storing a computer program executable by a computer device, wherein when the program runs on the computer device, the computer is caused to execute the method according to any one of claims 1 to 4.
PCT/CN2020/095934 2019-06-26 2020-06-12 Feature processing method applicable to machine learning, and device WO2020259325A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201910562484.1 2019-06-26
CN201910562484.1A CN110275889B (en) 2019-06-26 2019-06-26 Feature processing method and device suitable for machine learning

Publications (1)

Publication Number Publication Date
WO2020259325A1 true WO2020259325A1 (en) 2020-12-30

Family

ID=67963408

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/095934 WO2020259325A1 (en) 2019-06-26 2020-06-12 Feature processing method applicable to machine learning, and device

Country Status (2)

Country Link
CN (1) CN110275889B (en)
WO (1) WO2020259325A1 (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110275889B (en) * 2019-06-26 2023-11-24 深圳前海微众银行股份有限公司 Feature processing method and device suitable for machine learning
CN111581305B (en) * 2020-05-18 2023-08-08 抖音视界有限公司 Feature processing method, device, electronic equipment and medium
CN111752967A (en) * 2020-06-12 2020-10-09 第四范式(北京)技术有限公司 SQL-based data processing method and device, electronic equipment and storage medium
CN111859928A (en) * 2020-07-30 2020-10-30 网易传媒科技(北京)有限公司 Feature processing method, device, medium and computing equipment

Citations (6)

Publication number Priority date Publication date Assignee Title
US20090037466A1 (en) * 2007-07-31 2009-02-05 Cross Micah M Method and system for resolving feature dependencies of an integrated development environment with extensible plug-in features
CN103019651A (en) * 2012-08-02 2013-04-03 青岛海信传媒网络技术有限公司 Parallel processing method and device for complex tasks
CN103645948A (en) * 2013-11-27 2014-03-19 南京师范大学 Dependency-based parallel computing method for intensive data
CN108537543A (en) * 2018-03-30 2018-09-14 百度在线网络技术(北京)有限公司 Method for parallel processing, device, equipment and the storage medium of block chain data
CN108595157A (en) * 2018-04-28 2018-09-28 百度在线网络技术(北京)有限公司 Processing method, device, equipment and the storage medium of block chain data
CN110275889A (en) * 2019-06-26 2019-09-24 深圳前海微众银行股份有限公司 Feature processing method and device suitable for machine learning

Family Cites Families (2)

Publication number Priority date Publication date Assignee Title
WO2014178829A1 (en) * 2013-04-30 2014-11-06 Hewlett-Packard Development Company, L.P. Dependencies between feature flags
US10666507B2 (en) * 2017-06-30 2020-05-26 Microsoft Technology Licensing, Llc Automatic reconfiguration of dependency graph for coordination of device configuration


Also Published As

Publication number Publication date
CN110275889B (en) 2023-11-24
CN110275889A (en) 2019-09-24

Similar Documents

Publication Publication Date Title
WO2020259325A1 (en) Feature processing method applicable to machine learning, and device
US11379755B2 (en) Feature processing tradeoff management
US20220391763A1 (en) Machine learning service
US20200050968A1 (en) Interactive interfaces for machine learning model evaluations
US9053171B2 (en) Clustering data points
US11182691B1 (en) Category-based sampling of machine learning data
US10339465B2 (en) Optimized decision tree based models
US9519862B2 (en) Domains for knowledge-based data quality solution
JP2021518024A (en) How to generate data for machine learning algorithms, systems
CN112434024B (en) Relational database-oriented data dictionary generation method, device, equipment and medium
US9754015B2 (en) Feature rich view of an entity subgraph
AU2012217093B2 (en) Method, system and computer program to provide fares detection from rules attributes
US11379466B2 (en) Data accuracy using natural language processing
CN113760891B (en) Data table generation method, device, equipment and storage medium
CN107506484B (en) Operation and maintenance data association auditing method, system, equipment and storage medium
WO2019223104A1 (en) Method and apparatus for determining event influencing factors, terminal device, and readable storage medium
CN114385652A (en) Data blood relationship construction method and system, electronic device and storage medium
US20100106538A1 (en) Determining disaster recovery service level agreements for data components of an application
WO2023098034A1 (en) Business data report classification method and apparatus
CN113641654B (en) Marketing treatment rule engine method based on real-time event
CN115934161A (en) Code change influence analysis method, device and equipment
CN115543428A (en) Simulated data generation method and device based on strategy template
KR20190010091A (en) Anonymization Device for Preserving Utility of Data and Method thereof
CN113901046A (en) Virtual dimension table construction method and device
CN113641705A (en) Marketing disposal rule engine method based on calculation engine

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20832495

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20832495

Country of ref document: EP

Kind code of ref document: A1

32PN Ep: public notification in the ep bulletin as address of the addressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205 DATED 05/04/2022)
