CN111832740A - Method for deriving machine learning characteristics from structured data in real time - Google Patents

Method for deriving machine learning characteristics from structured data in real time

Info

Publication number
CN111832740A
Authority
CN
China
Prior art keywords
machine learning
data
feature
real time
structured data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201911393160.6A
Other languages
Chinese (zh)
Inventor
万晶
李学文
樊静文
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Creditx Information Technology Co ltd
Original Assignee
Shanghai Creditx Information Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Creditx Information Technology Co ltd
Priority to CN201911393160.6A
Publication of CN111832740A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/242 Query formulation
    • G06F16/2433 Query languages
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/25 Integrating or interfacing systems involving database management systems
    • G06F16/252 Integrating or interfacing systems involving database management systems between a Database Management System and a front-end application

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • Medical Informatics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a method for deriving machine learning features from structured data in real time, comprising the following steps: defining a computer language that includes a plurality of computing functions for feature processing in machine learning; developing feature computation logic using the computer language; generating executable program code from the feature computation logic; and executing the program code to apply the feature computation logic to the corresponding raw data to derive machine learning features. The invention greatly reduces the complexity and development cycle of feature development, so that modelers with a data analysis background can generate the required machine learning features flexibly and conveniently.

Description

Method for deriving machine learning characteristics from structured data in real time
Technical Field
The invention belongs to the field of machine learning, and in particular relates to a method for deriving machine learning features from structured data in real time.
Background
Machine learning is an artificial intelligence technique that uses probability and statistics to learn rules (generally called a model) from data and then applies those rules to reason about unseen data. It is widely applied in fields such as finance, medicine, and public affairs.
When machine learning is used to solve a practical problem, feature processing, also called feature computation, is a very important step. Commonly used methods such as xgboost, gbdt, and lightgbm generally do not consume the raw data (generally referred to as training data) directly; the data is first processed into features. Features are typically statistical derivations of the raw data or cross derivations of several raw data fields, so that the information in the raw data is fully mined and the resulting model performs as well as possible.
There is no fixed method for deriving features, so in current production practice they are mostly produced by hand-written software code, with new code written whenever a new feature is needed. Sometimes the code is abstracted into a more general module that is driven by a configuration file, so that it can be reused and less code has to be written.
The current approach has two problems. First, feature derivation requires a large amount of development work. Most machine learning model developers have backgrounds in statistics and mathematics and need help from additional development engineers, so the development cycle is long. Second, after a model has been developed it must be put into production and enter an online decision-making system, where the feature processing work done during model development cannot be reused. As businesses across industries develop rapidly, models need to be optimized and updated more and more frequently, and the existing approach finds it increasingly difficult to meet business needs.
Disclosure of Invention
The invention provides a method for deriving machine learning features from structured data in real time. Some embodiments combine the traditional SQL query method with machine learning feature processing, which lowers the barrier to use, applies the same logic in offline and online processing, shortens the development cycle, and reduces the workload of pre-launch testing.
To achieve this purpose, the invention adopts the following technical solution:
A method for deriving machine learning features from structured data in real time, the method comprising: defining a computer language that includes a plurality of computing functions for feature processing in machine learning; developing feature computation logic using the computer language; generating executable program code from the feature computation logic; and executing the program code to apply the feature computation logic to the corresponding raw data to derive machine learning features.
Preferably, the computer language is SQL. Although the language could technically be based on others, SQL is the most familiar to model developers and analysts.
Preferably, executing the program code to apply the feature computation logic to the corresponding raw data comprises: loading the raw data into computer memory according to a predefined data format; and preprocessing the raw data.
Preferably, the predefined data format comprises: field names, field types, and missing-value/outlier processing logic for the data.
Preferably, the preprocessing includes a plurality of processing methods, which are abstracted into configurable options.
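As an illustration of a predefined data format with configurable missing-value and outlier handling, the following is a minimal Python sketch. The patent does not give an implementation; the names `FieldSpec`, `load_record`, and `SCHEMA`, the missing-value tokens, and the example fields are all assumptions made for this example.

```python
from dataclasses import dataclass
from typing import Any, Callable, Optional


@dataclass(frozen=True)
class FieldSpec:
    """Hypothetical description of one raw-data field (not from the patent)."""
    name: str
    cast: Callable[[str], Any]            # field type, e.g. int or float
    missing_tokens: frozenset = frozenset({"", "NA", "null"})
    valid_range: Optional[tuple] = None   # (low, high); values outside become None


def load_record(raw: dict, schema: list) -> dict:
    """Convert one raw record to typed values, applying the configured
    missing-value and outlier rules before any feature is computed."""
    out = {}
    for spec in schema:
        text = raw.get(spec.name)
        if text is None or str(text).strip() in spec.missing_tokens:
            out[spec.name] = None          # configured missing value
            continue
        try:
            value = spec.cast(str(text).strip())
        except ValueError:
            out[spec.name] = None          # unparseable values are treated as missing
            continue
        if spec.valid_range is not None:
            low, high = spec.valid_range
            if not (low <= value <= high):
                value = None               # outside the business-reasonable range
        out[spec.name] = value
    return out


# Assumed example schema: field name, field type, and cleaning options.
SCHEMA = [
    FieldSpec("user_id", str, missing_tokens=frozenset({""})),
    FieldSpec("age", int, valid_range=(0, 120)),
    FieldSpec("amount", float),
]

record = load_record({"user_id": "u1", "age": "999", "amount": "NA"}, SCHEMA)
print(record)
```

Because the cleaning rules live in the schema rather than in feature code, adding a new rule in this sketch means changing configuration only.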
Preferably, the feature computation logic comprises an extension function, which processes the raw data first, and an aggregation function, which then processes the raw data and/or the output of the extension function.
Preferably, the executable program code is generated from the feature computation logic using the parser of an existing database, a tool such as yacc, or regular expressions combined with corresponding grammar parsing.
Compared with the prior art, the invention has the beneficial effects that:
1. The complexity and development cycle of feature development are greatly reduced, and modelers with a data analysis background can generate the required machine learning features flexibly and conveniently;
2. The method does not depend on external systems or technologies, uses exactly the same operations and steps in both offline batch and online real-time computation, allows offline computation logic to be reused in the production environment, and shortens the cycle for bringing a model online.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to the drawings without creative efforts.
FIG. 1 is a diagram illustrating a class of computation functions of a feature query language according to an embodiment of the present invention.
FIG. 2 is a flow chart illustrating the operation of the process according to an embodiment of the present invention.
FIG. 3 is a schematic diagram of the overall architecture and external interaction of the embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be obtained by a person skilled in the art without inventive effort based on the embodiments of the present invention, are within the scope of the present invention.
In the description of the present invention, it is to be understood that the terms "longitudinal", "lateral", "upper", "lower", "front", "rear", "left", "right", "vertical", "horizontal", "top", "bottom", "inner", "outer", and the like, indicate orientations or positional relationships based on those shown in the drawings, and are used merely for convenience of description and for simplicity of description, and do not indicate or imply that the referenced devices or elements must have a particular orientation, be constructed in a particular orientation, and be operated, and thus, are not to be construed as limiting the present invention.
Referring to FIGS. 1-3, in one embodiment, the method comprises the following steps:
1. Based on the feature processing methods commonly used in machine learning, a set of computation subunits, namely computation functions, is abstracted and summarized. These include statistics such as quantiles, and data cleaning and conversion such as date conversion. The functions are added to the standard SQL specification to form an extended SQL language, which is the basis for defining the feature computation logic.
2. The data to be processed is analyzed, and its field names, field types, and missing-value/outlier processing logic are specified. In common machine learning practice this logic is hand-coded; here it is abstracted into configurable options and applied uniformly before the features are computed. Typical operations include setting certain values to null and replacing values that fall outside a range that is reasonable for the business.
3. Feature computation logic is developed using the feature definition SQL. The logic applies computation functions to fields in the raw data to derive new values. Two kinds of function may be involved. The first kind performs preprocessing: it converts and processes the raw data, for example with simple arithmetic operations or date format cleaning, and its result is not a final result; such a function is called an extension function. The second kind is an aggregation function, whose result is the final feature value, namely the variable used by machine learning. The input of an aggregation function may be the raw input data or the output of a preprocessing function. The computation is divided into two stages for two reasons: some complex operations cannot be completed in a single pass, and some operations share preprocessing steps, so multiple aggregation functions can share the output of the same preprocessing function.
4. The feature definition SQL is processed with a syntax parser to generate an executable data structure. The parser of an existing database may be used, as may tools such as yacc, or regular expressions combined with corresponding grammar parsing. The extended functions must be added during parsing. At this point the SQL query is compiled into a program code structure that can actually be executed.
5. The raw data is loaded into memory according to the format defined in step 2, and missing values and outliers are processed.
6. The feature derivation logic is applied to the data produced in the previous step: the preprocessing functions run first, then the aggregation functions, and the result data is filtered if the SQL defines corresponding conditions. This step involves the concrete implementation of each function, which must be exactly consistent with the logic used in the machine learning method. The result is the feature variables that can be used to train the machine learning model.
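The steps above can be sketched with Python's standard `sqlite3` module, which lets user-defined scalar and aggregate functions be registered with an existing database parser. This is only one possible reading of the embodiment, not the patent's implementation: the function names `days_between` and `quantile`, the table `txn`, and the sample rows are assumptions for illustration.

```python
import sqlite3
from datetime import date


def days_between(later, earlier):
    """Extension function: row-level conversion of two ISO dates to a day gap."""
    if later is None or earlier is None:
        return None
    return (date.fromisoformat(later) - date.fromisoformat(earlier)).days


class Quantile:
    """Aggregation function: quantile(value, q) with linear interpolation."""

    def __init__(self):
        self.values = []
        self.q = 0.5

    def step(self, value, q):
        self.q = q
        if value is not None:          # missing values are skipped
            self.values.append(value)

    def finalize(self):
        if not self.values:
            return None
        data = sorted(self.values)
        pos = self.q * (len(data) - 1)
        lo = int(pos)
        hi = min(lo + 1, len(data) - 1)
        return data[lo] + (data[hi] - data[lo]) * (pos - lo)


# Step 1: extend SQL with the new functions (SQLite's parser is reused).
conn = sqlite3.connect(":memory:")
conn.create_function("days_between", 2, days_between)
conn.create_aggregate("quantile", 2, Quantile)

# Step 5: load the raw data into memory (assumed sample rows).
conn.execute("CREATE TABLE txn (user_id TEXT, amount REAL, txn_date TEXT)")
conn.executemany(
    "INSERT INTO txn VALUES (?, ?, ?)",
    [("u1", 10.0, "2019-12-01"), ("u1", 30.0, "2019-12-11"),
     ("u1", 20.0, "2019-12-21"), ("u2", 5.0, "2019-12-29")],
)

# Step 3: feature computation logic written as SQL.
FEATURE_SQL = """
SELECT user_id,
       quantile(amount, 0.5) AS amount_median,
       MIN(days_between('2019-12-30', txn_date)) AS days_since_last_txn
FROM txn
GROUP BY user_id
ORDER BY user_id
"""

# Step 6: the extension function runs per row, then the aggregations.
features = conn.execute(FEATURE_SQL).fetchall()
print(features)
```

Here `days_between` plays the role of an extension function whose output feeds an aggregation (`MIN`), while `quantile` aggregates a raw field directly.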
In another embodiment, the method comprises the following steps:
1. An SQL-based feature definition language is defined. Its basic format does not differ from standard ANSI SQL, but its functions are extended: besides traditional statistical functions such as min/max/mean/sum/std/quantile, it includes computations commonly used to process data in machine learning, such as a similarity function that computes the similarity of two texts, or the trend of change in transaction-flow data. This step establishes an initial specification and is a one-time operation: once defined, the specification is followed without being redefined each time. The language specification may later be extended as the business develops, most typically by adding new computation functions.
2. The format of the raw data is defined, including the field names, field types, missing-value processing logic, and value ranges of the raw data used for feature computation.
3. Using the feature definition language from step 1, feature computation logic is defined according to the business data. Writing this logic requires only a background in SQL, so the barrier is very low: data processing and computation staff generally have the necessary knowledge and can write the logic without special training.
4. The feature definition is parsed, token by token, into an executable data structure. The data structure records which pre-operations must be performed on the data before computation (for example, whether some values need to be removed or data types converted), which logic must be computed, which fields are involved, and whether the computed results need to be converted. This step can be abstracted into a common lexical and syntactic parser, forming a reusable component. The steps in FIG. 2, from loading the feature definition SQL … to assembling the executable data structure, can be understood as building the environment in which the extension functions and aggregation functions that follow are executed.
5. The raw data is loaded into a working area of computer memory according to the format defined in step 2.
6. The data loaded into memory is computed according to the definition from step 4 to obtain the final result.
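To make step 4 concrete, the following Python sketch parses a very small subset of a feature definition into an executable data structure using regular expressions, then runs it on rows held in memory. It is an assumption-laden illustration rather than the patent's parser: the grammar (only `func(field) AS name` items), the `AGGREGATES` registry, and the names `FeatureStep`, `compile_features`, and `run` are hypothetical.

```python
import re
import statistics
from dataclasses import dataclass
from typing import Callable

# Hypothetical registry mapping function names in the language to implementations.
AGGREGATES = {
    "min": min,
    "max": max,
    "sum": sum,
    "mean": statistics.fmean,
    "std": statistics.pstdev,
}


@dataclass(frozen=True)
class FeatureStep:
    """One entry of the executable data structure: what to compute on which field."""
    output: str
    func: Callable
    field: str


QUERY = re.compile(r"^\s*SELECT\s+(.+?)\s+FROM\s+(\w+)\s*$", re.IGNORECASE | re.DOTALL)
ITEM = re.compile(r"^\s*(\w+)\s*\(\s*(\w+)\s*\)\s+AS\s+(\w+)\s*$", re.IGNORECASE)


def compile_features(sql: str) -> list:
    """Parse 'SELECT f(x) AS a, g(y) AS b FROM t' into a list of FeatureStep.
    Only single-argument functions are supported in this sketch."""
    match = QUERY.match(sql)
    if match is None:
        raise ValueError("unsupported feature definition")
    steps = []
    for item in match.group(1).split(","):
        parsed = ITEM.match(item)
        if parsed is None:
            raise ValueError(f"cannot parse select item: {item.strip()!r}")
        func_name, field, output = parsed.groups()
        func = AGGREGATES.get(func_name.lower())
        if func is None:
            raise ValueError(f"unknown function: {func_name}")
        steps.append(FeatureStep(output, func, field))
    return steps


def run(steps: list, rows: list) -> dict:
    """Apply the compiled plan to in-memory rows, skipping missing values."""
    result = {}
    for step in steps:
        values = [r[step.field] for r in rows if r.get(step.field) is not None]
        result[step.output] = step.func(values) if values else None
    return result


plan = compile_features(
    "SELECT max(amount) AS amount_max, mean(amount) AS amount_mean FROM txn"
)
rows = [{"amount": 10.0}, {"amount": 30.0}, {"amount": None}]
features = run(plan, rows)
print(features)
```

The compiled `plan` is independent of the data, so in this sketch the same object could be applied to an offline batch or to the rows of a single online request.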
Although the present invention has been described in detail with respect to the above embodiments, it will be understood by those skilled in the art that modifications or improvements based on the disclosure of the present invention may be made without departing from the spirit and scope of the invention, and these modifications and improvements are within the spirit and scope of the invention.

Claims (7)

1. A method for deriving machine learning features in real-time from structured data, the method comprising:
defining a computer language comprising a plurality of computing functions for feature processing in machine learning;
developing feature computation logic using the computer language;
generating executable program code according to the feature computation logic; and
executing the program code to apply the feature computation logic to corresponding raw data to derive machine learning features.
2. The method for deriving machine learning features from structured data in real time as claimed in claim 1 wherein the computer language is SQL.
3. The method of deriving machine learning features from structured data in real time as claimed in claim 2 wherein executing the program code to apply the feature computation logic to corresponding raw data comprises:
loading the original data into a computer memory according to a predefined data format;
and preprocessing the original data.
4. The method of deriving machine learning features from structured data in real time as claimed in claim 3 wherein the predefined data format comprises: field names, field types, and missing/outlier processing logic for the data.
5. The method for deriving machine learning features from structured data in real time as claimed in claim 4 wherein the preprocessing comprises a plurality of processing methods, the processing methods being abstracted as configurable options.
6. The method of deriving machine learning features in real time from structured data according to claim 5, wherein the feature computation logic comprises an extension function that processes the raw data first and an aggregation function that processes the raw data and/or the output of the extension function later.
7. The method of deriving machine learning features from structured data in real time as claimed in claim 6, wherein generating the executable program code from the feature computation logic uses the parser of an existing database, yacc, or regular expressions combined with corresponding grammar parsing.
CN201911393160.6A 2019-12-30 2019-12-30 Method for deriving machine learning characteristics from structured data in real time Pending CN111832740A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911393160.6A CN111832740A (en) 2019-12-30 2019-12-30 Method for deriving machine learning characteristics from structured data in real time

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911393160.6A CN111832740A (en) 2019-12-30 2019-12-30 Method for deriving machine learning characteristics from structured data in real time

Publications (1)

Publication Number Publication Date
CN111832740A true CN111832740A (en) 2020-10-27

Family

ID=72912757

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911393160.6A Pending CN111832740A (en) 2019-12-30 2019-12-30 Method for deriving machine learning characteristics from structured data in real time

Country Status (1)

Country Link
CN (1) CN111832740A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112364085A (en) * 2020-11-20 2021-02-12 浙江百应科技有限公司 Feature extraction and calculation method based on MapReduce thought
CN114064976A (en) * 2021-10-20 2022-02-18 同盾科技有限公司 Data feature calculation method, system, electronic device and storage medium

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108090516A (en) * 2017-12-27 2018-05-29 第四范式(北京)技术有限公司 Automatically generate the method and system of the feature of machine learning sample
CN109284833A (en) * 2018-08-22 2019-01-29 中国平安人寿保险股份有限公司 Method, equipment and the storage medium of characteristic are obtained for machine learning model
CN109408591A (en) * 2018-10-12 2019-03-01 北京聚云位智信息科技有限公司 Support the AI of SQL driving and the decision type distributed data base system of Feature Engineering
US20190155941A1 (en) * 2017-11-21 2019-05-23 International Business Machines Corporation Generating asset level classifications using machine learning
CN110046169A (en) * 2019-03-12 2019-07-23 阿里巴巴集团控股有限公司 Calculating based on structured query language sentence services implementation
CN110096513A (en) * 2019-04-10 2019-08-06 阿里巴巴集团控股有限公司 A kind of data query, fund checking method and device

Similar Documents

Publication Publication Date Title
Damas et al. Generating annotated behavior models from end-user scenarios
CN109254776B (en) Multi-language code compiling method and compiler
CN109614432B (en) System and method for acquiring data blood relationship based on syntactic analysis
CA2548334A1 (en) An apparatus for migration and conversion of software code from any source platform to any target platform
CN108255837B (en) SQL parser and method
CN108664241B (en) Method for carrying out simulation verification on SysML model
Søgaard-Andersen et al. Computer-assisted simulation proofs
CN111832740A (en) Method for deriving machine learning characteristics from structured data in real time
CN108037913B (en) Method for converting xUML4MC model into MSVL (modeling, simulation and verification language) program and computer-readable storage medium
CN112306479A (en) Code visualization analysis method and device based on abstract syntax
CN111258564B (en) Method and device for automatically generating codes based on QT
Hoffmann et al. Cloning and expanding graph transformation rules for refactoring
Sergievskiy Description logic application for UML class diagrams optimization
Lano et al. Rigorous development in UML
CN115469860B (en) Method and system for automatically generating demand-to-software field model based on instruction set
Kim et al. An integrated framework with UML and Object-Z for developing a precise and understandable specification: the light control case study
CN115935943A (en) Analysis framework supporting natural language structure calculation
CN113448852A (en) Test case obtaining method and device, electronic equipment and storage medium
Alouini et al. Semi-automatic generation of transformation rules in model driven engineering: the challenge and first steps
CN111752980A (en) Law enforcement supervision intelligent early warning system and method
CN114064601A (en) Storage process conversion method, device, equipment and storage medium
Gruer et al. Towards verification of multi-agent systems
Oren et al. Design of SEMA: A Software System for Computer-Assisted Modelling and Simulation of Sequential Machines
De Meester High quality schema and data transformations for linked data generation
Mavridou Formalizing and Analyzing Requirements with FRET

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination