CN111832740A - Method for deriving machine learning characteristics from structured data in real time - Google Patents
- Publication number
- CN111832740A (application CN201911393160.6A)
- Authority
- CN
- China
- Prior art keywords
- machine learning
- data
- feature
- real time
- structured data
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N20/00—Machine learning
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/24—Querying
- G06F16/242—Query formulation
- G06F16/2433—Query languages
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/25—Integrating or interfacing systems involving database management systems
- G06F16/252—Integrating or interfacing systems involving database management systems between a Database Management System and a front-end application
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Databases & Information Systems (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- Data Mining & Analysis (AREA)
- General Physics & Mathematics (AREA)
- Mathematical Physics (AREA)
- Software Systems (AREA)
- Evolutionary Computation (AREA)
- Computing Systems (AREA)
- Medical Informatics (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Artificial Intelligence (AREA)
- Computational Linguistics (AREA)
- Machine Translation (AREA)
Abstract
The invention discloses a method for deriving machine learning features from structured data in real time, comprising the following steps: defining a computer language comprising a plurality of computing functions for feature processing in machine learning; developing feature computation logic using the computer language; generating executable program code according to the feature computation logic; and executing the program code to apply the feature computation logic to corresponding raw data to derive machine learning features. The invention greatly reduces the complexity and development cycle of feature development, so that modelers with a data analysis background can flexibly and conveniently generate the required machine learning features.
Description
Technical Field
The invention belongs to the field of machine learning, and in particular relates to a method for deriving features for machine learning from structured data in real time.
Background
Machine learning is an artificial intelligence method that uses probability and statistics to obtain rules (generally called a model) from data and uses those rules to reason about unseen data. Machine learning has a wide range of applications, including finance, medicine, and social and public affairs.
When solving a practical problem with machine learning, feature processing (also called feature computation) is a very important step. Commonly used machine learning methods such as xgboost, gbdt, and lightgbm generally do not use the raw data (generally referred to as training data) directly, but first process the data into features. Features are generally statistical derivations of the raw data or cross derivations of several raw data fields, so that the information in the raw data is fully mined and the model produced by machine learning performs as well as possible.
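To make the distinction between statistical and cross derivations concrete, here is a minimal sketch in Python. The records and field names (`amount`, `balance`) are invented for illustration and do not appear in the patent.

```python
from statistics import mean, pstdev

# Hypothetical raw records: one row per transaction for a single customer.
transactions = [
    {"amount": 120.0, "balance": 1000.0},
    {"amount": 80.0, "balance": 920.0},
    {"amount": 200.0, "balance": 720.0},
]

amounts = [t["amount"] for t in transactions]

features = {
    # Statistical derivations of a single raw field.
    "amount_mean": mean(amounts),
    "amount_max": max(amounts),
    "amount_std": pstdev(amounts),
    # Cross derivation that combines two raw fields.
    "max_amount_to_balance": max(t["amount"] / t["balance"] for t in transactions),
}
```

The model would then be trained on `features` rather than on the individual transaction rows.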
There is no fixed method for deriving features, so in actual production features are currently mostly processed with hand-written software code, and new code is written whenever a new feature is needed. The written code is then abstracted into a relatively more general module that can be reused many times through a configuration file, which reduces the code-writing work.
The current approach has two problems. First, feature derivation involves a large amount of development work; most machine learning model developers have a statistics or mathematics background, so additional development engineers are needed to assist, and the development cycle is long. Second, after a machine learning model has been developed it must be put into production and brought online as part of an online decision-making system, and the feature processing work done during model development cannot be reused there. In the current environment, where businesses across industries are developing rapidly, models must be optimized and updated ever more frequently, and the original approach finds it increasingly difficult to meet business requirements.
Disclosure of Invention
The invention provides a method for deriving machine learning features from structured data in real time. Some embodiments of the invention combine the traditional SQL query method with feature processing for machine learning, lower the barrier to use, apply the same logic in offline and online processing, shorten the development cycle, and reduce the workload of pre-launch testing.
To achieve this purpose, the invention adopts the following technical solution:
A method of deriving machine learning features in real time from structured data, the method comprising: defining a computer language comprising a plurality of computing functions for feature processing in machine learning; developing feature computation logic using the computer language; generating executable program code according to the feature computation logic; and executing the program code to apply the feature computation logic to corresponding raw data to derive machine learning features.
Preferably, the computer language is SQL. Although the language could technically be based on others, SQL is the most familiar to model developers and analysts.
Preferably, executing the program code to apply the feature computation logic to corresponding raw data comprises: loading the raw data into computer memory according to a predefined data format; and preprocessing the raw data.
Preferably, the predefined data format comprises: field names, field types, and missing-value/outlier processing logic for the data.
Preferably, the preprocessing includes a plurality of processing methods, which are abstracted into configurable options.
Preferably, the feature computation logic comprises an extension function, which processes the raw data first, and an aggregation function, which subsequently processes the raw data and/or the output of the extension function.
Preferably, the executable program code is generated from the feature computation logic using the parser of an existing database, or yacc, or regular expressions combined with corresponding grammar parsing.
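The predefined data format and the configurable preprocessing options described above might be expressed roughly as follows. This is only a sketch under assumptions: `FieldSpec`, `load_record`, the missing-value tokens, and the example fields are all hypothetical names chosen for illustration, not structures defined by the patent.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class FieldSpec:
    """Hypothetical per-field format definition: name, type, and cleaning options."""
    name: str
    cast: Callable[[str], object]             # field type, expressed as a converter
    missing_tokens: tuple = ("", "NULL", "N/A")
    valid_range: Optional[tuple] = None       # inclusive (low, high), if any
    default: object = None                    # value used for missing values and outliers


def load_record(raw: dict, schema: list) -> dict:
    """Load one raw record according to the schema and apply the configured cleaning."""
    out = {}
    for spec in schema:
        value = raw.get(spec.name)
        # Missing-value handling is driven by configuration, not hand-written code.
        if value is None or str(value).strip() in spec.missing_tokens:
            out[spec.name] = spec.default
            continue
        try:
            parsed = spec.cast(value)
        except (TypeError, ValueError):
            out[spec.name] = spec.default
            continue
        # Outlier handling: values outside the configured range are replaced.
        if spec.valid_range is not None:
            low, high = spec.valid_range
            if not (low <= parsed <= high):
                parsed = spec.default
        out[spec.name] = parsed
    return out


SCHEMA = [
    FieldSpec("age", int, valid_range=(0, 120)),
    FieldSpec("income", float, default=0.0),
]

clean = load_record({"age": "35", "income": "5200.5"}, SCHEMA)
dirty = load_record({"age": "999", "income": "N/A"}, SCHEMA)
```

Because the cleaning rules live in `SCHEMA` rather than in code, changing how a field is cleaned means editing configuration only.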
Compared with the prior art, the invention has the beneficial effects that:
1. The complexity and development cycle of feature development are greatly reduced, and modelers with a data analysis background can flexibly and conveniently generate the required machine learning features;
2. The method does not depend on external systems or technologies, uses exactly the same operations and steps in both the offline batch and online real-time computation scenarios, allows offline computation logic to be reused in the production environment, and shortens the cycle for bringing a model online.
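As a rough illustration of the second benefit, the sketch below routes both an offline batch path and an online per-request path through one shared feature function. The names `compute_features`, `offline_batch`, `online_request`, `user_id`, and `amount` are hypothetical and are not taken from the patent.

```python
from collections import defaultdict


def compute_features(rows):
    """Single definition of the feature logic, shared by both scenarios."""
    amounts = [r["amount"] for r in rows]
    return {
        "txn_count": len(amounts),
        "amount_sum": sum(amounts),
        "amount_max": max(amounts, default=0.0),
    }


def offline_batch(all_rows):
    """Offline: group the full history by user and compute features for everyone."""
    grouped = defaultdict(list)
    for row in all_rows:
        grouped[row["user_id"]].append(row)
    return {uid: compute_features(rows) for uid, rows in grouped.items()}


def online_request(user_rows):
    """Online: compute features for the one user being scored right now."""
    return compute_features(user_rows)


history = [
    {"user_id": "u1", "amount": 10.0},
    {"user_id": "u2", "amount": 7.5},
    {"user_id": "u1", "amount": 30.0},
]

batch_result = offline_batch(history)
online_result = online_request([r for r in history if r["user_id"] == "u1"])
```

Since both paths call the same function, the training-time and serving-time values for a given user cannot drift apart through reimplementation.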
Drawings
In order to illustrate the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings used in the description of the embodiments or the prior art are briefly described below. Obviously, the drawings in the following description show only some embodiments of the present invention, and those skilled in the art can obtain other drawings from them without creative effort.
FIG. 1 is a diagram illustrating a class of computation functions of a feature query language according to an embodiment of the present invention.
FIG. 2 is a flow chart illustrating the operation of the process according to an embodiment of the present invention.
FIG. 3 is a schematic diagram of the overall architecture and external interaction of the embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be obtained by a person skilled in the art without inventive effort based on the embodiments of the present invention, are within the scope of the present invention.
In the description of the present invention, it is to be understood that the terms "longitudinal", "lateral", "upper", "lower", "front", "rear", "left", "right", "vertical", "horizontal", "top", "bottom", "inner", "outer", and the like, indicate orientations or positional relationships based on those shown in the drawings, and are used merely for convenience of description and for simplicity of description, and do not indicate or imply that the referenced devices or elements must have a particular orientation, be constructed in a particular orientation, and be operated, and thus, are not to be construed as limiting the present invention.
Referring to fig. 1-3, in one embodiment, the method comprises the following steps:
1. Based on the feature processing methods commonly used in machine learning, a set of computation subunits, namely computation functions, is abstracted and summarized. These include statistics such as quantiles, and data cleaning and conversion such as date conversion. The functions are added to the standard SQL specification to form an extended SQL language, which is the basis for defining the feature computation logic.
2. The data to be processed is analyzed, and the field names, field types, and missing-value/outlier processing logic of the data are specified. In common machine learning practice this logic is handled with hand-written code; here it is abstracted into configurable options and applied uniformly before the features are computed. Typical operations include setting certain values to null and replacing values that fall outside a range justified by the business.
3. Feature computation logic is developed using the feature definition SQL. The computation logic applies computation functions to fields of the raw data to derive features. Two kinds of processing may be involved. The first is preprocessing: some computation functions convert and process the raw data, for example simple arithmetic operations or cleaning of date formats; their results are not final results, and such functions are called extension functions. The second is aggregation: the result of an aggregation function is the final feature value, namely the variable used by machine learning. The input of an aggregation function may be the raw input data or the output of a preprocessing function. The computation is divided into two stages for two reasons: some complex operations cannot be completed in a single pass, and some operations share common preprocessing steps, so that multiple aggregation functions can share the output of the same preprocessing function.
4. The feature definition SQL is processed with a syntax parser to generate an executable data structure. The parser of an existing database may be used here, as may tools such as yacc, or regular expressions combined with corresponding grammar parsing. During parsing, the extended functions defined above must be added to the parser. The SQL query is ultimately compiled into an executable program code structure.
5. The raw data is loaded into memory according to the format defined in step 2, and missing values and outliers are processed.
6. The feature derivation logic is applied to the data produced in the previous step: the preprocessing functions are run first, then the aggregation functions, and the result data is filtered if the SQL contains corresponding definitions. This step involves the concrete implementation of each function, and the implementation logic must be identical to the logic used in the machine learning method. Finally, feature variables that can be used to train the machine learning model are obtained.
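The six steps above can be sketched with Python's built-in `sqlite3` module, which here stands in for "the parser of an existing database". SQLite is an assumption made for this illustration and is not named in the patent. Likewise, the function names `days_since` and `quantile`, the `txn` table, and its columns are hypothetical.

```python
import sqlite3
from datetime import date


def days_since(iso_date, ref_iso):
    """Extension (row-level) function: days from iso_date to ref_iso."""
    if iso_date is None:
        return None
    return (date.fromisoformat(ref_iso) - date.fromisoformat(iso_date)).days


class Quantile:
    """Aggregation function: linearly interpolated quantile of non-null values."""

    def __init__(self):
        self.values = []
        self.q = 0.5

    def step(self, value, q):
        self.q = q
        if value is not None:
            self.values.append(value)

    def finalize(self):
        if not self.values:
            return None
        data = sorted(self.values)
        pos = self.q * (len(data) - 1)
        lo = int(pos)
        hi = min(lo + 1, len(data) - 1)
        return data[lo] + (data[hi] - data[lo]) * (pos - lo)


# Step 1: extend the SQL dialect with the custom functions.
conn = sqlite3.connect(":memory:")
conn.create_function("days_since", 2, days_since)
conn.create_aggregate("quantile", 2, Quantile)

# Steps 2 and 5: declare the data format and load the raw data into memory.
conn.execute("CREATE TABLE txn (user_id TEXT, amount REAL, txn_date TEXT)")
conn.executemany(
    "INSERT INTO txn VALUES (?, ?, ?)",
    [
        ("u1", 10.0, "2019-12-01"),
        ("u1", 30.0, "2019-12-11"),
        ("u1", 20.0, "2019-12-21"),
        ("u2", 5.0, "2019-12-29"),
    ],
)

# Step 3: feature computation logic written in the extended SQL.
FEATURE_SQL = """
SELECT user_id,
       quantile(amount, 0.5) AS amount_median,
       min(days_since(txn_date, '2019-12-30')) AS days_since_last_txn
FROM txn
GROUP BY user_id
ORDER BY user_id
"""

# Steps 4 and 6: the database parses the SQL and runs extension, then aggregation.
features = conn.execute(FEATURE_SQL).fetchall()
```

In this sketch `days_since` plays the role of an extension function whose output feeds the built-in `min` aggregate, while `quantile` aggregates a raw field directly, matching the two input paths described in step 3.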
In another embodiment, the method comprises the following steps:
1. An SQL-based feature definition language is defined. Its basic format is the same as standard ANSI SQL, but its function set is extended: in addition to traditional statistical functions such as min/max/mean/sum/std/quantile, it includes functions commonly used to process data in machine learning, such as similarity, which computes the similarity of two texts, and functions that compute the trend of transaction flow data. This step establishes the initial specification and is a one-time operation: once defined, the specification is followed without being redefined each time, although it may later be extended as the business develops, most typically by adding new computation functions.
2. The format of the raw data is defined, including the field names, field types, missing-value processing logic, and value ranges of the raw data used for feature computation.
3. Using the feature definition language from step 1, feature computation logic is defined according to the business data. Writing this logic requires only a background in SQL; the barrier is very low, since data processing and analysis personnel generally have this knowledge and can write the logic without special training.
4. The feature definition is parsed, word by word, into an executable data structure. The data structure specifies which pre-operations must be performed on the data before computation (for example, whether some values must be removed or whether a data type must be converted), which logic must be computed, which fields are involved, and whether the computed result must be converted. This step can be abstracted into a general-purpose lexical and syntactic parser, forming a reusable component. The steps in FIG. 2 from loading the feature definition SQL … … through assembling an executable data structure can be understood as building the environment that implements the extension functions and aggregation functions.
5. The raw data is loaded into a working area of computer memory according to the format definition in step 2.
6. The data loaded into memory is computed according to the definition produced in step 4 to obtain the final result.
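Steps 4 to 6 of this embodiment can be illustrated with a deliberately small regular-expression parser that turns a feature definition into an executable plan and then runs it over in-memory rows. This is a sketch under strong assumptions: it accepts only the restricted form `SELECT agg(ext(field)) AS name, ... FROM source`, and the function registries and all names in it are hypothetical. A production parser would need a full grammar, for example one generated by yacc.

```python
import re
from statistics import mean

# Hypothetical function registries; the patent does not specify these names.
EXTENSION_FUNCS = {"abs": abs, "identity": lambda v: v}
AGGREGATE_FUNCS = {"min": min, "max": max, "sum": sum, "mean": mean}

SELECT_RE = re.compile(
    r"^\s*SELECT\s+(?P<items>.+?)\s+FROM\s+(?P<table>\w+)\s*$",
    re.IGNORECASE | re.DOTALL,
)
ITEM_RE = re.compile(
    r"^\s*(?P<agg>\w+)\(\s*"
    r"(?:(?P<ext>\w+)\(\s*(?P<inner>\w+)\s*\)|(?P<field>\w+))"
    r"\s*\)\s+AS\s+(?P<name>\w+)\s*$",
    re.IGNORECASE,
)


def parse_feature_sql(sql):
    """Parse the restricted feature definition into a list of executable steps."""
    match = SELECT_RE.match(sql)
    if match is None:
        raise ValueError("unsupported feature definition")
    plan = []
    # Splitting on commas is safe only because this grammar has no multi-argument calls.
    for item in match.group("items").split(","):
        im = ITEM_RE.match(item)
        if im is None:
            raise ValueError(f"cannot parse select item: {item.strip()!r}")
        plan.append({
            "name": im.group("name"),
            "field": im.group("inner") or im.group("field"),
            "extension": EXTENSION_FUNCS[(im.group("ext") or "identity").lower()],
            "aggregate": AGGREGATE_FUNCS[im.group("agg").lower()],
        })
    return plan


def execute_plan(plan, rows):
    """Run each step: extension function per row first, then the aggregation."""
    result = {}
    for step in plan:
        staged = [step["extension"](row[step["field"]]) for row in rows]
        result[step["name"]] = step["aggregate"](staged)
    return result


plan = parse_feature_sql(
    "SELECT max(abs(delta)) AS max_abs_delta, sum(delta) AS net_delta FROM flow"
)
rows = [{"delta": -5.0}, {"delta": 3.0}, {"delta": 1.0}]
result = execute_plan(plan, rows)
```

The `plan` list is the "executable data structure" of step 4: it records which field each feature reads, which pre-operation runs first, and which aggregation produces the final value.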
Although the present invention has been described in detail with respect to the above embodiments, it will be understood by those skilled in the art that modifications or improvements based on the disclosure of the present invention may be made without departing from the spirit and scope of the invention, and these modifications and improvements are within the spirit and scope of the invention.
Claims (7)
1. A method for deriving machine learning features in real-time from structured data, the method comprising:
defining a computer language comprising a plurality of computing functions for feature processing in machine learning;
developing feature computation logic using the computer language;
generating executable program code according to the feature calculation logic;
executing the program code to apply the feature computation logic to corresponding raw data to derive machine learning features.
2. The method for deriving machine learning features from structured data in real time as claimed in claim 1 wherein the computer language is SQL.
3. The method of deriving machine learning features from structured data in real time as claimed in claim 2 wherein executing the program code to apply the feature computation logic to corresponding raw data comprises:
loading the original data into a computer memory according to a predefined data format;
and preprocessing the original data.
4. The method of deriving machine learning features from structured data in real time as claimed in claim 3 wherein the predefined data format comprises: field names, field types, and missing/outlier processing logic for the data.
5. The method for deriving machine learning features from structured data in real time as claimed in claim 4 wherein the preprocessing comprises a plurality of processing methods, the processing methods being abstracted as configurable options.
6. The method of deriving machine learning features in real time from structured data according to claim 5, wherein the feature computation logic comprises an extension function that processes the raw data first and an aggregation function that processes the raw data and/or the output of the extension function later.
7. The method of deriving machine learning features from structured data in real time as claimed in claim 6, wherein generating executable program code from the feature computation logic uses the parser of an existing database, or yacc, or regular expressions combined with corresponding grammar parsing.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201911393160.6A CN111832740A (en) | 2019-12-30 | 2019-12-30 | Method for deriving machine learning characteristics from structured data in real time |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201911393160.6A CN111832740A (en) | 2019-12-30 | 2019-12-30 | Method for deriving machine learning characteristics from structured data in real time |
Publications (1)
Publication Number | Publication Date |
---|---|
CN111832740A true CN111832740A (en) | 2020-10-27 |
Family
ID=72912757
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201911393160.6A Pending CN111832740A (en) | 2019-12-30 | 2019-12-30 | Method for deriving machine learning characteristics from structured data in real time |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111832740A (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112364085A (en) * | 2020-11-20 | 2021-02-12 | 浙江百应科技有限公司 | Feature extraction and calculation method based on MapReduce thought |
CN114064976A (en) * | 2021-10-20 | 2022-02-18 | 同盾科技有限公司 | Data feature calculation method, system, electronic device and storage medium |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108090516A (en) * | 2017-12-27 | 2018-05-29 | 第四范式(北京)技术有限公司 | Automatically generate the method and system of the feature of machine learning sample |
CN109284833A (en) * | 2018-08-22 | 2019-01-29 | 中国平安人寿保险股份有限公司 | Method, equipment and the storage medium of characteristic are obtained for machine learning model |
CN109408591A (en) * | 2018-10-12 | 2019-03-01 | 北京聚云位智信息科技有限公司 | Support the AI of SQL driving and the decision type distributed data base system of Feature Engineering |
US20190155941A1 (en) * | 2017-11-21 | 2019-05-23 | International Business Machines Corporation | Generating asset level classifications using machine learning |
CN110046169A (en) * | 2019-03-12 | 2019-07-23 | 阿里巴巴集团控股有限公司 | Calculating based on structured query language sentence services implementation |
CN110096513A (en) * | 2019-04-10 | 2019-08-06 | 阿里巴巴集团控股有限公司 | A kind of data query, fund checking method and device |
-
2019
- 2019-12-30 CN CN201911393160.6A patent/CN111832740A/en active Pending
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Damas et al. | Generating annotated behavior models from end-user scenarios | |
CN109254776B (en) | Multi-language code compiling method and compiler | |
CN109614432B (en) | System and method for acquiring data blood relationship based on syntactic analysis | |
CA2548334A1 (en) | An apparatus for migration and conversion of software code from any source platform to any target platform | |
CN108255837B (en) | SQL parser and method | |
CN108664241B (en) | Method for carrying out simulation verification on SysML model | |
Søgaard-Andersen et al. | Computer-assisted simulation proofs | |
CN111832740A (en) | Method for deriving machine learning characteristics from structured data in real time | |
CN108037913B (en) | Method for converting xUML4MC model into MSVL (modeling, simulation and verification language) program and computer-readable storage medium | |
CN112306479A (en) | Code visualization analysis method and device based on abstract syntax | |
CN111258564B (en) | Method and device for automatically generating codes based on QT | |
Hoffmann et al. | Cloning and expanding graph transformation rules for refactoring | |
Sergievskiy | Description logic application for UML class diagrams optimization | |
Lano et al. | Rigorous development in UML | |
CN115469860B (en) | Method and system for automatically generating demand-to-software field model based on instruction set | |
Kim et al. | An integrated framework with UML and Object-Z for developing a precise and understandable specification: the light control case study | |
CN115935943A (en) | Analysis framework supporting natural language structure calculation | |
CN113448852A (en) | Test case obtaining method and device, electronic equipment and storage medium | |
Alouini et al. | Semi-automatic generation of transformation rules in model driven engineering: the challenge and first steps | |
CN111752980A (en) | Law enforcement supervision intelligent early warning system and method | |
CN114064601A (en) | Storage process conversion method, device, equipment and storage medium | |
Gruer et al. | Towards verification of multi-agent systems | |
Oren et al. | Design of SEMA: A Software System for Computer-Assisted Modelling and Simulation of Sequential Machines | |
De Meester | High quality schema and data transformations for linked data generation | |
Mavridou | Formalizing and Analyzing Requirements with FRET |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||