WO2020038376A1 - Method and system for uniformly performing feature extraction - Google Patents

Method and system for uniformly performing feature extraction

Info

Publication number
WO2020038376A1
Authority
WO
WIPO (PCT)
Prior art keywords
feature extraction
processing logic
execution plan
cluster
scenario
Prior art date
Application number
PCT/CN2019/101649
Other languages
English (en)
French (fr)
Inventor
黄亚建
王太泽
邓龙
范晓亮
刘晨璐
刘永超
孙迪
Original Assignee
第四范式(北京)技术有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 第四范式(北京)技术有限公司
Priority to US17/270,248 (published as US20210326761A1)
Priority to EP19852643.6A (published as EP3842940A4)
Publication of WO2020038376A1

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00Machine learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/44Arrangements for executing specific programs
    • G06F9/451Execution arrangements for user interfaces
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/28Databases characterised by their database models, e.g. relational or object models
    • G06F16/284Relational databases
    • G06F16/285Clustering or classification

Definitions

  • the present disclosure relates generally to the field of data processing, and more particularly, to a method and system for uniformly performing feature extraction.
  • Machine learning is an inevitable product of the development of artificial intelligence research to a certain stage; it is devoted to improving the performance of a system itself by computational means, using experience.
  • In computer systems, "experience" usually exists in the form of "data".
  • Models can be generated from data by machine learning algorithms; that is, by providing empirical data to a machine learning algorithm, a model can be generated based on the empirical data, and, when a new situation is faced, the trained model is used to obtain a corresponding prediction result. Whether in the stage of training a machine learning model or in the stage of performing estimation with a trained machine learning model, feature extraction needs to be performed on the data to obtain machine learning samples that include various features.
  • Current machine learning platforms or systems mainly implement the function of training machine learning models, that is, the process of performing operations such as feature extraction, model building and model tuning on collected large-scale data. This stage does not emphasize response speed but emphasizes throughput, that is, the amount of data processed per unit time. If a trained machine learning model needs to be used for estimation, response speed usually matters rather than throughput, which forces technicians to carry out additional development for the estimation stage, especially for the feature extraction process, resulting in a high cost of implementing estimation.
  • Exemplary embodiments of the present disclosure aim to provide a method and system for uniformly performing feature extraction, which can perform feature extraction uniformly in various feature extraction scenarios.
  • According to an exemplary embodiment, a method for uniformly performing feature extraction includes: obtaining a feature extraction script for defining processing logic related to feature extraction; parsing the feature extraction script to generate an execution plan for performing feature extraction; and executing the generated execution plan by a local machine or a cluster based on the feature extraction scenario.
  • According to another exemplary embodiment, a system for uniformly performing feature extraction includes: a script acquisition device that acquires a feature extraction script for defining processing logic related to feature extraction; a plan generation device that parses the feature extraction script to generate an execution plan for performing feature extraction; and a plan execution device that executes the generated execution plan by a local machine or a cluster based on the feature extraction scenario.
  • According to another exemplary embodiment, a system includes at least one computing device and at least one storage device storing instructions, wherein the instructions, when run by the at least one computing device, cause the at least one computing device to perform the method of uniformly performing feature extraction as described above.
  • According to another exemplary embodiment, a computer-readable storage medium stores instructions which, when run by at least one computing device, cause the at least one computing device to perform the method of uniformly performing feature extraction as described above.
  • the method and system for uniformly performing feature extraction can support uniformly performing feature extraction in various feature extraction scenarios.
  • On one hand, the method and system are compatible with the online feature extraction scenario and the offline feature extraction scenario and realize seamless connection between the two, thereby eliminating the need to separately develop, for the same feature extraction script, the specific operating modes for the online and offline feature extraction scenarios, which reduces the workload of developers; on the other hand, feature extraction can be performed efficiently in a high-throughput manner in the offline feature extraction scenario.
  • In the online feature extraction scenario, feature extraction can be performed with high real-time performance and low latency.
  • it is compatible with time-series feature extraction and non-time-series feature extraction.
  • FIG. 1 illustrates a flowchart of a method for uniformly performing feature extraction according to an exemplary embodiment of the present disclosure
  • FIG. 2 illustrates an example of an execution plan according to an exemplary embodiment of the present disclosure
  • FIG. 3 illustrates a flowchart of a method for uniformly performing feature extraction according to another exemplary embodiment of the present disclosure
  • FIG. 4 illustrates a block diagram of a system for uniformly performing feature extraction according to an exemplary embodiment of the present disclosure.
  • FIG. 1 illustrates a flowchart of a method for uniformly performing feature extraction according to an exemplary embodiment of the present disclosure.
  • the method may be executed by a computer program, and may also be executed by a dedicated hardware device or a collection of software and hardware resources for performing machine learning, big data calculation, or data analysis.
  • the method may be executed by a machine learning platform for implementing a machine learning related business.
  • In step S10, a feature extraction script for defining processing logic related to feature extraction is obtained.
  • processing logic related to feature extraction may include any processing logic related to feature extraction.
  • processing logic related to feature extraction may include processing logic for obtaining features from a data table.
  • the data table may be an original data table or a data table obtained by processing the original data table (for example, splicing a plurality of original data tables).
  • the processing logic related to feature extraction may further include processing logic for performing data table splicing.
  • the processing logic for data table splicing may include processing logic for data table splicing for the source field of the feature.
  • the processing logic for performing data table splicing with respect to the source fields of features is: processing logic for splicing only the feature source fields in the data tables to be spliced to form a new data table.
  • each data record in the data table can be viewed as a description of an event or object, corresponding to an example or sample.
  • a data record includes attribute information, that is, fields, which reflect the performance or nature of the event or object in a certain aspect.
  • for example, one row of the data table corresponds to one data record, and one column of the data table corresponds to one field.
  • processing logic related to feature extraction may involve feature extraction under one or more time windows.
  • the time window can be used to filter out the one or more data records on which feature generation depends; when the time window is set to include only one data record, it can be used to generate non-time-series features, and when the time window is set to include multiple data records, it can be used to generate time-series features.
  • processing logic related to feature extraction may involve extracting one or more features under each time window.
  • the processing logic related to feature extraction may further include processing logic for feature summary.
  • the time window may be defined by a source data table, a division reference field, a time reference field, a time span, and/or a window size, as illustrated by the sketch below.
  • the source data table of the time window is the data table on which feature extraction under the time window is based;
  • the division reference field of the time window is the field by which the data records in the source data table are grouped (that is, sharded), for example, a user ID;
  • the time reference field of the time window is the field by which each group of data records is sorted (for example, the time when the user swipes a card);
  • the time span of the time window is the time range, corresponding to the time reference field, covered by the data records within the time window (for example, one week);
  • the window size of the time window is the number of data records within the time window, and the window size is an integer greater than zero. It should be understood that when defining a time window, either one of the time span and the window size may be set, or both may be set.
  • when feature extraction is performed under multiple time windows, the multiple time windows are different from each other; that is, at least one of the following items differs among the multiple time windows: source data table, division reference field, time reference field, time span, window size.
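  • As an illustration only, these defining elements can be captured in a small configuration object. The following Python sketch is a hypothetical rendering (the table and field names are invented), not the platform's actual data structure:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class TimeWindow:
    """Hypothetical rendering of a time window as defined above."""
    source_table: str                  # table on which extraction is based
    division_field: str                # grouping (sharding) field, e.g. user ID
    time_field: str                    # sorting field, e.g. card-swipe time
    time_span: Optional[str] = None    # e.g. "7d" for one week
    window_size: Optional[int] = None  # number of records; > 0 when set

    def __post_init__(self):
        # At least one of time span and window size must be set.
        if self.time_span is None and self.window_size is None:
            raise ValueError("set time_span, window_size, or both")

# A window over the last week of card transactions, grouped per user:
w1 = TimeWindow("transactions", "user_id", "swipe_time", time_span="7d")
# A window holding a single record, usable for non-time-series features:
w2 = TimeWindow("transactions", "user_id", "swipe_time", window_size=1)
```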
  • processing logic related to feature extraction may involve performing non-time series feature extraction in a time window with a window size of 1 and / or time series feature extraction in a time window with a window size other than 1.
  • Time series data is strongly sequential, and successive data records generally have relationships such as dependence and periodicity.
  • For example, transaction data may show strong correlation over time, so statistical results over transaction data can be used as sample features; therefore, features reflecting time-series behavior (for example, recent transaction habits, such as amounts) can be generated based on the time window.
  • It is generally necessary to specify the dimension of the time series data (that is, the division reference field of the time window), for example, whether the relevant features (e.g., time-series statistical features of the transaction amount) are extracted per natural person (e.g., per user ID) or per card number on which the transactions occurred; in addition, the range of the historical data involved (that is, the time span and/or window size of the time window) needs to be specified, for example, the transaction amounts within the last week.
  • The time window corresponding to time-series feature extraction can specify all the data records (including the current data record and/or historical data records) on which the feature currently to be extracted depends, and the feature can then be computed based on the relevant field values in these data records.
  • non-temporal feature extraction may be considered under a time window with a window size of 1, thereby enabling a unified time window setting to be compatible with extraction of both temporal and non-temporal features.
  • however, exemplary embodiments of the present disclosure may also perform non-time-series feature extraction without any time window.
  • the processing logic related to feature extraction may not involve a time window, that is, it is not necessary to set any time window for feature extraction.
  • when the processing logic related to feature extraction involves both non-time-series and time-series feature extraction, it may involve performing non-time-series feature extraction under a time window with a window size of 1 and time-series feature extraction under a time window with a window size other than 1.
  • a feature extraction script for defining processing logic related to feature extraction may be obtained directly from the outside.
  • the feature extraction script may be obtained based on a code input by a user through an input box for defining processing logic related to feature extraction and / or a configuration item configured by the user for defining processing logic related to feature extraction.
  • the method may be performed by a machine learning platform for performing a machine learning process, and the machine learning platform may provide the user with a graphical interface for configuring a feature extraction process (for example, an interface for configuring feature engineering) in response to a user operation.
  • the graphical interface may include input controls for defining the processing logic related to feature extraction; then, an input operation performed by the user on the input controls on the graphical interface may be received, and the feature extraction script defining the processing logic related to feature extraction may be obtained according to the input operation.
  • the input controls may include a content input box for inputting code and/or configuration items for defining the processing logic related to feature extraction, and/or selection controls for selecting among candidate configuration items of the processing logic related to feature extraction.
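  • To make the notion of a feature extraction script concrete, the sketch below shows what such a script might look like when expressed as a plain Python configuration. The patent does not fix a script syntax, so every key, table, field and feature name here is a hypothetical illustration:

```python
# Hypothetical feature extraction script rendered as a Python dict; all
# names are invented and the structure is not the platform's actual syntax.
feature_script = {
    # Processing logic for data table splicing (only feature source fields).
    "table_splices": [
        {"output": "t_win1", "tables": ["table1", "table2"], "key": "user_id"},
        {"output": "t_win2", "tables": ["table1", "table3"], "key": "user_id"},
    ],
    # Processing logic for acquiring features under time windows.
    "windows": [
        {"source_table": "t_win1", "division_field": "user_id",
         "time_field": "swipe_time", "time_span": "7d",
         "features": ["sum(amount)", "max(amount)"]},
        {"source_table": "t_win2", "division_field": "user_id",
         "time_field": "swipe_time", "window_size": 1,
         "features": ["discrete(merchant_type)"]},
    ],
    # Processing logic for feature summarization across all windows.
    "feature_summary": {"join_on": "user_id"},
}
```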
  • In step S20, the obtained feature extraction script is parsed to generate an execution plan for performing feature extraction.
  • As an example, the processing logic defined by the acquired feature extraction script can be divided in processing order to generate an execution plan for feature extraction. Since the feature extraction process needs to be performed in a certain processing order (for example, it needs to pass in turn through data table splicing, acquiring features from the data table, and summarizing the generated features), the processing logic defined by the acquired feature extraction script can be divided following the processing order of the feature extraction process.
  • For example, the processing logic defined by the acquired feature extraction script may be divided into a processing logic part for data table splicing, a processing logic part for acquiring features from the data table, and a processing logic part for feature summarization; then, based on each divided processing logic part, an execution plan for feature extraction can be generated.
  • when the processing logic defined by the acquired feature extraction script involves feature extraction under at least one time window, the corresponding processing logic may be divided in processing order for each time window respectively to generate an execution plan for feature extraction. That is, processing logic corresponding to different time windows is not divided into the same processing logic part.
  • for example, when the processing logic defined by the obtained feature extraction script involves feature extraction under multiple time windows, the processing logic corresponding to each time window may be divided according to the processing order of the feature extraction process.
  • that is, the processing logic defined by the obtained feature extraction script can be divided into: a processing logic part for data table splicing and a processing logic part for acquiring features from the data table for each time window respectively, and a processing logic part that summarizes the features generated for all time windows; then, based on each divided processing logic part, an execution plan for feature extraction can be generated.
  • the generated execution plan for feature extraction may be a directed acyclic graph (DAG graph) composed of nodes, where the nodes correspond to the segmented processing logic.
  • the node may include a computing node corresponding to processing logic for obtaining features from a data table.
  • the nodes may further include a table splicing node corresponding to the processing logic for data table splicing and/or a feature splicing node corresponding to the processing logic for feature summarization.
  • the processing logic for acquiring features from the data table for different time windows may correspond to different computing nodes, and the processing logic for splicing different data tables may correspond to different table splicing nodes. It should be understood that the connection relationship between the nodes corresponding to the divided processing logic parts may be determined based on the relationship between the input variables and/or output variables of the divided processing logic parts.
  • FIG. 2 illustrates an example of an execution plan according to an exemplary embodiment of the present disclosure.
  • as shown in FIG. 2, the processing logic defined by the obtained feature extraction script can be divided into: the processing logic part for data table splicing for time window 1 (for example, for splicing data table 1 and data table 2 to obtain the source data table of time window 1), the processing logic part for acquiring features from the data table (that is, performing feature extraction) for time window 1, the processing logic part for data table splicing for time window 2 (for example, for splicing data table 1 and data table 3 to obtain the source data table of time window 2), the processing logic part for acquiring features from the data table for time window 2, and the processing logic part for performing feature summarization on the features acquired based on time window 1 and the features acquired based on time window 2.
  • then, a directed acyclic graph composed of nodes as shown in FIG. 2 (that is, an execution plan for feature extraction) can be generated based on the divided processing logic parts, in which table splicing node 1 corresponds to the processing logic part for data table splicing for time window 1, computing node 1 corresponds to the processing logic part for acquiring features from the data table for time window 1, table splicing node 2 corresponds to the processing logic part for data table splicing for time window 2, computing node 2 corresponds to the processing logic part for acquiring features from the data table for time window 2, and the feature splicing node corresponds to the processing logic part for summarizing the features acquired based on time window 1 and time window 2 (a code sketch of this DAG follows below).
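  • The DAG of FIG. 2 can be reproduced in a few lines of Python. The node payloads below are a minimal sketch (not the patent's implementation); the edges follow the input/output relationships just described, and the nodes are executed in topological order:

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Stub payloads standing in for the three kinds of processing logic.
def splice(*tables):   return "splice(" + ",".join(tables) + ")"
def compute(table):    return "features(" + table + ")"
def summarize(*feats): return "summary(" + ",".join(feats) + ")"

# FIG. 2 as a graph: each node maps to the set of nodes it depends on.
dag = {
    "table_splice_1": set(),                 # data table 1 + data table 2
    "table_splice_2": set(),                 # data table 1 + data table 3
    "compute_1": {"table_splice_1"},         # features under time window 1
    "compute_2": {"table_splice_2"},         # features under time window 2
    "feature_splice": {"compute_1", "compute_2"},  # feature summarization
}

results = {}
for node in TopologicalSorter(dag).static_order():
    if node == "table_splice_1":
        results[node] = splice("table1", "table2")
    elif node == "table_splice_2":
        results[node] = splice("table1", "table3")
    elif node in ("compute_1", "compute_2"):
        dep = next(iter(dag[node]))          # the single upstream splice
        results[node] = compute(results[dep])
    else:
        results[node] = summarize(results["compute_1"], results["compute_2"])

print(results["feature_splice"])
```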
  • In step S30, based on the feature extraction scenario, the generated execution plan is executed by the local machine or a cluster.
  • the feature extraction scenario may be an online feature extraction scenario or an offline feature extraction scenario.
  • when the generated execution plan is a directed acyclic graph composed of nodes, the generated execution plan can be executed, based on the feature extraction scenario, by implementing the processing logic corresponding to each node through the local machine or the cluster according to the connection relationships between the nodes in the directed acyclic graph.
  • implementing the processing logic corresponding to the computing node through the local machine or the cluster may include: directly running the computing node through the local machine or the cluster.
  • implementing the processing logic corresponding to a computing node through the local machine or a cluster may include: compiling, through the local machine or the cluster, the processing logic corresponding to the computing node into at least one executable file, and running the at least one executable file.
  • preferably, corresponding optimization may be performed when compiling into an executable file.
  • a common sub-expression in the processing logic may be replaced with an intermediate variable.
  • in the process of compiling the processing logic corresponding to a computing node into executable files, parts of the processing logic that are closely related in computation and independent of other processing logic may be compiled into the same executable file.
  • for example, a part of the processing logic that is closely related and independent of other processing logic may be a part that uses the same common sub-expression and is logically unrelated to other processing logic; in this way, that part of the processing logic can share intermediate variables, and, since intermediate variables are not shared between different executable files, different executable files can be executed in parallel.
  • thus, the JIT (Just-In-Time) capability of the compiler can be reused, the execution efficiency of the code in the compiled executable files can be improved, and logical isolation can be prepared for concurrent execution of the feature extraction process so that multiple executable files can be executed in parallel.
  • for each computing node, the processing logic corresponding to that computing node can be compiled into at least one executable file.
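  • The description illustrates the common sub-expression optimization with the expressions f1=discrete(max(col1)) and f2=continous(max(col1)): the shared sub-expression max(col1) is computed once into an intermediate variable that both features reuse. A minimal Python sketch of that rewrite follows; the discrete and continous operators are stand-ins, not a real API:

```python
# Sketch of common sub-expression elimination for the example from the
# description: f1=discrete(max(col1)), f2=continous(max(col1)).
def discrete(x):  return "bucket_" + str(int(x) // 100)  # stand-in operator
def continous(x): return float(x)                        # stand-in operator

def extract_features_naive(col1):
    # max(col1) is evaluated twice, once per feature.
    return discrete(max(col1)), continous(max(col1))

def extract_features_optimized(col1):
    a = max(col1)                     # intermediate variable, computed once
    return discrete(a), continous(a)  # f1 and f2 reuse the result in a

col1 = [120, 430, 250]
assert extract_features_naive(col1) == extract_features_optimized(col1)
```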
  • FIG. 3 shows a flowchart of a method for uniformly performing feature extraction according to another exemplary embodiment of the present disclosure.
  • step S30 specifically includes steps S301, S302, and S303.
  • for steps S10 and S20, reference may be made to the specific implementations described with reference to FIG. 1 and FIG. 2, and details are not described here again.
  • In step S301, the feature extraction scenario is determined.
  • as an example, a feature extraction scenario specified by the user may be obtained.
  • for example, the method may be performed by a machine learning platform for performing a machine learning process.
  • the machine learning platform may provide the user with a graphical interface for specifying the feature extraction scenario, and the feature extraction scenario specified by the user may be obtained according to an input operation performed by the user through the graphical interface.
  • as another example, the feature extraction scenario may be determined automatically (a dispatch sketch follows below).
  • for example, when the current machine learning scenario is one of training a machine learning model, the feature extraction scenario can be automatically determined to be the offline feature extraction scenario; when the current machine learning scenario is one of performing estimation using a trained machine learning model, the feature extraction scenario can be automatically determined to be the online feature extraction scenario.
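  • A sketch of this automatic determination and the resulting dispatch is shown below, assuming hypothetical run_local and run_on_cluster entry points for the two execution modes described next:

```python
# Hypothetical dispatch: the stage of the ML workflow determines the
# scenario, which in turn selects local or cluster execution.
def determine_scenario(ml_stage: str) -> str:
    # training -> offline scenario; estimation -> online scenario
    return "offline" if ml_stage == "training" else "online"

def execute(plan, ml_stage: str, cluster=None):
    scenario = determine_scenario(ml_stage)
    if scenario == "online":
        return run_local(plan)            # single-machine mode (step S302)
    return run_on_cluster(plan, cluster)  # distributed mode (step S303)

def run_local(plan):
    # Placeholder: execute the plan nodes on the local machine, e.g.
    # against an in-memory database.
    ...

def run_on_cluster(plan, cluster):
    # Placeholder: distribute the plan to the computing devices in the
    # cluster and collect the results.
    ...
```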
  • when it is determined in step S301 that the feature extraction scenario is the online feature extraction scenario, the generated execution plan is executed by the local machine in single-machine mode (step S302).
  • as an example, the generated execution plan can be executed in single-machine mode through a local in-memory database.
  • for example, the processing logic for data table splicing and/or the processing logic for feature summarization may be implemented through the local in-memory database.
  • when it is determined in step S301 that the feature extraction scenario is the offline feature extraction scenario, step S303 is executed to execute the generated execution plan in distributed mode through the cluster.
  • the generated execution plan may be executed in parallel by a plurality of computing devices in the cluster.
  • the computing device referred to here may indicate either a physical entity or a virtual entity.
  • the computing device may indicate an actual computing machine or a logical entity deployed on the computing machine.
  • the parallel execution framework Spark may be used to execute the generated execution plan in a distributed mode through a cluster.
  • processing logic such as data table splicing and processing logic for feature summary can be implemented through the underlying interface of Spark.
  • the generated execution plan for feature extraction may be distributed to each computing device in the cluster based on Spark, so that each computing device executes the generated execution plan based on data stored therein and returns an execution result.
  • the generated execution plan can also be executed by the cluster in distributed mode based on other parallel computing frameworks; a PySpark sketch of the distribution step follows below.
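  • The following PySpark sketch shows one plausible shape of the distribution step: the plan is broadcast to the executors, and each partition of the source data is processed against it. The path, plan structure and record handling are invented placeholders, not the platform's actual integration:

```python
# PySpark sketch of distributing an execution plan to cluster workers.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("feature-extraction").getOrCreate()
plan = {"windows": [...]}                      # placeholder execution plan
plan_bc = spark.sparkContext.broadcast(plan)   # ship the plan to executors

def run_plan_on_partition(records):
    local_plan = plan_bc.value
    # Placeholder: each computing device executes the plan against the
    # data records it holds and returns its partial results.
    for record in records:
        yield record

source = spark.read.parquet("hdfs:///data/transactions")  # invented path
results = source.rdd.mapPartitions(run_plan_on_partition)
results.count()  # triggers distributed execution on the cluster
```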
  • step S303 may include: providing a list of candidate clusters to the user; and then, executing the generated execution plan in a distributed mode by the cluster selected by the user from the list.
  • according to the method for uniformly performing feature extraction of the exemplary embodiment of the present disclosure, for the same feature extraction script, the local machine or a cluster executes a unified execution plan according to the feature extraction scenario.
  • as an example, in the online feature extraction scenario, the generated execution plan is executed by the local machine, and in the offline feature extraction scenario, the generated execution plan is executed by the cluster.
  • on one hand, this is compatible with the online and offline feature extraction scenarios and achieves seamless connection between them, eliminating the need to separately develop, for the same feature extraction script, the specific operation modes in the online and offline feature extraction scenarios, which reduces the workload of developers.
  • on the other hand, feature extraction can be performed efficiently in a high-throughput manner in the offline feature extraction scenario, and with high real-time performance and low latency in the online feature extraction scenario.
  • step S303 may include: performing, by the cluster in distributed mode, the following operations to implement the processing logic, corresponding to a computing node, of performing feature extraction under a time window: dividing the data records that have the same division reference field value in the source data table of the time window into the same group (that is, different groups correspond to different division reference field values), sorting the data records within the same group in ascending order of the time reference field value (that is, in the chronological order corresponding to the time reference field value), and then performing feature extraction under the time window based on the sorted data records within the same group.
  • specifically, for the current data record, the data records within its corresponding time window may be used to process the values of the source fields on which each feature depends to obtain each feature, wherein the data records within the time window are filtered out of the corresponding group according to the time span and/or window size.
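  • This group-sort-window pipeline maps naturally onto Spark's window functions. The PySpark sketch below (field names invented, window size 3 chosen arbitrarily) groups by the division reference field, orders by the time reference field and aggregates each feature over the window:

```python
# PySpark sketch of windowed feature extraction: group by the division
# reference field, sort by the time reference field, aggregate per window.
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.appName("window-features").getOrCreate()
df = spark.createDataFrame(
    [("u1", 1, 100.0), ("u1", 2, 50.0), ("u1", 3, 70.0), ("u2", 1, 30.0)],
    ["user_id", "swipe_time", "amount"],  # invented field names
)

# Time window: division reference field user_id, time reference field
# swipe_time, window size 3 (the current record plus two preceding ones).
w = (Window.partitionBy("user_id")
           .orderBy("swipe_time")
           .rowsBetween(-2, Window.currentRow))

features = (df.withColumn("amount_sum_w", F.sum("amount").over(w))
              .withColumn("amount_max_w", F.max("amount").over(w)))
features.show()
```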
  • step S302 may include: performing the following operation in single-machine mode to implement the processing logic, corresponding to a computing node, of performing feature extraction under a time window: for the current data record, the data records within its corresponding time window may be used to process the values of the source fields on which each feature depends to obtain each feature, wherein the data records within the time window are filtered out of the corresponding group according to the time span and/or window size; a local in-memory sketch follows below.
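  • In single-machine mode the same logic can run locally over in-memory data. A pandas sketch of the equivalent computation, with the same invented fields and window size as the distributed example above:

```python
# Local single-machine counterpart of the windowed extraction above.
import pandas as pd

df = pd.DataFrame({
    "user_id": ["u1", "u1", "u1", "u2"],   # division reference field
    "swipe_time": [1, 2, 3, 1],            # time reference field
    "amount": [100.0, 50.0, 70.0, 30.0],   # feature source field
})

df = df.sort_values(["user_id", "swipe_time"])  # group, then time order
grouped = df.groupby("user_id")["amount"]
# Window size 3: the current record plus up to two preceding ones.
df["amount_sum_w"] = grouped.transform(
    lambda s: s.rolling(3, min_periods=1).sum())
df["amount_max_w"] = grouped.transform(
    lambda s: s.rolling(3, min_periods=1).max())
print(df)
```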
  • FIG. 4 illustrates a block diagram of a system for uniformly performing feature extraction according to an exemplary embodiment of the present disclosure.
  • a system for uniformly performing feature extraction according to an exemplary embodiment of the present disclosure includes a script acquisition device 10, a plan generation device 20, and a plan execution device 30.
  • the script acquisition device 10 is configured to acquire a feature extraction script for defining processing logic related to feature extraction.
  • processing logic related to feature extraction may include any processing logic related to feature extraction.
  • processing logic related to feature extraction may include processing logic for obtaining features from a data table.
  • the data table may be an original data table or a data table obtained by processing the original data table (for example, splicing a plurality of original data tables).
  • the processing logic related to feature extraction may further include processing logic for performing data table splicing.
  • the processing logic for data table splicing may include processing logic for data table splicing for the source field of the feature.
  • the processing logic for performing data table splicing with respect to the source fields of features is: processing logic for splicing only the feature source fields in the data tables to be spliced to form a new data table.
  • processing logic related to feature extraction may involve feature extraction under one or more time windows.
  • the time window can be used to filter out the one or more data records on which feature generation depends; when the time window is set to include only one data record, it can be used to generate non-time-series features, and when the time window is set to include multiple data records, it can be used to generate time-series features.
  • processing logic related to feature extraction may involve extracting one or more features under each time window.
  • the processing logic related to feature extraction may further include processing logic for feature summary.
  • the time window may be defined by a source data table, a division reference field, a time reference field, a time span, and/or a window size.
  • the source data table of the time window is the data table on which feature extraction under the time window is based;
  • the division reference field of the time window is the field by which the data records in the source data table are grouped (that is, sharded), for example, a user ID;
  • the time reference field of the time window is the field by which each group of data records is sorted (for example, the time when the user swipes a card);
  • the time span of the time window is the time range, corresponding to the time reference field, covered by the data records within the time window (for example, one week);
  • the window size of the time window is the number of data records within the time window, and the window size is an integer greater than zero. It should be understood that when defining a time window, either one of the time span and the window size may be set, or both may be set.
  • when feature extraction is performed under multiple time windows, the multiple time windows are different from each other; that is, at least one of the following items differs among the multiple time windows: source data table, division reference field, time reference field, time span, window size.
  • processing logic related to feature extraction may involve performing non-time series feature extraction in a time window with a window size of 1 and / or time series feature extraction in a time window with a window size other than 1.
  • the script acquisition device 10 may directly acquire a feature extraction script for defining processing logic related to feature extraction from the outside.
  • as another example, the script acquisition device 10 may acquire the feature extraction script based on code input by the user through an input box for defining the processing logic related to feature extraction and/or configuration items configured by the user for defining the processing logic related to feature extraction.
  • the plan generating device 20 is configured to parse the feature extraction script to generate an execution plan for performing feature extraction.
  • the plan generation device 20 may divide the processing logic defined by the feature extraction script in a processing order to generate an execution plan for performing feature extraction.
  • the plan generation device 20 may divide the corresponding processing logic according to the processing order for each time window to generate an execution plan for performing feature extraction.
  • the generated execution plan for feature extraction may be a directed acyclic graph (DAG graph) composed of nodes, where the nodes correspond to the segmented processing logic.
  • the node may include a computing node corresponding to processing logic for obtaining features from a data table.
  • the nodes may further include a table splicing node corresponding to the processing logic for data table splicing and/or a feature splicing node corresponding to the processing logic for feature summarization.
  • the processing logic for acquiring features from the data table for different time windows may correspond to different computing nodes, and the processing logic for splicing different data tables may correspond to different table splicing nodes. It should be understood that the connection relationship between the nodes corresponding to the divided processing logic parts may be determined based on the relationship between the input variables and/or output variables of the divided processing logic parts.
  • the plan execution device 30 is configured to execute a generated execution plan through local or cluster execution based on a feature extraction scenario.
  • the feature extraction scenario may be an online feature extraction scenario or an offline feature extraction scenario.
  • as an example, the plan execution device 30 may acquire a feature extraction scenario specified by the user.
  • for example, the system may be deployed on a machine learning platform for performing a machine learning process, and a display device may provide the user with a graphical interface for specifying the feature extraction scenario.
  • the plan execution device 30 may acquire the feature extraction scenario specified by the user according to an input operation performed by the user through the graphical interface.
  • as another example, the plan execution device 30 may automatically determine the feature extraction scenario. For example, when the current machine learning scenario is one of training a machine learning model, the plan execution device 30 may automatically determine the feature extraction scenario to be the offline feature extraction scenario; when the current machine learning scenario is one of performing estimation using a trained machine learning model, the plan execution device 30 may automatically determine the feature extraction scenario to be the online feature extraction scenario.
  • when the feature extraction scenario is the online feature extraction scenario, the plan execution device 30 executes the generated execution plan by the local machine in single-machine mode.
  • as an example, the system may be deployed on a machine learning platform for performing a machine learning process, and the local machine is the computing device currently using the machine learning platform to perform feature extraction.
  • when the feature extraction scenario is the offline feature extraction scenario, the plan execution device 30 executes the generated execution plan by the cluster in distributed mode.
  • plan execution device 30 may execute the generated execution plan in a distributed mode through a cluster based on the parallel computing framework Spark.
  • the plan execution device 30 may execute processing logic corresponding to each node through a local or cluster based on feature extraction scenarios to execute the generated execution plan.
  • the plan execution device 30 may compile the processing logic corresponding to the computing node into at least one executable file through a local machine or a cluster, and run the at least one executable file.
  • the plan execution device 30 can perform corresponding optimization when compiling the executable file.
  • the plan execution device 30 may replace a common sub-expression in the processing logic with an intermediate variable.
  • in the process of compiling the processing logic corresponding to a computing node into executable files, the plan execution device 30 may compile parts of the processing logic that are closely related in computation and independent of other processing logic into the same executable file.
  • the plan execution device 30 may provide the user with a list of candidate clusters; and execute the generated execution plan in a distributed mode through the clusters selected by the user from the list.
  • Devices included in a system that uniformly performs feature extraction according to an exemplary embodiment of the present disclosure may be respectively configured as software, hardware, firmware, or any combination of the above, that perform specific functions.
  • these devices may correspond to dedicated integrated circuits, may also correspond to pure software codes, and may also correspond to modules combining software and hardware.
  • one or more functions implemented by these devices may also be performed by components in a physical entity device (for example, a processor, a client, or a server).
  • the method for uniformly performing feature extraction may be implemented by a program recorded on a computer-readable medium.
  • a computer-readable medium for uniformly performing feature extraction may be provided, wherein a computer program for performing the following method steps is recorded on the computer-readable medium: obtaining a feature extraction script for defining processing logic related to feature extraction; parsing the feature extraction script to generate an execution plan for feature extraction; and, based on the feature extraction scenario, executing the generated execution plan by the local machine or a cluster.
  • the computer program in the computer-readable medium described above can run in an environment deployed in computer devices such as a client, host, proxy device or server. It should be noted that the computer program can also be used to perform additional steps in addition to the above steps, or to perform more specific processing when performing the above steps; the content of these additional steps and further processing has been described with reference to FIG. 1 to FIG. 3 and will not be repeated here.
  • a system that uniformly performs feature extraction according to an exemplary embodiment of the present disclosure may rely entirely on the running of a computer program to realize the corresponding functions; that is, each device corresponds to a step in the functional architecture of the computer program, so that the entire system is called through a dedicated software package (for example, a lib library) to realize the corresponding functions.
  • each device included in the system for uniformly performing feature extraction may also be implemented by hardware, software, firmware, middleware, microcode, or any combination thereof.
  • the program code or code segments for performing corresponding operations may be stored in a computer-readable medium such as a storage medium, so that the processor can read and run the corresponding program Code or code segment to perform the corresponding operation.
  • the exemplary embodiment of the present disclosure may also be implemented as a computing device.
  • the computing device includes a storage component and a processor, wherein the storage component stores a set of computer-executable instructions, and when the set of computer-executable instructions is executed by the processor, the method of uniformly performing feature extraction is performed.
  • the computing device may be deployed in a server or a client, or may be deployed on a node device in a distributed network environment.
  • the computing device may be a PC computer, a tablet device, a personal digital assistant, a smart phone, a web application, or other devices capable of executing the above instruction set.
  • the computing device does not have to be a single computing device, but may also be an assembly of any device or circuit capable of individually or jointly executing the above instructions (or instruction set).
  • the computing device may also be part of an integrated control system or system manager, or may be configured as a portable electronic device that interfaces with a local or remote device (e.g., via wireless transmission).
  • the processor may include a central processing unit (CPU), a graphics processor (GPU), a programmable logic device, a special-purpose processor system, a microcontroller, or a microprocessor.
  • processors may also include analog processors, digital processors, microprocessors, multi-core processors, processor arrays, network processors, and the like.
  • Some operations described in the method for uniformly performing feature extraction according to an exemplary embodiment of the present disclosure may be implemented by software, some operations may be implemented by hardware, and furthermore, by a combination of software and hardware Implement these operations.
  • the processor may execute instructions or code stored in one of the storage components, wherein the storage component may also store data. Instructions and data can also be sent and received over a network via a network interface device, which can employ any known transmission protocol.
  • the storage component may be integrated with the processor, for example, with RAM or flash memory arranged within an integrated circuit microprocessor or the like.
  • the storage component may include a stand-alone device, such as an external disk drive, a storage array, or other storage device usable by any database system.
  • the storage unit and the processor may be operatively coupled, or may communicate with each other, for example, through an I / O port, a network connection, or the like, so that the processor can read a file stored in the storage unit.
  • the computing device may also include a video display (such as a liquid crystal display) and a user interaction interface (such as a keyboard, mouse, touch input device, etc.). All components of the computing device may be connected to each other via a bus and / or a network.
  • a computing device that uniformly performs feature extraction according to an exemplary embodiment of the present disclosure may include a storage component and a processor, wherein the storage component stores a set of computer-executable instructions, and when the set of computer-executable instructions is executed by the processor, the following steps are performed: obtaining a feature extraction script for defining processing logic related to feature extraction; parsing the feature extraction script to generate an execution plan for feature extraction; and, based on the feature extraction scenario, executing the generated execution plan by the local machine or a cluster.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Human Computer Interaction (AREA)
  • Stored Programmes (AREA)
  • Debugging And Monitoring (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

Provided are a method and system for uniformly performing feature extraction. The method includes: acquiring a feature extraction script for defining processing logic related to feature extraction; parsing the feature extraction script to generate an execution plan for performing feature extraction; and executing the generated execution plan by a local machine or a cluster based on a feature extraction scenario. According to the method and system, feature extraction can be performed uniformly in various feature extraction scenarios.

Description

Method and system for uniformly performing feature extraction

Technical Field

The present disclosure relates generally to the field of data processing and, more particularly, to a method and system for uniformly performing feature extraction.

Background

With the emergence of massive data, people tend to use machine learning techniques to mine value from data. Machine learning is an inevitable product of the development of artificial intelligence research to a certain stage; it is devoted to improving the performance of a system itself by computational means, using experience. In a computer system, "experience" usually exists in the form of "data", and a "model" can be produced from data through machine learning algorithms; that is, by providing empirical data to a machine learning algorithm, a model can be generated based on the empirical data, and, when a new situation is faced, the trained model is used to obtain a corresponding prediction result. Whether in the stage of training a machine learning model or in the stage of performing estimation with a machine learning model, feature extraction needs to be performed on data to obtain machine learning samples that include various features.

Current machine learning platforms or systems mainly implement the function of training machine learning models, that is, the process of performing operations such as feature extraction, model building and model tuning on collected large-scale data. This stage does not emphasize response speed but emphasizes throughput, namely the amount of data processed per unit time. If a trained machine learning model needs to be used for estimation, what usually matters is response speed rather than throughput, which forces technicians to carry out additional development for the estimation stage, especially for the feature extraction process, resulting in a high cost of implementing estimation.
Summary

Exemplary embodiments of the present disclosure aim to provide a method and system for uniformly performing feature extraction, which can perform feature extraction uniformly in various feature extraction scenarios.

According to an exemplary embodiment of the present disclosure, a method for uniformly performing feature extraction is provided, wherein the method includes: acquiring a feature extraction script for defining processing logic related to feature extraction; parsing the feature extraction script to generate an execution plan for performing feature extraction; and executing the generated execution plan by a local machine or a cluster based on a feature extraction scenario.

According to another exemplary embodiment of the present disclosure, a system for uniformly performing feature extraction is provided, wherein the system includes: a script acquisition device that acquires a feature extraction script for defining processing logic related to feature extraction; a plan generation device that parses the feature extraction script to generate an execution plan for performing feature extraction; and a plan execution device that executes the generated execution plan by a local machine or a cluster based on a feature extraction scenario.

According to another exemplary embodiment of the present disclosure, a system is provided that includes at least one computing device and at least one storage device storing instructions, wherein the instructions, when run by the at least one computing device, cause the at least one computing device to perform the method of uniformly performing feature extraction as described above.

According to another exemplary embodiment of the present disclosure, a computer-readable storage medium storing instructions is provided, wherein the instructions, when run by at least one computing device, cause the at least one computing device to perform the method of uniformly performing feature extraction as described above.

The method and system for uniformly performing feature extraction according to the exemplary embodiments of the present disclosure can support performing feature extraction uniformly in various feature extraction scenarios. As an example, on one hand, they are compatible with the online feature extraction scenario and the offline feature extraction scenario and realize seamless connection between the two, so that there is no need to separately develop, for the same feature extraction script, the specific operating modes for the online and offline feature extraction scenarios, which reduces the workload of developers; on the other hand, feature extraction can be performed efficiently in a high-throughput manner in the offline feature extraction scenario, and with high real-time performance and low latency in the online feature extraction scenario. In addition, the method and system are compatible with both time-series feature extraction and non-time-series feature extraction.

Additional aspects and/or advantages of the general concept of the present disclosure will be partly set forth in the following description, and partly will be clear from the description or may be learned through practice of the general concept of the present disclosure.
Brief Description of the Drawings

The above and other objects and features of the exemplary embodiments of the present disclosure will become clearer through the following description taken in conjunction with the accompanying drawings that exemplarily show the embodiments, in which:

FIG. 1 shows a flowchart of a method for uniformly performing feature extraction according to an exemplary embodiment of the present disclosure;

FIG. 2 shows an example of an execution plan according to an exemplary embodiment of the present disclosure;

FIG. 3 shows a flowchart of a method for uniformly performing feature extraction according to another exemplary embodiment of the present disclosure;

FIG. 4 shows a block diagram of a system for uniformly performing feature extraction according to an exemplary embodiment of the present disclosure.

Detailed Description

Reference will now be made in detail to the embodiments of the present disclosure, examples of which are shown in the accompanying drawings, wherein the same reference numerals always refer to the same components. The embodiments are described below with reference to the drawings in order to explain the present disclosure. It should be noted here that "and/or" appearing in the present disclosure covers three parallel cases. For example, "including A and/or B" means including at least one of A and B, namely the following three parallel cases: (1) including A; (2) including B; (3) including A and B. Similarly, "including A, B and/or C" means including at least one of A, B and C. As another example, "performing step 1 and/or step 2" means performing at least one of step 1 and step 2, namely the following three parallel cases: (1) performing step 1; (2) performing step 2; (3) performing step 1 and step 2.
FIG. 1 shows a flowchart of a method for uniformly performing feature extraction according to an exemplary embodiment of the present disclosure. Here, as an example, the method may be executed by a computer program, and may also be executed by a dedicated hardware device for performing machine learning, big data computing or data analysis, or by a collection of software and hardware resources; for example, the method may be executed by a machine learning platform for implementing machine learning related business.

Referring to FIG. 1, in step S10, a feature extraction script for defining processing logic related to feature extraction is acquired.

Here, the processing logic related to feature extraction may include any processing logic related to feature extraction. As an example, the processing logic related to feature extraction may include processing logic for acquiring features from a data table. Here, the data table may be an original data table, or a data table obtained by processing original data tables (for example, by splicing a plurality of original data tables).

As an example, when the data table is a data table obtained by splicing a plurality of original data tables, the processing logic related to feature extraction may further include processing logic for performing data table splicing. As a preferred example, the processing logic for performing data table splicing may include processing logic for performing data table splicing with respect to the source fields of features; here, the processing logic for performing data table splicing with respect to the source fields of features is: processing logic for splicing only the feature source fields in the data tables to be spliced to form a new data table.

Here, each data record in a data table can be regarded as a description of an event or object, corresponding to an example or sample. A data record includes attribute information, i.e., fields, reflecting the performance or nature of the event or object in a certain aspect. For example, one row of a data table corresponds to one data record, and one column of the data table corresponds to one field.

As an example, the processing logic related to feature extraction may involve performing feature extraction under one or more time windows. Here, a time window can be used to filter out the one or more data records on which feature generation depends; when the time window is set to include only one data record, it can be used to generate non-time-series features, and when the time window is set to include a plurality of data records, it can be used to generate time-series features. It should be understood that the processing logic related to feature extraction may involve extracting one or more features under each time window. As an example, when the processing logic related to feature extraction involves performing feature extraction under a plurality of time windows, the processing logic related to feature extraction may further include processing logic for performing feature summarization.

As an example, a time window may be defined by a source data table, a division reference field, a time reference field, a time span and/or a window size. Specifically, the source data table of a time window is the data table on which feature extraction under the time window is based; the division reference field of the time window is the field by which the data records in the source data table are grouped (i.e., sharded) (for example, a user ID); the time reference field of the time window is the field by which each group of data records is sorted (for example, the time at which the user swipes a card); the time span of the time window is the time range, corresponding to the time reference field, of the data records within the time window (for example, one week); the window size of the time window is the number of data records within the time window, and the window size is an integer greater than zero. It should be understood that, when defining a time window, either one of the time span and the window size may be set, or both the time span and the window size may be set.

As an example, when the processing logic related to feature extraction involves performing feature extraction under a plurality of time windows, the plurality of time windows are different from each other; that is, among the plurality of time windows, at least one of the following items differs: source data table, division reference field, time reference field, time span, window size.

As an example, the processing logic related to feature extraction may involve performing non-time-series feature extraction under a time window whose window size is 1 and/or performing time-series feature extraction under a time window whose window size is not 1.

Regarding time-series feature extraction, it is generally required when processing time series data. Time series data is strongly sequential, and successive data records generally have relationships such as dependence and periodicity. For example, transaction data may show strong correlation over time, so statistical results over transaction data can be used as features of a sample. Therefore, features reflecting time-series behavior (for example, recent transaction habits, such as amounts) can be generated based on a time window; it is generally necessary to specify the dimension of the time series data (i.e., the division reference field of the time window), for example, whether the relevant features (e.g., time-series statistical features of the transaction amount) are extracted per natural person (e.g., user ID) or per card number on which the transactions occurred; in addition, the range of the historical data involved in the time-series features (i.e., the time span and/or window size of the time window) needs to be specified, for example, the transaction amounts within the last week. The time window corresponding to time-series feature extraction can specify all the data records (including the current data record and/or historical data records) on which the feature currently to be extracted depends, so that the feature currently to be extracted can be computed based on the relevant field values in these data records.

According to an exemplary embodiment of the present disclosure, non-time-series feature extraction may be considered to be performed under a time window whose window size is 1, so that a unified time window setting can be used to be compatible with the extraction of both time-series features and non-time-series features. However, it should be understood that exemplary embodiments of the present disclosure may also perform non-time-series feature extraction without a time window.

As an example, when the processing logic related to feature extraction involves only non-time-series feature extraction, the processing logic related to feature extraction may not involve a time window; that is, it is unnecessary to set any time window for extracting features.

As an example, when the processing logic related to feature extraction involves both non-time-series feature extraction and time-series feature extraction, the processing logic related to feature extraction may involve performing non-time-series feature extraction under a time window whose window size is 1 and performing time-series feature extraction under a time window whose window size is not 1.

As an example, the feature extraction script for defining the processing logic related to feature extraction may be acquired directly from the outside. As another example, the feature extraction script may be acquired based on code input by a user through an input box for defining the processing logic related to feature extraction and/or configuration items configured by the user for defining the processing logic related to feature extraction. For example, the method may be executed by a machine learning platform for performing a machine learning process; in response to a user operation, the machine learning platform may provide the user with a graphical interface for configuring the feature extraction process (for example, an interface for configuring feature engineering), wherein the graphical interface may include input controls for inputting definitions of the processing logic related to feature extraction; then, an input operation performed by the user on the input controls on the graphical interface may be received, and the feature extraction script for defining the processing logic related to feature extraction may be acquired according to the input operation. As an example, the input controls may include a content input box for inputting code and/or configuration items for defining the processing logic related to feature extraction, and/or selection controls for performing selection operations among candidate configuration items of the processing logic related to feature extraction.
In step S20, the acquired feature extraction script is parsed to generate an execution plan for performing feature extraction.

As an example, the processing logic defined by the acquired feature extraction script may be divided in processing order to generate the execution plan for performing feature extraction. Since the feature extraction process needs to be executed in a certain processing order (for example, it needs to pass in turn through processing such as data table splicing, acquiring features from the data table, and summarizing the generated features), the processing logic defined by the acquired feature extraction script can be divided in the processing order of the feature extraction process; for example, the processing logic defined by the acquired feature extraction script can be divided into a processing logic part for performing data table splicing, a processing logic part for acquiring features from the data table, and a processing logic part for performing feature summarization; then, the execution plan for performing feature extraction can be generated based on the divided processing logic parts.

As an example, when the processing logic defined by the acquired feature extraction script involves performing feature extraction under at least one time window, the corresponding processing logic may be divided in processing order for each time window respectively to generate the execution plan for performing feature extraction. That is, processing logic corresponding to different time windows is not divided into the same processing logic part. For example, when the processing logic defined by the acquired feature extraction script involves performing feature extraction under a plurality of time windows, the processing logic corresponding to each time window may be divided, for each time window respectively, in the processing order of the feature extraction process; for example, the processing logic defined by the acquired feature extraction script may be divided into: a processing logic part for performing data table splicing and a processing logic part for acquiring features from the data table for each time window respectively, and a processing logic part for performing feature summarization on the features generated for all the time windows; then, the execution plan for performing feature extraction can be generated based on the divided processing logic parts.

As an example, the generated execution plan for performing feature extraction may be a directed acyclic graph (DAG) composed of nodes, where the nodes correspond to the divided processing logic. As an example, the nodes may include computing nodes corresponding to the processing logic for acquiring features from a data table. In addition, the nodes may further include table splicing nodes corresponding to the processing logic for performing data table splicing and/or feature splicing nodes corresponding to the processing logic for performing feature summarization. As an example, the processing logic for acquiring features from the data table for different time windows may correspond to different computing nodes, and the processing logic for splicing out different data tables may correspond to different table splicing nodes. It should be understood that the connection relationships between the nodes corresponding to the divided processing logic parts may be determined based on the relationships between the input variables and/or output variables of the divided processing logic parts.

FIG. 2 shows an example of an execution plan according to an exemplary embodiment of the present disclosure. As shown in FIG. 2, the processing logic defined by the acquired feature extraction script may be divided, in the processing order of the feature extraction process, into: a processing logic part for performing data table splicing for time window 1 (for example, for splicing data table 1 and data table 2 to obtain the source data table of time window 1); a processing logic part for acquiring features from the data table (i.e., performing feature extraction) for time window 1; a processing logic part for performing data table splicing for time window 2 (for example, for splicing data table 1 and data table 3 to obtain the source data table of time window 2); a processing logic part for acquiring features from the data table for time window 2; and a processing logic part for performing feature summarization on the features acquired based on time window 1 and the features acquired based on time window 2. Then, a directed acyclic graph composed of nodes as shown in FIG. 2 (i.e., an execution plan for performing feature extraction) can be generated based on the divided processing logic parts, in which table splicing node 1 corresponds to the processing logic part for performing data table splicing for time window 1, computing node 1 corresponds to the processing logic part for acquiring features from the data table for time window 1, table splicing node 2 corresponds to the processing logic part for performing data table splicing for time window 2, computing node 2 corresponds to the processing logic part for acquiring features from the data table for time window 2, and the feature splicing node corresponds to the processing logic part for performing feature summarization on the features acquired based on time window 1 and the features acquired based on time window 2.
Returning to FIG. 1, in step S30, the generated execution plan is executed by the local machine or a cluster based on the feature extraction scenario. As an example, the feature extraction scenario may be an online feature extraction scenario or an offline feature extraction scenario.

As an example, when the generated execution plan is a directed acyclic graph composed of nodes, the generated execution plan may be executed, based on the feature extraction scenario, by implementing the processing logic corresponding to each node through the local machine or the cluster according to the connection relationships between the nodes in the directed acyclic graph.

As an example, implementing the processing logic corresponding to a computing node through the local machine or the cluster may include: directly running the computing node through the local machine or the cluster. As another example, implementing the processing logic corresponding to a computing node through the local machine or the cluster may include: compiling, through the local machine or the cluster, the processing logic corresponding to the computing node into at least one executable file, and running the at least one executable file. Preferably, corresponding optimization may be performed when compiling into an executable file.

As a preferred example, in the process of compiling the processing logic corresponding to a computing node into an executable file, a common sub-expression in the processing logic may be replaced with an intermediate variable. For example, when the processing logic corresponding to the computing node includes f1=discrete(max(col1)) and f2=continous(max(col1)), the common sub-expression max(col1) can be turned into an intermediate variable in the process of compiling the processing logic corresponding to the computing node into an executable file, i.e., let a=max(col1), f1=discrete(a), f2=continous(a); in this way, when the corresponding executable file is executed, the value of a only needs to be computed once, and f1 and f2 can reuse the computation result of a. By reusing intermediate variables, intermediate computation results can be reused, so that the amount of computation of the feature extraction process can be reduced and the execution efficiency of the feature extraction process can be improved.

As a preferred example, in the process of compiling the processing logic corresponding to a computing node into executable files, parts of the processing logic that are closely related in computation and independent of other processing logic may be compiled into the same executable file. For example, a part of the processing logic that is closely related in computation and independent of other processing logic may be a part that uses the same common sub-expression and is logically unrelated to other processing logic; in this way, that part of the processing logic can share intermediate variables, and, since intermediate variables are not shared between different executable files, different executable files can be executed in parallel. Thus, according to the above method, the JIT (Just-In-Time) compilation capability of the compiler can be reused to improve the execution efficiency of the code in the compiled executable files, and logical isolation can be prepared for concurrent execution of the feature extraction process so as to execute multiple executable files in parallel.

As an example, for each computing node respectively, the processing logic corresponding to that computing node may be compiled into at least one executable file.

FIG. 3 shows a flowchart of a method for uniformly performing feature extraction according to another exemplary embodiment of the present disclosure. Here, step S30 specifically includes step S301, step S302 and step S303, and step S10 and step S20 may be implemented with reference to the specific implementations described according to FIG. 1 and FIG. 2, which are not described again here.
In step S301, the feature extraction scenario is determined.

As an example, a feature extraction scenario specified by the user may be acquired. For example, the method may be executed by a machine learning platform for performing a machine learning process; the machine learning platform may provide the user with a graphical interface for specifying the feature extraction scenario, and the feature extraction scenario specified by the user is acquired according to an input operation performed by the user through the graphical interface.

As another example, the feature extraction scenario may be determined automatically. For example, when the current machine learning scenario is a machine learning scenario of training a machine learning model, the feature extraction scenario may be automatically determined to be the offline feature extraction scenario; when the current machine learning scenario is a machine learning scenario of performing estimation with a trained machine learning model, the feature extraction scenario may be automatically determined to be the online feature extraction scenario.

When it is determined in step S301 that the feature extraction scenario is the online feature extraction scenario, the generated execution plan is executed by the local machine in single-machine mode. As an example, the generated execution plan may be executed by the local machine in single-machine mode based on an in-memory database. For example, the processing logic for performing data table splicing and/or the processing logic for performing feature summarization may be implemented through the local in-memory database.

When it is determined in step S301 that the feature extraction scenario is the offline feature extraction scenario, step S303 is performed to execute the generated execution plan by a cluster in distributed mode. In other words, the generated execution plan may be executed in parallel by a plurality of computing devices in the cluster. It should be noted that the computing device mentioned here may indicate either a physical entity or a virtual entity; for example, a computing device may indicate an actual computing machine, or a logical entity deployed on such a computing machine.

As an example, the generated execution plan may be executed by the cluster in distributed mode based on the parallel computing framework Spark. For example, the processing logic for performing data table splicing, the processing logic for performing feature summarization and other processing logic may be implemented through the underlying interfaces of Spark. For example, the generated execution plan for performing feature extraction may be distributed based on Spark to each computing device in the cluster, so that each computing device executes the generated execution plan based on the data stored therein and returns an execution result. In addition, the generated execution plan may also be executed by the cluster in distributed mode based on other parallel computing frameworks.

As an example, step S303 may include: providing the user with a list of candidate clusters; and then executing the generated execution plan in distributed mode through a cluster selected by the user from the list.

According to the method for uniformly performing feature extraction of the exemplary embodiment of the present disclosure, for the same feature extraction script, a unified execution plan is executed by the local machine or a cluster according to the feature extraction scenario. As an example, in the online feature extraction scenario, the generated execution plan is executed by the local machine, and in the offline feature extraction scenario, the generated execution plan is executed by a cluster. On one hand, this is compatible with both the online feature extraction scenario and the offline feature extraction scenario and realizes seamless connection between the two, so that there is no need to separately develop, for the same feature extraction script, the specific operating modes for the online and offline feature extraction scenarios, which reduces the workload of developers; on the other hand, feature extraction can be performed efficiently in a high-throughput manner in the offline feature extraction scenario, and with high real-time performance and low latency in the online feature extraction scenario.

As an example, step S303 may include: performing, by the cluster in distributed mode, the following operations to implement the processing logic, corresponding to a computing node, of performing feature extraction under a time window: dividing the data records that have the same division reference field value in the source data table of the time window into the same group (i.e., different groups correspond to different division reference field values), and sorting the data records within the same group in ascending order of the time reference field value (i.e., in the chronological order corresponding to the time reference field value); then, performing feature extraction under the time window based on the sorted data records within the same group; specifically, for the current data record, the data records within its corresponding time window may be used to process the values of the source fields on which each feature depends so as to obtain each feature, wherein the data records within the time window are filtered out from the corresponding group according to the time span and/or the window size.

As an example, step S302 may include: performing, by the local machine in single-machine mode, the following operation to implement the processing logic, corresponding to a computing node, of performing feature extraction under a time window: for the current data record, the data records within its corresponding time window may be used to process the values of the source fields on which each feature depends so as to obtain each feature, wherein the data records within the time window are filtered out from the corresponding group according to the time span and/or the window size.
FIG. 4 shows a block diagram of a system for uniformly performing feature extraction according to an exemplary embodiment of the present disclosure. As shown in FIG. 4, the system for uniformly performing feature extraction according to the exemplary embodiment of the present disclosure includes a script acquisition device 10, a plan generation device 20 and a plan execution device 30.

Specifically, the script acquisition device 10 is configured to acquire a feature extraction script for defining processing logic related to feature extraction.

Here, the processing logic related to feature extraction may include any processing logic related to feature extraction. As an example, the processing logic related to feature extraction may include processing logic for acquiring features from a data table. Here, the data table may be an original data table, or a data table obtained by processing original data tables (for example, by splicing a plurality of original data tables).

As an example, when the data table is a data table obtained by splicing a plurality of original data tables, the processing logic related to feature extraction may further include processing logic for performing data table splicing. As a preferred example, the processing logic for performing data table splicing may include processing logic for performing data table splicing with respect to the source fields of features; here, the processing logic for performing data table splicing with respect to the source fields of features is: processing logic for splicing only the feature source fields in the data tables to be spliced to form a new data table.

As an example, the processing logic related to feature extraction may involve performing feature extraction under one or more time windows. Here, a time window can be used to filter out the one or more data records on which feature generation depends; when the time window is set to include only one data record, it can be used to generate non-time-series features, and when the time window is set to include a plurality of data records, it can be used to generate time-series features. It should be understood that the processing logic related to feature extraction may involve extracting one or more features under each time window. As an example, when the processing logic related to feature extraction involves performing feature extraction under a plurality of time windows, the processing logic related to feature extraction may further include processing logic for performing feature summarization.

As an example, a time window may be defined by a source data table, a division reference field, a time reference field, a time span and/or a window size. Specifically, the source data table of a time window is the data table on which feature extraction under the time window is based; the division reference field of the time window is the field by which the data records in the source data table are grouped (i.e., sharded) (for example, a user ID); the time reference field of the time window is the field by which each group of data records is sorted (for example, the time at which the user swipes a card); the time span of the time window is the time range, corresponding to the time reference field, of the data records within the time window (for example, one week); the window size of the time window is the number of data records within the time window, and the window size is an integer greater than zero. It should be understood that, when defining a time window, either one of the time span and the window size may be set, or both may be set.

As an example, when the processing logic related to feature extraction involves performing feature extraction under a plurality of time windows, the plurality of time windows are different from each other; that is, among the plurality of time windows, at least one of the following items differs: source data table, division reference field, time reference field, time span, window size.

As an example, the processing logic related to feature extraction may involve performing non-time-series feature extraction under a time window whose window size is 1 and/or performing time-series feature extraction under a time window whose window size is not 1.

As an example, the script acquisition device 10 may acquire the feature extraction script for defining the processing logic related to feature extraction directly from the outside. As another example, the script acquisition device 10 may acquire the feature extraction script based on code input by the user through an input box for defining the processing logic related to feature extraction and/or configuration items configured by the user for defining the processing logic related to feature extraction.

The plan generation device 20 is configured to parse the feature extraction script to generate an execution plan for performing feature extraction.

As an example, the plan generation device 20 may divide the processing logic defined by the feature extraction script in processing order to generate the execution plan for performing feature extraction.

As an example, when the processing logic involves performing feature extraction under at least one time window, the plan generation device 20 may divide the corresponding processing logic in processing order for each time window respectively to generate the execution plan for performing feature extraction.

As an example, the generated execution plan for performing feature extraction may be a directed acyclic graph (DAG) composed of nodes, where the nodes correspond to the divided processing logic. As an example, the nodes may include computing nodes corresponding to the processing logic for acquiring features from a data table. In addition, the nodes may further include table splicing nodes corresponding to the processing logic for performing data table splicing and/or feature splicing nodes corresponding to the processing logic for performing feature summarization. As an example, the processing logic for acquiring features from the data table for different time windows may correspond to different computing nodes, and the processing logic for splicing out different data tables may correspond to different table splicing nodes. It should be understood that the connection relationships between the nodes corresponding to the divided processing logic parts may be determined based on the relationships between the input variables and/or output variables of the divided processing logic parts.

The plan execution device 30 is configured to execute the generated execution plan by the local machine or a cluster based on the feature extraction scenario. As an example, the feature extraction scenario may be an online feature extraction scenario or an offline feature extraction scenario.

As an example, the plan execution device 30 may acquire a feature extraction scenario specified by the user. For example, the system may be deployed on a machine learning platform for performing a machine learning process; a display device may provide the user with a graphical interface for specifying the feature extraction scenario, and the plan execution device 30 may acquire the feature extraction scenario specified by the user according to an input operation performed by the user through the graphical interface.

As another example, the plan execution device 30 may determine the feature extraction scenario automatically. For example, when the current machine learning scenario is a machine learning scenario of training a machine learning model, the plan execution device 30 may automatically determine the feature extraction scenario to be the offline feature extraction scenario; when the current machine learning scenario is a machine learning scenario of performing estimation with a trained machine learning model, the plan execution device 30 may automatically determine the feature extraction scenario to be the online feature extraction scenario.

As an example, when the feature extraction scenario is the online feature extraction scenario, the plan execution device 30 may execute the generated execution plan by the local machine in single-machine mode. As an example, the system may be deployed on a machine learning platform for performing a machine learning process, and the local machine is the computing device that is currently using the machine learning platform to perform feature extraction.

As an example, when the feature extraction scenario is the offline feature extraction scenario, the plan execution device 30 may execute the generated execution plan by a cluster in distributed mode.

As an example, the plan execution device 30 may execute the generated execution plan by the cluster in distributed mode based on the parallel computing framework Spark.

As an example, when the execution plan is a directed acyclic graph composed of nodes, the plan execution device 30 may execute the generated execution plan by implementing, based on the feature extraction scenario, the processing logic corresponding to each node through the local machine or the cluster.

As an example, the plan execution device 30 may compile, through the local machine or the cluster, the processing logic corresponding to a computing node into at least one executable file, and run the at least one executable file. Preferably, the plan execution device 30 may perform corresponding optimization when compiling the executable files.

As an example, in the process of compiling the processing logic corresponding to a computing node into an executable file, the plan execution device 30 may replace a common sub-expression in the processing logic with an intermediate variable.

As an example, in the process of compiling the processing logic corresponding to a computing node into executable files, the plan execution device 30 may compile parts of the processing logic that are closely related in computation and independent of other processing logic into the same executable file.

As an example, when the feature extraction scenario is the offline feature extraction scenario, the plan execution device 30 may provide the user with a list of candidate clusters, and execute the generated execution plan in distributed mode through a cluster selected by the user from the list.

It should be understood that the specific implementations of the system for uniformly performing feature extraction according to the exemplary embodiment of the present disclosure may be realized with reference to the related specific implementations described in conjunction with FIG. 1 to FIG. 3, and are not described again here.
The devices included in the system for uniformly performing feature extraction according to the exemplary embodiment of the present disclosure may be respectively configured as software, hardware, firmware or any combination of the above for performing specific functions. For example, these devices may correspond to dedicated integrated circuits, may correspond to pure software code, or may correspond to modules combining software and hardware. In addition, one or more functions implemented by these devices may also be uniformly performed by components in a physical entity device (for example, a processor, a client or a server).

It should be understood that the method for uniformly performing feature extraction according to the exemplary embodiment of the present disclosure may be implemented by a program recorded on a computer-readable medium. For example, according to an exemplary embodiment of the present disclosure, a computer-readable medium for uniformly performing feature extraction may be provided, wherein a computer program for performing the following method steps is recorded on the computer-readable medium: acquiring a feature extraction script for defining processing logic related to feature extraction; parsing the feature extraction script to generate an execution plan for performing feature extraction; and executing the generated execution plan by the local machine or a cluster based on a feature extraction scenario.

The computer program in the above computer-readable medium may run in an environment deployed in computer devices such as a client, a host, an agent device or a server. It should be noted that the computer program may also be used to perform additional steps in addition to the above steps, or to perform more specific processing when performing the above steps; the content of these additional steps and further processing has been described with reference to FIG. 1 to FIG. 3, and will not be repeated here in order to avoid repetition.

It should be noted that the system for uniformly performing feature extraction according to the exemplary embodiment of the present disclosure may rely entirely on the running of a computer program to realize the corresponding functions; that is, each device corresponds to a step in the functional architecture of the computer program, so that the entire system is called through a dedicated software package (for example, a lib library) to realize the corresponding functions.

On the other hand, each device included in the system for uniformly performing feature extraction according to the exemplary embodiment of the present disclosure may also be implemented by hardware, software, firmware, middleware, microcode or any combination thereof. When implemented in software, firmware, middleware or microcode, the program code or code segments for performing the corresponding operations may be stored in a computer-readable medium such as a storage medium, so that a processor can perform the corresponding operations by reading and running the corresponding program code or code segments.

For example, an exemplary embodiment of the present disclosure may also be implemented as a computing device that includes a storage component and a processor. A set of computer-executable instructions is stored in the storage component, and when the set of computer-executable instructions is executed by the processor, the method for uniformly performing feature extraction is performed.

Specifically, the computing device may be deployed in a server or a client, and may also be deployed on a node device in a distributed network environment. In addition, the computing device may be a PC computer, a tablet device, a personal digital assistant, a smart phone, a web application or another device capable of executing the above instruction set.

Here, the computing device does not have to be a single computing device, and may also be any aggregate of devices or circuits capable of executing the above instructions (or instruction set) individually or jointly. The computing device may also be part of an integrated control system or system manager, or may be configured as a portable electronic device interconnected with a local or remote (e.g., via wireless transmission) interface.

In the computing device, the processor may include a central processing unit (CPU), a graphics processing unit (GPU), a programmable logic device, a dedicated processor system, a microcontroller or a microprocessor. By way of example and not limitation, the processor may also include an analog processor, a digital processor, a microprocessor, a multi-core processor, a processor array, a network processor, etc.

Some of the operations described in the method for uniformly performing feature extraction according to the exemplary embodiment of the present disclosure may be implemented in software, some may be implemented in hardware, and, in addition, these operations may be implemented by a combination of software and hardware.

The processor may run instructions or code stored in one of the storage components, and the storage component may also store data. Instructions and data may also be sent and received over a network via a network interface device, which may employ any known transmission protocol.

The storage component may be integrated with the processor, for example, with RAM or flash memory arranged within an integrated circuit microprocessor or the like. In addition, the storage component may include a stand-alone device, such as an external disk drive, a storage array or any other storage device usable by a database system. The storage component and the processor may be operatively coupled, or may communicate with each other, for example, through an I/O port or a network connection, so that the processor can read files stored in the storage component.

In addition, the computing device may also include a video display (such as a liquid crystal display) and a user interaction interface (such as a keyboard, a mouse, a touch input device, etc.). All components of the computing device may be connected to each other via a bus and/or a network.

The operations involved in the method for uniformly performing feature extraction according to the exemplary embodiment of the present disclosure may be described as various interconnected or coupled functional blocks or functional diagrams. However, these functional blocks or functional diagrams may be equally integrated into a single logical device or operated according to imprecise boundaries.

For example, as described above, the computing device for uniformly performing feature extraction according to the exemplary embodiment of the present disclosure may include a storage component and a processor, wherein a set of computer-executable instructions is stored in the storage component, and when the set of computer-executable instructions is executed by the processor, the following steps are performed: acquiring a feature extraction script for defining processing logic related to feature extraction; parsing the feature extraction script to generate an execution plan for performing feature extraction; and executing the generated execution plan by the local machine or a cluster based on a feature extraction scenario.

The exemplary embodiments of the present disclosure have been described above. It should be understood that the above description is only exemplary and not exhaustive, and the present disclosure is not limited to the disclosed exemplary embodiments. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the present disclosure. Therefore, the protection scope of the present disclosure should be subject to the scope of the claims.

Claims (28)

  1. A method for uniformly performing feature extraction by at least one computing device, wherein the method comprises:
    acquiring a feature extraction script that defines processing logic related to feature extraction;
    parsing the feature extraction script to generate an execution plan for performing feature extraction; and
    executing the generated execution plan through the local machine or a cluster based on a feature extraction scenario.
  2. The method of claim 1, wherein the step of executing the generated execution plan through the local machine or a cluster based on the feature extraction scenario comprises:
    when the feature extraction scenario is an online feature extraction scenario, executing the generated execution plan through the local machine in single-machine mode; and
    when the feature extraction scenario is an offline feature extraction scenario, executing the generated execution plan through a cluster in distributed mode.
  3. The method of claim 1, wherein the step of parsing the feature extraction script to generate an execution plan for performing feature extraction comprises:
    splitting the processing logic defined by the feature extraction script according to the processing order to generate the execution plan for performing feature extraction.
  4. The method of claim 3, wherein the processing logic involves performing feature extraction under at least one time window, and
    the step of splitting the processing logic defined by the feature extraction script according to the processing order to generate the execution plan for performing feature extraction comprises: for each time window, separately splitting the corresponding processing logic according to the processing order to generate the execution plan for performing feature extraction.
  5. The method of claim 4, wherein the execution plan is a directed acyclic graph composed of nodes, wherein the nodes correspond to the split processing logic,
    and the step of executing the generated execution plan through the local machine or a cluster based on the feature extraction scenario comprises: based on the feature extraction scenario, executing the generated execution plan by implementing the processing logic corresponding to each node through the local machine or the cluster.
  6. The method of claim 5, wherein the nodes comprise compute nodes corresponding to processing logic for acquiring features from data tables.
  7. The method of claim 6, wherein the nodes further comprise at least one of table-joining nodes corresponding to processing logic for joining data tables and feature-concatenation nodes corresponding to processing logic for aggregating features.
  8. The method of claim 6, wherein implementing the processing logic corresponding to a compute node through the local machine or the cluster comprises: compiling, through the local machine or the cluster, the processing logic corresponding to the compute node into at least one executable file, and running the at least one executable file,
    wherein at least one of the following two applies: in the process of compiling the processing logic corresponding to the compute node into an executable file, common subexpressions within the processing logic are replaced with intermediate variables; and those portions of the processing logic that are closely related in computation and independent of the other processing logic are compiled into the same executable file.
  9. The method of claim 4, wherein the time window is defined by at least one of a source data table, a partition reference field, a time reference field, a time span, and a window size.
  10. The method of claim 1, wherein the feature extraction scenario is specified by the user or determined automatically.
  11. The method of claim 2, wherein, when the feature extraction scenario is an offline feature extraction scenario, the step of executing the generated execution plan through a cluster in distributed mode comprises:
    when the feature extraction scenario is an offline feature extraction scenario, providing the user with a list of candidate clusters; and
    executing the generated execution plan in distributed mode through the cluster the user selects from the list.
  12. The method of claim 7, wherein the processing logic for joining data tables comprises processing logic for joining data tables with respect to the source fields of features.
  13. The method of claim 9, wherein the processing logic involves at least one of performing non-time-series feature extraction under a time window whose window size is 1 and performing time-series feature extraction under a time window whose window size is not 1.
  14. A system comprising at least one computing device and at least one storage device storing instructions, wherein the instructions, when run by the at least one computing device, cause the at least one computing device to perform the following steps for uniformly performing feature extraction:
    acquiring a feature extraction script that defines processing logic related to feature extraction;
    parsing the feature extraction script to generate an execution plan for performing feature extraction; and
    executing the generated execution plan through the local machine or a cluster based on a feature extraction scenario.
  15. The system of claim 14, wherein the step of executing the generated execution plan through the local machine or a cluster based on the feature extraction scenario comprises:
    when the feature extraction scenario is an online feature extraction scenario, executing the generated execution plan through the local machine in single-machine mode; and
    when the feature extraction scenario is an offline feature extraction scenario, executing the generated execution plan through a cluster in distributed mode.
  16. The system of claim 14, wherein the step of parsing the feature extraction script to generate an execution plan for performing feature extraction comprises:
    splitting the processing logic defined by the feature extraction script according to the processing order to generate the execution plan for performing feature extraction.
  17. The system of claim 16, wherein the processing logic involves performing feature extraction under at least one time window, and
    the step of splitting the processing logic defined by the feature extraction script according to the processing order to generate the execution plan for performing feature extraction comprises: for each time window, separately splitting the corresponding processing logic according to the processing order to generate the execution plan for performing feature extraction.
  18. The system of claim 17, wherein the execution plan is a directed acyclic graph composed of nodes, wherein the nodes correspond to the split processing logic,
    and the step of executing the generated execution plan through the local machine or a cluster based on the feature extraction scenario comprises: based on the feature extraction scenario, executing the generated execution plan by implementing the processing logic corresponding to each node through the local machine or the cluster.
  19. The system of claim 18, wherein the nodes comprise compute nodes corresponding to processing logic for acquiring features from data tables.
  20. The system of claim 19, wherein the nodes further comprise at least one of table-joining nodes corresponding to processing logic for joining data tables and feature-concatenation nodes corresponding to processing logic for aggregating features.
  21. The system of claim 19, wherein implementing the processing logic corresponding to a compute node through the local machine or the cluster comprises: compiling, through the local machine or the cluster, the processing logic corresponding to the compute node into at least one executable file, and running the at least one executable file,
    wherein at least one of the following two applies: in the process of compiling the processing logic corresponding to the compute node into an executable file, common subexpressions within the processing logic are replaced with intermediate variables; and those portions of the processing logic that are closely related in computation and independent of the other processing logic are compiled into the same executable file.
  22. The system of claim 17, wherein the time window is defined by at least one of a source data table, a partition reference field, a time reference field, a time span, and a window size.
  23. The system of claim 14, wherein the feature extraction scenario is specified by the user or determined automatically.
  24. The system of claim 15, wherein the step of executing the generated execution plan through a cluster in distributed mode comprises: when the feature extraction scenario is an offline feature extraction scenario, providing the user with a list of candidate clusters; and executing the generated execution plan in distributed mode through the cluster the user selects from the list.
  25. The system of claim 20, wherein the processing logic for joining data tables comprises processing logic for joining data tables with respect to the source fields of features.
  26. The system of claim 22, wherein the processing logic involves at least one of performing non-time-series feature extraction under a time window whose window size is 1 and performing time-series feature extraction under a time window whose window size is not 1.
  27. A system for uniformly performing feature extraction, wherein the system comprises:
    a script acquisition device that acquires a feature extraction script defining processing logic related to feature extraction;
    a plan generation device that parses the feature extraction script to generate an execution plan for performing feature extraction; and
    a plan execution device that executes the generated execution plan through the local machine or a cluster based on a feature extraction scenario.
  28. A computer-readable storage medium storing instructions which, when run by at least one computing device, cause the at least one computing device to perform the method for uniformly performing feature extraction according to any one of claims 1 to 13.
PCT/CN2019/101649 2018-08-21 2019-08-20 Method and system for uniformly performing feature extraction WO2020038376A1 (zh)

Priority Applications (2)

Application Number Priority Date Filing Date Title
US17/270,248 US20210326761A1 (en) 2018-08-21 2019-08-20 Method and System for Uniform Execution of Feature Extraction
EP19852643.6A EP3842940A4 (en) 2018-08-21 2019-08-20 METHOD AND SYSTEM FOR CONTINUOUSLY PERFORMING FEATURE EXTRACTION

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201810954494.5 2018-08-21
CN201810954494.5A CN109144648B (zh) 2018-08-21 2018-08-21 Method and system for uniformly performing feature extraction

Publications (1)

Publication Number Publication Date
WO2020038376A1 (zh)

Family

ID=64790714

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/101649 WO2020038376A1 (zh) 2018-08-21 2019-08-20 统一地执行特征抽取的方法及系统

Country Status (4)

Country Link
US (1) US20210326761A1 (zh)
EP (1) EP3842940A4 (zh)
CN (2) CN109144648B (zh)
WO (1) WO2020038376A1 (zh)

Also Published As

Publication number Publication date
EP3842940A1 (en) 2021-06-30
US20210326761A1 (en) 2021-10-21
CN111949349A (zh) 2020-11-17
CN109144648A (zh) 2019-01-04
CN109144648B (zh) 2020-06-23
EP3842940A4 (en) 2022-05-04

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application (Ref document number: 19852643; Country of ref document: EP; Kind code of ref document: A1)
NENP Non-entry into the national phase (Ref country code: DE)
ENP Entry into the national phase (Ref document number: 2019852643; Country of ref document: EP; Effective date: 20210322)