WO2024131499A1 - Data analysis system, method and device - Google Patents

Info

Publication number
WO2024131499A1
Authority
WO
WIPO (PCT)
Prior art keywords
operator
feature
data
operators
layer
Application number
PCT/CN2023/135554
Other languages
French (fr)
Chinese (zh)
Inventor
李伟琪
黄飞腾
黄永强
Original Assignee
华为云计算技术有限公司
Application filed by 华为云计算技术有限公司
Publication of WO2024131499A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2458 Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
    • G06F16/2474 Sequence data queries, e.g. querying versioned data
    • G06F16/2457 Query processing with adaptation to user needs


Definitions

  • the present application relates to the field of information technology, and more specifically, to a data analysis system, method and device.
  • The data stored in a time series database is usually called time series data.
  • Time series analysis based on time series data can explore the deeper value inside the time series data.
  • Common time series analysis includes time series anomaly detection, time series prediction, clustering, association analysis, etc.
  • For time series data with large data volumes and complex, diverse formats, basic time series anomaly detection and prediction scenarios can be provided.
  • For example, the cloud vendor AWS can provide low-computation intelligent analysis capabilities based on metadata such as field attributes and data dimensions.
  • With a high-performance RCF algorithm, a single algorithm can meet simple user needs. However, as the types of data gradually become richer and databases grow larger, the complexity of database-based algorithm recommendation increases significantly, and algorithm recommendation for time series data becomes very inefficient.
  • the present application provides a data analysis system, method and device, which adopts a step-by-step feature extraction framework to extract features of the data of interest to the user, so that the data features of the data source can be extracted quickly and suitable analysis operators can then be recommended quickly for the data source, thereby improving the efficiency of data algorithm recommendation in complex data scenarios and the user experience.
  • a data analysis system, comprising: a feature extraction module, configured to: receive a first data set determined by a user from a data system; determine metadata corresponding to data of a first data subset based on the first data set, the first data subset being the first data set or a subset of the first data set; and determine target features of the metadata based on at least two layers of feature operators, the at least two layers of feature operators comprising a first layer of operators and a second layer of operators, the first layer of operators being configured to extract the target features based on the metadata, the second layer of operators being configured to extract the target features based on the metadata when the first layer of operators fails to extract the target features, and the complexity of the second layer of operators being higher than that of the first layer of operators; and an operator recommendation module, configured to determine a first operator set based on the target features, the first operator set being used for analyzing the first data set.
  • the first data set is a data source selected by a user from a data system, for example, a data source of interest selected by the user.
  • the first data set is used as an input of the feature extraction module.
  • the first data set may be a first time series, and the first time series includes a set of data arranged in time sequence. That is, the first data set may include a set of time series data.
  • the first data set may include other data types, such as non-time series data, etc., which is not limited in the embodiments of the present application.
  • the feature extraction module includes multiple layers of operators with different complexities.
  • the calculation proceeds layer by layer, starting from the operators with the lowest complexity, until the data features are captured.
  • the operator recommendation module then recommends a suitable set of operators.
  • This step-by-step feature extraction framework can improve the efficiency of data feature extraction, and then quickly recommend suitable analysis operators for the data source, thereby improving the efficiency of data algorithm recommendations and user experience in complex data scenarios.
  • the first-layer operator is specifically used to calculate the metadata to obtain a first feature value, and when the first feature value satisfies a first condition, the first-layer operator extracts the target feature; when the first feature value does not satisfy the first condition, the second-layer operator is specifically used to calculate the metadata to obtain a second feature value, and when the second feature value satisfies the first condition, the second-layer operator extracts the target feature.
  • operator layers of different complexity calculate the metadata to obtain feature values, and determine whether the data features are extracted based on whether the feature values meet preset conditions.
  • the preset conditions can be set according to user needs, and the method is more applicable.
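  • The layer-by-layer fallback described above can be sketched as follows. This is a minimal illustration, not the implementation from the application: the function name `extract_target_feature`, the representation of metadata as a list of numbers, and the two example operators (`value_range`, `variance`) are all assumptions made for the example.

```python
from typing import Callable, List, Optional, Sequence, Tuple

# Assumption: an operator maps metadata (here a list of floats) to one feature value.
Operator = Callable[[Sequence[float]], float]


def extract_target_feature(
    metadata: Sequence[float],
    layers: Sequence[Sequence[Tuple[str, Operator]]],
    first_condition: Callable[[float], bool],
) -> Optional[Tuple[int, str, float]]:
    """Try operator layers in order of increasing complexity.

    Returns (layer_index, operator_name, feature_value) for the first feature
    value that satisfies the condition, or None if every layer fails.
    """
    for layer_index, layer in enumerate(layers):
        for name, operator in layer:
            value = operator(metadata)
            if first_condition(value):
                return layer_index, name, value
    return None


# Hypothetical operators: a cheap range and a somewhat costlier variance.
def value_range(xs: Sequence[float]) -> float:
    return max(xs) - min(xs)


def variance(xs: Sequence[float]) -> float:
    mean = sum(xs) / len(xs)
    return sum((x - mean) ** 2 for x in xs) / len(xs)


layers: List[List[Tuple[str, Operator]]] = [
    [("range", value_range)],   # first layer: lowest complexity
    [("variance", variance)],   # second layer: used only if the first fails
]

# The range (3.0) fails the condition, so the second layer is consulted.
result = extract_target_feature([1.0, 2.0, 3.0, 4.0], layers, lambda v: v < 2.0)
```

  • Because the loop returns as soon as a feature value satisfies the condition, the more expensive layers run only when the cheaper ones fail, which is where the efficiency gain comes from.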
  • the operator recommendation module is specifically used to determine the first operator set based on a reference feature set and the target feature, wherein the reference feature set is a set of feature values calculated based on a training data set using a preset operator set, and the preset operator set includes some or all of the operators of the two layers of operators.
  • the operator recommendation module determines the operator set of the target feature based on the feature set corresponding to the preset training data, so as to recommend a more suitable operator set.
  • the operator recommendation module is specifically used to determine the similarity between the target feature and the reference features in the reference feature set.
  • when the similarity between the target feature and a first reference feature is greater than a similarity threshold, the first operator set corresponding to the first reference feature is determined to be the operator set of the target feature, where the first operator set belongs to the preset operator set and the first reference feature belongs to the reference feature set.
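  • A minimal sketch of the threshold-based matching described above. The application does not name a similarity measure, so cosine similarity is an assumption here, as are the function names and the choice to return the most similar reference when several exceed the threshold.

```python
import math
from typing import List, Optional, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length feature vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def recommend_operator_set(
    target_feature: Sequence[float],
    reference_set: Sequence[Tuple[Sequence[float], List[str]]],
    similarity_threshold: float,
) -> Optional[List[str]]:
    """Return the operator set of the most similar reference feature whose
    similarity is strictly greater than the threshold, or None if there is none."""
    best_ops, best_sim = None, similarity_threshold
    for reference_feature, operator_set in reference_set:
        sim = cosine_similarity(target_feature, reference_feature)
        if sim > best_sim:
            best_ops, best_sim = operator_set, sim
    return best_ops


# Hypothetical reference feature set built from labeled training data.
reference_set = [
    ([1.0, 0.0], ["moving_average"]),
    ([0.0, 1.0], ["arima"]),
]
recommended = recommend_operator_set([0.9, 0.1], reference_set, 0.9)
```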
  • the system further includes: an operator evaluation module, configured to evaluate the complexity of a custom operator, wherein the complexity of the custom operator is used to determine whether to embed the custom operator into one of at least two layers of feature operators in the feature extraction module.
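  • One way the operator evaluation module could decide where a custom operator belongs is sketched below. The application does not say how complexity is measured; wall-clock time on a sample and per-layer upper bounds are assumptions for this example, and `measure_cost` and `assign_layer` are hypothetical names.

```python
import time
from typing import Callable, Optional, Sequence


def measure_cost(
    operator: Callable[[Sequence[float]], float],
    sample: Sequence[float],
    repeats: int = 5,
) -> float:
    """Best wall-clock time in seconds over several runs on a data sample."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        operator(sample)
        best = min(best, time.perf_counter() - start)
    return best


def assign_layer(cost: float, layer_upper_bounds: Sequence[float]) -> Optional[int]:
    """Return the index of the first layer whose cost bound admits the operator.

    Bounds are listed from the cheapest layer to the most expensive one.
    None means the operator is too expensive to embed in any layer.
    """
    for layer_index, bound in enumerate(layer_upper_bounds):
        if cost <= bound:
            return layer_index
    return None


# Hypothetical bounds: layer 0 up to 1 ms, layer 1 up to 100 ms.
bounds = [0.001, 0.1]
layer_for_custom = assign_layer(measure_cost(sum, [1.0, 2.0, 3.0]), bounds)
```

  • Keeping the measurement and the layer decision separate makes it easy to swap the timing for another complexity estimate without touching the embedding rule.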
  • a data analysis method in a cloud service system is provided.
  • the method can be applied to a data analysis system architecture, or can be executed by components (such as chips or circuits) in the cloud service system architecture, without limitation.
  • the method includes: a feature extraction module receives a first data set determined by a user from a data system; the feature extraction module determines metadata corresponding to data of a first data subset based on the first data set, the first data subset being the first data set or a subset of the first data set; the feature extraction module determines a target feature of the metadata based on at least two layers of feature operators, where the at least two layers of feature operators include a first layer of operators and a second layer of operators, the first layer of operators is used to extract the target feature based on the metadata, the second layer of operators is used to extract the target feature based on the metadata when the first layer of operators fails to extract the target feature, and the complexity of the second layer of operators is higher than that of the first layer of operators; and an operator recommendation module determines a first operator set based on the target feature, the first operator set being used for analyzing the first data set.
  • the first data set is a data source selected by a user from a data system, for example, a data source of interest selected by the user.
  • the first data set is used as an input of the feature extraction module.
  • the first data set may be a first time series, and the first time series includes a set of data arranged in time sequence. That is, the first data set may include a set of time series data.
  • the first data set may include other data types, such as non-time series data, etc., which is not limited in the embodiments of the present application.
  • the first layer operator is specifically used to calculate the metadata to obtain a first feature value, and when the first feature value satisfies a first condition, the first layer operator extracts the target feature; when the first feature value does not satisfy the first condition, the second layer operator is specifically used to calculate the metadata to obtain a second feature value, and when the second feature value satisfies the first condition, the second layer operator extracts the target feature.
  • the operator recommendation module determines the first operator set based on a reference feature set and the target feature, wherein the reference feature set is a set of feature values calculated based on a training data set using a preset operator set, and the preset operator set includes some or all of the operators of the two layers of operators.
  • the operator recommendation module determines the similarity between the target feature and the reference features in the reference feature set.
  • when the similarity between the target feature and a first reference feature is greater than a similarity threshold, the first operator set corresponding to the first reference feature is determined to be the operator set of the target feature, where the first operator set belongs to the preset operator set and the first reference feature belongs to the reference feature set.
  • the method further includes: an operator evaluation module evaluates the complexity of a custom operator, and the complexity of the custom operator is used to determine whether to embed the custom operator into one of the at least two layers of feature operators in the feature extraction module.
  • a cloud service system comprising: at least one processor, configured to execute a computer program or instruction stored in a memory, so as to execute the method in any possible implementation of the second aspect.
  • the device further comprises a memory, configured to store a computer program or instruction.
  • the device further comprises a communication interface, and the processor reads the computer program or instruction stored in the memory through the communication interface.
  • the present application provides a processor, comprising: an input circuit, an output circuit, and a processing circuit.
  • the processing circuit is used to receive a signal through the input circuit and transmit a signal through the output circuit, so that the processor executes any possible operation in the second aspect.
  • the processor may be one or more chips, the input circuit may be an input pin, the output circuit may be an output pin, and the processing circuit may be a transistor, a gate circuit, a flip-flop, and various logic circuits.
  • the input signal received by the input circuit may be received and input by, for example, but not limited to, a receiver; the signal output by the output circuit may be, for example, but not limited to, output to a transmitter and transmitted by the transmitter; and the input circuit and the output circuit may be the same circuit, which is used as an input circuit and an output circuit at different times.
  • the embodiments of the present application do not limit the specific implementation methods of the processor and various circuits.
  • a processing device comprising a processor and a memory.
  • the processor is used to read instructions stored in the memory, and can receive signals through a receiver and transmit signals through a transmitter to execute the method in any possible implementation of the second aspect.
  • the number of the processors is one or more, and the number of the memories is one or more.
  • the memory may be integrated with the processor, or the memory may be provided separately from the processor.
  • the memory can be a non-transitory memory, such as a read-only memory (ROM), which can be integrated with the processor on the same chip or can be separately set on different chips.
  • the embodiments of the present application do not limit the type of memory and the setting method of the memory and the processor.
  • in the related data interaction process, outputting indication information can be a process of the processor outputting the indication information, and receiving capability information can be a process of the processor receiving the input capability information.
  • the data output by the processor can be output to the transmitter, and the input data received by the processor can come from the receiver.
  • the transmitter and the receiver can be collectively referred to as a transceiver.
  • the processing device in the fifth aspect may be one or more chips.
  • the processor in the processing device may be implemented by hardware or software.
  • When implemented by hardware, the processor may be a logic circuit, an integrated circuit, etc.; when implemented by software, the processor may be a general-purpose processor implemented by reading software code stored in a memory, which may be integrated in the processor or located outside the processor and exist independently.
  • a chip which obtains instructions and executes the instructions to implement the method in the above-mentioned second aspect and any one of the implementation methods of the second aspect.
  • the chip includes a processor and a data interface, and the processor reads instructions stored in the memory through the data interface to execute the method in the above-mentioned second aspect and any one of the implementation manners of the second aspect.
  • the chip may also include a memory, in which instructions are stored, and the processor is used to execute the instructions stored in the memory.
  • the processor is used to execute the method in the second aspect and any one of the implementation methods of the second aspect.
  • a computer program product comprising: a computer program code, when the computer program code is run on a computer, the computer executes the method in the above-mentioned second aspect and any one of the implementations of the second aspect.
  • a computer-readable storage medium comprising instructions; the instructions are used to implement the method in the above-mentioned second aspect and any one of the implementation manners of the second aspect.
  • the computer-readable storage medium includes, but is not limited to, one or more of the following: read-only memory (ROM), programmable ROM (PROM), erasable PROM (EPROM), flash memory, electrically erasable PROM (EEPROM), and hard drive.
  • the above-mentioned storage medium may specifically be a non-volatile storage medium.
  • a computing device comprising a processor and a memory, wherein the processor of the computing device is used to execute instructions stored in the memory so that the computing device executes any possible implementation method of the second aspect.
  • a computing node cluster which includes at least one computing node, each computing node includes a processor and a memory, and the processor of the at least one computing node is used to execute instructions stored in the memory of the at least one computing node, so that the computing node cluster executes any possible implementation method of the second aspect above.
  • FIG. 1 is a schematic diagram of a system architecture provided in an embodiment of the present application.
  • FIG. 2 is a schematic block diagram of a data analysis system provided in an embodiment of the present application.
  • FIG. 3 is a schematic flow chart of an operator layering method provided in an embodiment of the present application.
  • FIG. 4 is a schematic diagram of the structure of a feature extraction module provided in an embodiment of the present application.
  • FIG. 5 is a schematic diagram of a data analysis method provided in an embodiment of the present application.
  • FIG. 6 is a flowchart of an operator recommendation provided in an embodiment of the present application.
  • FIG. 7 is a schematic block diagram of a computing device 100 provided in the present application.
  • FIG. 8 is a schematic block diagram of a computing device cluster provided by the present application.
  • FIG. 9 is a schematic block diagram of another computing device cluster provided by the present application.
  • references to “one embodiment” or “some embodiments” etc. described in this specification mean that a particular feature, structure or characteristic described in conjunction with the embodiment is included in one or more embodiments of the present application.
  • the phrases “in one embodiment”, “in some embodiments”, “in some other embodiments”, etc. appearing in different places in this specification do not necessarily all refer to the same embodiment, but mean “one or more but not all embodiments”, unless otherwise specifically emphasized in other ways.
  • the terms “including”, “comprising”, “having” and their variations all mean “including but not limited to”, unless otherwise specifically emphasized in other ways.
  • “At least one” means one or more, and “plural” means two or more.
  • “And/or” describes the association relationship of associated objects, indicating that three relationships may exist.
  • For example, A and/or B can mean: A exists alone, A and B exist at the same time, or B exists alone, where A and B can be singular or plural.
  • the character “/” generally indicates that the previous and next associated objects are in an “or” relationship.
  • “At least one of the following” or similar expressions refers to any combination of these items, including any combination of single or plural items.
  • At least one of a, b, or c can mean: a, b, c, a-b, a-c, b-c, or a-b-c, where a, b, c can be single or multiple.
  • FIG. 1 shows a schematic diagram of a system architecture provided by an embodiment of the present application.
  • a client can access a cloud management platform via the Internet.
  • the cloud management platform is connected to the internal network of a data center.
  • a data center includes multiple servers.
  • the data center shown in the figure includes two servers.
  • server #1 includes a software layer and a hardware layer.
  • the software layer can include multiple virtual machines, and a host operating system, and the host operating system includes a virtual machine manager and a cloud management platform client;
  • the hardware layer can include a processor, memory, hard disk, network card, and data bus, etc.
  • the cloud management platform is used to provide an access interface (for example, the cloud management platform is used to provide an interface or an application programming interface (API)).
  • the tenant can operate the client to remotely access the access interface, register a cloud account and password on the cloud management platform, and log in to the cloud management platform.
  • After the cloud management platform successfully authenticates the cloud account and password, the tenant can pay to select and purchase a virtual machine of specific specifications (processor, memory, disk) on the cloud management platform. After the payment succeeds, the cloud management platform provides the remote login account and password of the purchased virtual machine, and the client can remotely log in to the virtual machine and install and run the tenant's application in it.
  • the cloud management platform client can be used to receive the control plane command sent by the cloud management platform, create and manage the virtual machine on the server according to the control plane command, and perform full life cycle management on the virtual machine. Therefore, the tenant can create, manage, log in to and operate the virtual machine in the cloud data center through the cloud management platform.
  • the virtual machine can also be called “cloud server (elastic compute service, ECS)", “elastic instance”, etc. Different cloud service providers have different names.
  • the database that stores time series data can accommodate large-scale time series data and support basic data analysis such as query and compression of time series data as well as aggregation, downsampling, and statistics in time series scenarios.
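  • As an illustration of the basic downsampling and aggregation mentioned above, the following sketch averages points into fixed-width time buckets. It is a generic example, not the behavior of any particular time series database; the function name and the (timestamp, value) tuple layout are assumptions.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def downsample_mean(
    points: Iterable[Tuple[int, float]], bucket_seconds: int
) -> List[Tuple[int, float]]:
    """Group (timestamp_seconds, value) points into fixed-width buckets
    and return (bucket_start, mean_value) pairs in time order."""
    buckets: Dict[int, List[float]] = defaultdict(list)
    for timestamp, value in points:
        bucket_start = timestamp - timestamp % bucket_seconds
        buckets[bucket_start].append(value)
    return [
        (start, sum(values) / len(values))
        for start, values in sorted(buckets.items())
    ]


# Four 30-second samples reduced to two 60-second averages.
minute_means = downsample_mean([(0, 1.0), (30, 3.0), (60, 5.0), (90, 7.0)], 60)
```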
  • Time series analysis is a high-level analysis method for time series data that is independent of the basic analysis of time series databases. It includes time series anomaly detection, time series prediction, clustering, association analysis, etc. It uses different analysis methods to mine the deeper value of time series data. Common methods include statistical methods, Bayesian analysis methods, deep learning methods, and machine learning methods.
  • Feature extraction has many applications in machine learning, pattern recognition and image processing. Feature extraction starts from an initial measured data set and then constructs informative and non-redundant derived values, called features. It can help subsequent learning and inductive steps, and in some cases make it easier for people to make better interpretations of the data. Feature extraction is a dimensionality reduction step, where the initial data set is reduced to more manageable groups (features) for learning, while maintaining the accuracy and completeness of the description of the original data set.
  • Recommendation system mainly refers to the technology of using collaborative intelligence for recommendation.
  • Personalized recommendation system can effectively solve the problem of information overload.
  • Recommendation system provides users with a sorted personalized list of item recommendations based on their historical preferences and constraints.
  • a more accurate recommendation system can enhance and improve the user experience.
  • Recommendation results can usually be generated based on user preferences, product features, user-product transactions, and other environmental factors (such as time, season, location, etc.).
  • Recommended items can include movies, books, restaurants, news items, etc.
  • data features include data features of the data itself and/or extracted data features.
  • the features of the data itself are the features of the data in the time series.
  • the data describing the features of the data itself include: data arrangement period data, data change trend data, or data fluctuation data, etc.
  • the data arrangement period refers to the period involved in the data arrangement in the time series if the data is arranged periodically in the time series.
  • the data of the data arrangement period includes the period duration (that is, the time interval between two periods) and/or the number of periods;
  • the data change trend data is used to reflect the changing trend of the data arrangement in the time series (that is, the data change trend), for example, the data includes: continuous growth, continuous decline, first rise and then fall, first fall and then rise, or meet the normal distribution, etc.;
  • the data fluctuation data is used to reflect the fluctuation state of the data in the time series (that is, data fluctuation), for example, the data includes a function that characterizes the fluctuation curve of the time series, or a specified value of the time series, such as the maximum value, minimum value or average value.
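  • The three kinds of features of the data itself (arrangement period, change trend, fluctuation) can be illustrated with the sketch below. The concrete methods are assumptions chosen for brevity: the period is taken as the lag with the largest positive autocorrelation, the trend is classified from first differences, and fluctuation is summarized by the maximum, minimum and mean.

```python
from typing import Dict, Optional, Sequence


def estimate_period(series: Sequence[float]) -> Optional[int]:
    """Lag (in samples) with the largest positive autocorrelation, or None.

    Best applied to data without a strong trend; a trend alone also
    produces positive autocorrelation at small lags.
    """
    n = len(series)
    mean = sum(series) / n
    centered = [x - mean for x in series]
    best_lag, best_score = None, 0.0
    for lag in range(1, n // 2 + 1):
        score = sum(centered[i] * centered[i + lag] for i in range(n - lag))
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag


def classify_trend(series: Sequence[float]) -> str:
    """Label the change trend using the categories named in the text."""
    if len(series) < 2:
        return "other"
    diffs = [b - a for a, b in zip(series, series[1:])]
    if all(d > 0 for d in diffs):
        return "continuous growth"
    if all(d < 0 for d in diffs):
        return "continuous decline"
    values = list(series)
    peak, trough = values.index(max(values)), values.index(min(values))
    last = len(values) - 1
    if 0 < peak < last and all(d > 0 for d in diffs[:peak]) and all(d < 0 for d in diffs[peak:]):
        return "first rise and then fall"
    if 0 < trough < last and all(d < 0 for d in diffs[:trough]) and all(d > 0 for d in diffs[trough:]):
        return "first fall and then rise"
    return "other"


def fluctuation_summary(series: Sequence[float]) -> Dict[str, float]:
    """Specified values of the series: maximum, minimum and mean."""
    return {"max": max(series), "min": min(series), "mean": sum(series) / len(series)}
```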
  • the extracted data features are the features in the process of extracting the data in the time series.
  • the extracted features include statistical features, fitting features or frequency domain features, etc.
  • the extracted feature data include statistical feature data, fitting feature data or frequency domain feature data, etc.
  • Statistical features refer to the statistical features of a time series. Statistical features are divided into quantitative features and attribute features, and quantitative features are further divided into measurement features and counting features. Quantitative features can be directly expressed by numerical values: for example, the consumption values of resources such as CPU, memory, and IO are measurement features, and the number of abnormalities and the number of devices working normally are counting features. Attribute features cannot be directly expressed by numerical values, such as whether the device is abnormal or whether the device is down.
  • the features in the statistical features are the indicators that need to be examined when statistics are performed.
  • the statistical feature data include moving average (Moving_average), weighted average (Weighted_mv), etc.
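  • The two statistical features named above can be computed as in the following sketch. Reading `Weighted_mv` as a weighted moving average is an assumption, as are the function names and the convention that the weights apply from the oldest to the newest sample in each window.

```python
from typing import List, Sequence


def moving_average(series: Sequence[float], window: int) -> List[float]:
    """Unweighted mean of each sliding window of the given length."""
    if window <= 0 or window > len(series):
        raise ValueError("window must be between 1 and len(series)")
    return [
        sum(series[i:i + window]) / window
        for i in range(len(series) - window + 1)
    ]


def weighted_moving_average(
    series: Sequence[float], weights: Sequence[float]
) -> List[float]:
    """Weighted mean of each sliding window; weights run oldest to newest."""
    width, total = len(weights), sum(weights)
    if width == 0 or width > len(series) or total == 0:
        raise ValueError("weights must be non-empty, fit the series, and not sum to zero")
    return [
        sum(x * w for x, w in zip(series[i:i + width], weights)) / total
        for i in range(len(series) - width + 1)
    ]


smoothed = moving_average([1.0, 2.0, 3.0, 4.0, 5.0], 3)
recent_heavy = weighted_moving_average([2.0, 4.0, 8.0], [1.0, 3.0])
```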
  • fitting features are the features of a time series during fitting, and the fitting feature data reflects those features; for example, the fitting feature data includes the algorithm used for fitting, such as ARIMA.
  • frequency domain features are the features of the time series in the frequency domain, and the frequency domain features are used to reflect the features of the time series in the frequency domain.
  • the frequency domain feature data includes: data on the regularity followed by the distribution of the time series in the frequency domain, such as the proportion of high-frequency components in the time series.
  • the frequency domain feature data can be obtained by performing wavelet decomposition on the time series.
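  • A small sketch of how the proportion of high-frequency components could be obtained from a wavelet decomposition. The application does not specify the wavelet or the number of levels; a single-level Haar decomposition and an energy ratio are assumptions made for this example.

```python
import math
from typing import List, Sequence, Tuple


def haar_decompose(series: Sequence[float]) -> Tuple[List[float], List[float]]:
    """Single-level Haar wavelet decomposition.

    Returns (approximation, detail) coefficients; the detail coefficients
    carry the high-frequency content. The series length must be even.
    """
    if len(series) == 0 or len(series) % 2:
        raise ValueError("series length must be a positive even number")
    scale = math.sqrt(2.0)
    approximation = [(series[i] + series[i + 1]) / scale for i in range(0, len(series), 2)]
    detail = [(series[i] - series[i + 1]) / scale for i in range(0, len(series), 2)]
    return approximation, detail


def high_frequency_ratio(series: Sequence[float]) -> float:
    """Share of the signal energy held by the detail (high-frequency) coefficients."""
    approximation, detail = haar_decompose(series)
    detail_energy = sum(d * d for d in detail)
    total_energy = detail_energy + sum(a * a for a in approximation)
    return detail_energy / total_energy if total_energy else 0.0
```

  • A constant series yields a ratio of 0, while a series that alternates sign at every sample puts all of its energy in the detail coefficients.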
  • time series anomaly detection and prediction scenarios can be provided for time series data with large data volumes and complex and diverse formats.
  • For example, the cloud vendor AWS can provide low-computation visualization and intelligent analysis capabilities based on metadata such as field attributes and data dimensions.
  • With a high-performance RCF algorithm, a single algorithm can meet simple user needs.
  • the academic community hopes to consider metadata and the data itself, but the amount of calculation will increase linearly with the number of preset algorithms and the amount of data, making it difficult to implement. Therefore, as the types of data gradually increase and the database grows larger, the complexity of algorithm recommendations based on the database increases significantly, and the efficiency of algorithm recommendations for time series data is also very low.
  • FIG. 2 is a schematic diagram of a data analysis system 200 proposed in an embodiment of the present application.
  • the system consists of four parts: front end, computing engine, database and recommendation engine.
  • the front end is used for users to execute operation commands and display operation results.
  • users can select the data source of interest on the UI front end, and the front end can display the recommendation page generated by the back end based on the selected data source, and generate a recommended operator set based on the automatically defined analysis task.
  • users can enter custom operators on the front end to expand feature operators based on preset operators.
  • the calculation engine is used to evaluate the user-defined operator.
  • the calculation engine includes an evaluation module, based on which the complexity of the user-defined operator is evaluated and the evaluation result is input into the database.
  • the database is used to store the operator library and the data input by the front end.
  • the custom operator evaluated by the computing engine can be added to the operator library, which includes multiple layers of operators of different complexity, and operators of different complexity are used to capture data features.
  • the system automatically loads the corresponding metadata and partial data sampling for the data input by the front end.
  • the recommendation engine is used to recommend operators for metadata and data sampling.
  • the recommendation engine includes a feature extraction module and an operator recommendation module.
  • the feature extraction module is used to calculate the feature values corresponding to different feature operators.
  • the operator recommendation module is used to determine the operator set corresponding to the features obtained by the feature extraction module based on the feature values obtained by the training data set and the preset algorithm set, thereby realizing the recommendation of the operator set.
  • training data and preset operator sets are stored in the database.
  • the training data can be understood as a data set determined by the user based on the importance of historical application needs.
  • the training data has been labeled, that is, the corresponding operators of the training data are known;
  • the preset operator set is a set of feature operators determined by the user based on historical calculations.
  • the feature extraction module uses a feature operator to calculate the data to obtain a feature value, wherein the feature operator can be processed in layers.
  • the specific hierarchical processing flow is shown in FIG. 3.
  • Step a: calculate the labeled training data sets using all preset operator sets to obtain the feature values of each training data set and the corresponding operator set, that is, the relationship matrix between operators and feature values.
  • Step b: obtain the relationship matrix between operators through convolution between operators.
  • Step c: weight the relationship between operators and the performance of operators to obtain the operator stratification.
  • operator stratification is to divide the associations and performance factors between the preset operators and comprehensive operators into different layers, and operators in different layers have different complexities.
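Step c above can be illustrated with a minimal sketch. The text does not specify how the weighting is done, so everything concrete here is an assumption: the per-operator `association` and `cost` scores, the two weights, and the equal-size split into layers are all hypothetical.

```python
from typing import Dict, List


def stratify_operators(
    association: Dict[str, float],
    cost: Dict[str, float],
    w_assoc: float = 0.5,
    w_cost: float = 0.5,
    num_layers: int = 3,
) -> List[List[str]]:
    """Split operators into layers of increasing combined score.

    association and cost are hypothetical per-operator scores; the weights
    and the equal-size split are illustrative assumptions, not the method
    described in the text.
    """
    if num_layers < 2:
        raise ValueError("the text describes at least two layers")
    score = {
        name: w_assoc * association[name] + w_cost * cost[name]
        for name in association
    }
    ranked = sorted(score, key=score.get)  # lowest combined score first
    layers: List[List[str]] = [[] for _ in range(num_layers)]
    for rank, name in enumerate(ranked):
        # Map the rank to a layer index so layers are as even as possible.
        layers[rank * num_layers // len(ranked)].append(name)
    return layers
```

With six operators and three layers, the two lowest-scoring operators land in the first layer, the next two in the second, and the remaining two in the third.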
  • FIG. 4 is a schematic diagram of the structure of a feature extraction module proposed in an embodiment of the present application.
  • FIG. 4 shows an operator hierarchical structure, where the first layer of operators are ultra-lightweight operators, the second layer of operators are lightweight operators, and the third layer of operators are heavy operators.
  • the complexity of ultra-lightweight operators is lower than that of lightweight operators, and the complexity of lightweight operators is lower than that of heavy operators.
  • the relationship between the above operators can be understood as follows: data that the lower-layer operators cannot process can be processed by the upper-layer operators. That is, data that the ultra-lightweight operator cannot process can be processed by the lightweight operator, and data that the lightweight operator cannot process can be processed by the heavy operator.
  • the operator stratification may include at least two layers of feature operators.
  • the feature extraction module includes a first layer of operators and a second layer of operators.
  • the feature extraction module includes a first layer of operators, a second layer of operators, a third layer of operators, and a fourth layer of operators. The embodiments of the present application are not limited to this.
  • the data that can be processed by the lower-layer operators does not need to be processed by the upper-layer operators, that is, no more calculations are performed. In other words, most of the data are stable and can be processed by the lower-layer operators to extract data features, thereby significantly improving the efficiency of data feature extraction.
  • the data analysis system proposed in the present application can be used for the analysis of time series data as well as for the analysis of other types of data, and the embodiments of the present application are not limited to this.
  • FIG. 5 is a flowchart of a data analysis method 500 proposed in an embodiment of the present application, which specifically includes steps S510 to S530.
  • the feature extraction module is used to receive a first data set determined by a user from a data system.
  • the first data set is a data source selected by a user from a data system, for example, a data source of interest selected by the user.
  • the first data set serves as input to the feature extraction module.
  • the data system may be a database, etc.
  • the first data set may be a first time series, and the first time series includes a set of data arranged in time sequence. That is, the first data set may include a set of time series data.
  • the first data set is described by taking the first time series as an example.
  • the first data set may include other data types, such as non-time series data, etc., and the embodiments of the present application are not limited to this.
  • the feature extraction module determines metadata corresponding to the data of the first data subset according to the first data set.
  • the first data subset may be the first data set, or a subset of the first data set.
  • the feature extraction module can determine metadata corresponding to the data of the first data set based on the first data set, that is, after the user selects a data source of interest, the system can automatically load the metadata corresponding to the data source.
  • the feature extraction module may determine metadata corresponding to the subset of the first data set based on the subset of the first data set, that is, after the user selects a data source of interest, the system may automatically load sampled data of the data source and metadata corresponding to the sampled data.
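The sampling and metadata loading described above can be sketched as follows. The text does not say what the metadata contains or how the sample is drawn, so the `(timestamp, value)` record layout, the every-n-th-point sampling, and the summary fields are all assumptions made for illustration.

```python
from typing import Dict, List, Tuple

Point = Tuple[int, float]  # (timestamp, value); a hypothetical record layout


def sample_series(series: List[Point], step: int) -> List[Point]:
    """Keep every `step`-th point (systematic sampling, an assumed strategy)."""
    if step < 1:
        raise ValueError("step must be >= 1")
    return series[::step]


def describe(series: List[Point]) -> Dict[str, float]:
    """Compute simple descriptive metadata for a non-empty series."""
    if not series:
        raise ValueError("series must not be empty")
    values = [value for _, value in series]
    return {
        "count": len(series),
        "start": series[0][0],
        "end": series[-1][0],
        "min": min(values),
        "max": max(values),
        "mean": sum(values) / len(values),
    }
```

Calling `describe(sample_series(series, 10))` corresponds to the second case in the text, where the first data subset is a sampled subset of the first data set; calling `describe(series)` corresponds to the first case.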
  • the feature extraction module determines target features of the metadata based on at least two layers of feature operators.
  • the target features include data features of the first time series.
  • the target feature may include a target feature vector.
  • the feature extraction module calculates the input metadata to determine the target features.
  • the feature extraction module includes at least two layers of feature operators, namely, a first layer operator and a second layer operator.
  • the first layer operator is used to extract target features based on metadata.
  • the second layer operator extracts the target features based on the metadata when the first layer operator fails to extract them.
  • the complexity of the second layer operator is higher than that of the first layer operator.
  • the feature extraction module may also include a third layer operator, a fourth layer operator, etc.
  • the embodiment of the present application takes the first layer operator and the second layer operator of at least two layers of operators as an example, and the embodiment of the present application does not limit this.
  • after the metadata is input into the feature extraction module, it first passes through the first layer operator, which calculates the first feature value. When the first feature value meets the first condition, the first layer operator has extracted the target feature and the feature extraction process is complete. If the first feature value does not meet the first condition, the metadata enters the second layer operator, which calculates the second feature value. When the second feature value meets the first condition, the second layer operator has extracted the target feature and the feature extraction process is complete.
  • if the second feature value does not meet the first condition either, the third layer operator calculates the third feature value and checks it against the first condition to determine whether the target feature has been extracted, and so on, until the first condition is met and the feature extraction process ends.
  • the feature value may specifically be a feature value vector C, which includes multiple feature values.
  • C is the feature value vector
  • n is the number of feature operators
  • each feature operator corresponds to one feature value.
  • the feature value corresponding to an uncalculated feature operator is 0.
  • an uncalculated feature operator can be understood as follows: when the first feature value vector satisfies the first condition, the second-layer operators are not calculated, and the feature values corresponding to the feature operators in the second layer are 0.
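The definitions above refer to a formula that is not reproduced in the text. A hedged reconstruction of the notation, assuming the vector has one component per feature operator and using the index j only for illustration, would be:

```latex
C = (c_1, c_2, \ldots, c_n), \qquad
c_j = 0 \quad \text{if feature operator } j \text{ was not calculated}
```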
  • extracting the target feature can be understood as the feature value calculated by the feature operator satisfying the first condition.
  • the first condition may be a threshold value.
  • if the feature values in the feature value vector are all greater than the threshold, the first condition is considered satisfied and the data feature is captured.
  • the first condition may also be an interval.
  • if the feature values in the feature value vector are all within the interval, the first condition is considered satisfied and the data feature is captured.
  • the first condition may be any preset condition, and the embodiment of the present application does not limit this.
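The two example forms of the first condition can be written as small predicates over a feature value vector. This is a minimal sketch: the function names are hypothetical, and treating the interval as closed at both ends is an assumption, since the text does not say whether the bounds are included.

```python
from typing import Callable, Sequence

Condition = Callable[[Sequence[float]], bool]


def threshold_condition(threshold: float) -> Condition:
    """Satisfied when every feature value is greater than the threshold."""
    return lambda values: all(v > threshold for v in values)


def interval_condition(low: float, high: float) -> Condition:
    """Satisfied when every feature value lies in [low, high] (closed bounds assumed)."""
    return lambda values: all(low <= v <= high for v in values)
```

Because the condition is just a callable, any other preset condition can be supplied in the same way. Note that both predicates are trivially true for an empty vector, which a real system would probably want to reject.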
  • the first layer of operators is the lowest complexity operator layer
  • the second layer of operators is the operator layer with higher complexity than the first layer of operators
  • the complexity increases layer by layer.
  • Feature extraction starts from the lowest complexity operator layer. If the lower layer operator can capture the target feature, the feature extraction ends and no further calculation is required. Otherwise, it is necessary to go up layer by layer until the target feature is captured.
  • the following takes the feature extraction module shown in FIG. 4 as an example to explain in detail how the step-by-step feature extraction framework determines the target feature vector.
  • the metadata is first calculated by the ultra-lightweight operator in this module to evaluate whether the data features can be captured. If the target features can be captured, the feature extraction process is completed and the eigenvalue vector is returned. If the target features are not captured, the lightweight operator continues to calculate. If the target features are captured, the feature extraction process is completed and the eigenvalue vector is returned. If the target features are not captured, the heavy operator continues to calculate and capture the target features.
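The step-by-step extraction described above can be sketched as a loop over operator layers ordered from lowest to highest complexity. This is an illustrative sketch, not the patented implementation: the function name, the representation of a layer as a dictionary of callables, and the choice to apply the condition to the values of the current layer only are assumptions.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Operator = Callable[[Sequence[float]], float]


def cascade_extract(
    data: Sequence[float],
    layers: List[Dict[str, Operator]],
    condition: Callable[[Sequence[float]], bool],
) -> Tuple[List[float], int]:
    """Run operator layers from cheapest to most expensive.

    Returns the feature value vector (one slot per operator across all
    layers, 0.0 for operators that were never calculated) and the index of
    the layer that satisfied the condition, or -1 if no layer did.
    """
    total = sum(len(layer) for layer in layers)
    vector = [0.0] * total
    offset = 0
    for index, layer in enumerate(layers):
        values = [op(data) for op in layer.values()]
        vector[offset:offset + len(values)] = values
        if condition(values):
            return vector, index  # higher layers are skipped entirely
        offset += len(values)
    return vector, -1
```

When the first layer already satisfies the condition, the slots of all higher layers stay at 0.0, which matches the convention that the feature value of an uncalculated operator is 0.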
  • the operator recommendation module determines a first operator set based on the target feature, where the first operator set is used to perform data analysis on the first data set.
  • the operator recommendation module is specifically configured to determine a first operator set based on a reference feature set and the target feature.
  • the reference feature set is a set of feature values calculated based on the training data set using a preset operator set.
  • FIG. 6 is a flowchart of operator recommendation proposed in an embodiment of the present application. The operator recommendation process is described in detail below with reference to FIG. 6.
  • Step a: All training data sets are calculated using all preset operator sets T to obtain the feature values corresponding to each operator set.
  • the feature values corresponding to all operator sets constitute the reference feature set.
  • the preset operator set is denoted T, and m is the number of operator sets.
  • $T = (t_1, t_2, t_3, \ldots, t_m)$
  • the preset operator set includes part or all of the operators in at least two layers of operators in the feature extraction module.
  • each training set in the training data set has a corresponding operator set, so a mapping relationship matrix between operator sets and feature values can be constructed.
  • R is the mapping relationship matrix between operator sets and feature values.
  • $R_i = (r_{c_1 t_i}, r_{c_2 t_i}, r_{c_3 t_i}, \ldots, r_{c_n t_i})$
  • $R_i$ is the feature vector corresponding to operator set $i$.
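One way to picture the mapping relationship matrix R is as a table with one reference vector per operator set. The sketch below builds such a table from labeled training series. The function name and the component-wise averaging of vectors that share a label are assumptions, since the text does not state how several training sets with the same operator set are combined.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Sequence, Tuple

Operator = Callable[[Sequence[float]], float]


def build_reference_features(
    training: List[Tuple[Sequence[float], str]],
    feature_ops: List[Operator],
) -> Dict[str, List[float]]:
    """Map each operator-set label t_i to a reference vector R_i.

    Each training item is (series, label). Vectors that share a label are
    averaged component-wise, which is an assumption made for this sketch.
    """
    grouped: Dict[str, List[List[float]]] = defaultdict(list)
    for series, label in training:
        grouped[label].append([op(series) for op in feature_ops])
    return {
        label: [sum(column) / len(column) for column in zip(*vectors)]
        for label, vectors in grouped.items()
    }
```

Each value in the returned dictionary plays the role of one row R_i, with one component per feature operator.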
  • Step b: Determine the recommended operator result based on the mapping relationship matrix between the operator sets and the feature values, in combination with the metadata.
  • the operator recommendation module can determine the similarity between the target feature and the reference feature in the reference feature set.
  • when the similarity between the target feature and the first reference feature is greater than the similarity threshold, the first operator set corresponding to the first reference feature is determined to be the operator set of the target feature, where the first operator set belongs to the preset operator set and the first reference feature belongs to the reference feature set.
  • the similarity between the target feature and the first reference feature is negatively correlated with the distance between the target feature and the first reference feature. That is, the greater the similarity of the two features, the smaller the distance, and the smaller the similarity, the greater the distance.
  • the features in the present application may be in the form of feature vectors, and the following description will be given using feature vectors as an example.
  • the distance between the target feature vector and the first reference feature vector may be determined first, and the similarity may be determined based on the obtained distance.
  • the distance between the target feature vector and the first reference feature vector can be obtained in a variety of ways, which is not limited in this embodiment of the present application.
  • step feature extraction is applied to the newly selected data to obtain a target feature vector $C_u$.
  • $U_i = DT(R_i, C_u)$
  • $U_i$ is the distance of the target feature vector $C_u$ from the reference feature vector $R_i$.
  • the first operator set corresponding to the reference feature vector with the shortest distance may be determined as the operator set of the target feature vector.
  • a similarity threshold is preset, and when the similarity reaches the threshold, the first operator set corresponding to the first reference feature vector can be considered as the operator set of the target feature vector.
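The recommendation step can be sketched as a nearest-reference lookup. The text leaves the distance function DT open, so Euclidean distance is used here as one possible choice, and the mapping from distance to similarity, 1 / (1 + distance), is an assumption chosen only because it decreases as the distance grows. The function names are hypothetical.

```python
import math
from typing import Dict, List, Optional, Sequence


def euclidean(a: Sequence[float], b: Sequence[float]) -> float:
    """One possible choice for the distance function DT."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def recommend(
    target: Sequence[float],
    references: Dict[str, List[float]],
    similarity_threshold: Optional[float] = None,
) -> Optional[str]:
    """Return the label of the closest reference vector.

    Similarity is taken as 1 / (1 + distance), so it falls as distance grows.
    If a threshold is given and the best similarity does not exceed it,
    None is returned.
    """
    if not references:
        return None
    best = min(references, key=lambda label: euclidean(references[label], target))
    similarity = 1.0 / (1.0 + euclidean(references[best], target))
    if similarity_threshold is not None and similarity <= similarity_threshold:
        return None
    return best
```

Without a threshold the function implements the shortest-distance rule; with a threshold it implements the rule that a reference is accepted only when its similarity is greater than the preset value.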
  • the data driver module, the transmission control module, and the information acquisition module can be implemented by software or by hardware.
  • the implementation of the data driver module is described below.
  • the implementation of the transmission control module and the information acquisition module can refer to the implementation of the data driver module.
  • module is used as an example of a software functional unit.
  • the feature extraction module and the operator recommendation module may include code running on a computing instance.
  • the computing instance may be at least one of a physical host (computing device), a virtual machine, a container and other computing devices.
  • the above-mentioned computing device may be one or more.
  • a data-driven module may include code running on multiple hosts/virtual machines/containers. It should be noted that the multiple hosts/virtual machines/containers used to run the application can be distributed in the same region or in different regions.
  • the multiple hosts/virtual machines/containers used to run the code can be distributed in the same AZ or in different AZs, and each AZ includes one data center or multiple data centers with close geographical locations. Usually, one region can include multiple AZs.
  • multiple hosts/virtual machines/containers used to run the code can be distributed in the same VPC or in multiple VPCs.
  • a VPC is set up in a region.
  • a communication gateway must be set up in each VPC to achieve interconnection between VPCs through the communication gateway.
  • the feature extraction module may include at least one computing device, such as a server, etc.
  • the feature extraction module may also be a device implemented using ASIC or PLD, etc.
  • the PLD may be implemented using CPLD, FPGA, GAL or any combination thereof.
  • the multiple computing devices included in the feature extraction module can be distributed in the same region or in different regions.
  • the multiple computing devices included in the feature extraction module can be distributed in the same AZ or in different AZs.
  • the multiple computing devices included in the feature extraction module can be distributed in the same VPC or in multiple VPCs.
  • the multiple computing devices can be any combination of computing devices such as servers, ASICs, PLDs, CPLDs, FPGAs, and GALs.
  • the present application also provides a computing device 100.
  • the computing device 100 includes: a bus 102, a processor 104, a memory 106, and a communication interface 108.
  • the processor 104, the memory 106, and the communication interface 108 communicate with each other through the bus 102.
  • the computing device 100 may be a server or a terminal device. It should be understood that the present application does not limit the number of processors and memories in the computing device 100.
  • the bus 102 may be a peripheral component interconnect (PCI) bus or an extended industry standard architecture (EISA) bus, etc.
  • the bus may be divided into an address bus, a data bus, a control bus, etc.
  • for ease of representation, the bus is represented by only one line in FIG. 8, but this does not mean that there is only one bus or one type of bus.
  • the bus 102 may include a path for transmitting information between various components of the computing device 100 (e.g., the memory 106, the processor 104, the communication interface 108).
  • the processor 104 may include any one or more processors such as a central processing unit (CPU), a graphics processing unit (GPU), a microprocessor (MP) or a digital signal processor (DSP).
  • the memory 106 may include a volatile memory, such as a random access memory (RAM).
  • the memory 106 may also include a non-volatile memory, such as a read-only memory (ROM), a flash memory, a hard disk drive (HDD), or a solid state drive (SSD).
  • the memory 106 stores executable program codes, and the processor 104 executes the executable program codes to respectively implement the functions of the aforementioned computing engine, database, feature extraction module, and operator recommendation module, thereby implementing the method of providing data analysis. That is, the memory 106 stores instructions for executing data analysis.
  • the communication interface 108 uses a transceiver module such as, but not limited to, a network interface card or a transceiver to implement communication between the computing device 100 and other devices or a communication network.
  • the embodiment of the present application also provides a computing device cluster.
  • the computing device cluster includes at least one computing device.
  • the computing device can be a server, such as a central server, an edge server, or a local server in a local data center.
  • the computing device can also be a terminal device such as a desktop computer, a laptop computer, or a smart phone.
  • the computing device cluster includes at least one computing device 100.
  • the memory 106 in one or more computing devices 100 in the computing device cluster may store the same instructions for performing data analysis.
  • the memory 106 of one or more computing devices 100 in the computing device cluster may also store partial instructions for performing data analysis.
  • the combination of one or more computing devices 100 may jointly execute instructions for performing data analysis.
  • the memory 106 in different computing devices 100 in the computing device cluster can store different instructions, and can implement the functions of one or more modules among the computing engine, database, feature extraction module and operator recommendation module.
  • one or more computing devices in a computing device cluster may be connected via a network.
  • the network may be a wide area network or a local area network, etc.
  • FIG. 9 shows a possible implementation. As shown in FIG. 9, two computing devices 100A and 100B are connected via a network. Specifically, each computing device connects to the network through its communication interface.
  • the memory 106 in the computing device 100A stores instructions for executing the functions of a computing engine and a database. At the same time, the memory 106 in the computing device 100B stores instructions for the functions of a feature extraction module and an operator recommendation module.
  • the functions of the computing device 100A shown in FIG. 9 may also be performed by multiple computing devices 100.
  • similarly, the functions of the computing device 100B may also be performed by multiple computing devices 100.
  • the embodiment of the present application also provides another computing device cluster.
  • the connection relationship between the computing devices in the computing device cluster can be similar to the connection mode of the computing device cluster described in Figures 8 and 9.
  • the difference is that the memory 106 in one or more computing devices 100 in the computing device cluster can store the same instructions for executing the control transmission control scheme.
  • the memory 106 of one or more computing devices 100 in the computing device cluster may also store partial instructions for performing data analysis.
  • the combination of one or more computing devices 100 may jointly execute instructions for performing data analysis.
  • the memory 106 in different computing devices 100 in the computing device cluster may store different instructions for executing part of the functions of the data analysis system. That is, the instructions stored in the memory 106 in different computing devices 100 may implement the functions of one or more modules among the computing engine, database, feature extraction module, and operator recommendation module.
  • the embodiment of the present application also provides a computer program product including instructions.
  • the computer program product may be software or a program product including instructions that can be run on a computing device or stored in any available medium.
  • when the computer program product runs on at least one computing device, the at least one computing device executes method 500.
  • the embodiment of the present application also provides a computer-readable storage medium.
  • the computer-readable storage medium can be any available medium that a computing device can store, or a data storage device, such as a data center, that contains one or more available media.
  • the available medium can be a magnetic medium (e.g., a floppy disk, a hard disk, a tape), an optical medium (e.g., a DVD), or a semiconductor medium (e.g., a solid-state hard disk).
  • the computer-readable storage medium includes instructions that instruct the computing device to execute method 500.
  • the disclosed systems, devices and methods can be implemented in other ways.
  • the device embodiments described above are only schematic.
  • the division of the units is only a logical function division. There may be other division methods in actual implementation, such as multiple units or components can be combined or integrated into another system, or some features can be ignored or not executed.
  • in addition, the mutual coupling, direct coupling, or communication connection shown or discussed may be an indirect coupling or communication connection through some interfaces, devices, or units, and may be electrical, mechanical, or in other forms.
  • the units described as separate components may or may not be physically separated, and the components shown as units may or may not be physical units, that is, they may be located in one place or distributed on multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
  • each functional unit in each embodiment of the present application may be integrated into one processing unit, or each unit may exist physically separately, or two or more units may be integrated into one unit.
  • if the functions are implemented in the form of software functional units and sold or used as independent products, they can be stored in a computer-readable storage medium.
  • the technical solution of the present application, in essence, or the part that contributes to the prior art, or a part of the technical solution, can be embodied in the form of a software product.
  • the computer software product is stored in a storage medium and includes several instructions for a computer device (which can be a personal computer, server, or network device, etc.) to perform all or part of the steps of the methods described in each embodiment of the present application.
  • the aforementioned storage media include: a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disk, and other media that can store program code.

Abstract

A data analysis system, comprising: a feature extraction module which comprises at least two layers of feature operators: a first layer of operators and a second layer of operators, wherein the first layer of operators extracts a target feature on the basis of metadata, when the target feature is not extracted by the first layer of operators, the second layer of operators extracts the target feature, the complexity of the second layer of operators is higher than that of the first layer of operators, the metadata comprises a data subset determined by a first user for a first data set, and the first data set comprises a set of data; and an operator recommendation module, used for determining a first operator set on the basis of the target feature, the first operator set being used for analyzing the first data set.

Description

A data analysis system, method and device
This application claims priority to Chinese patent application No. 202211654936.7, filed with the China National Intellectual Property Administration on December 22, 2022 and entitled "A data analysis system, method and device", the entire contents of which are incorporated herein by reference.
Technical Field
The present application relates to the field of information technology, and more specifically, to a data analysis system, method and device.
Background
With the development of information technology, the demand for databases keeps growing. There are currently many types of databases, such as relational databases and time series databases. Among them, the demand for time series databases has risen especially sharply.
The data stored in a time series database is usually called time series data, and time series analysis based on this data can uncover deeper value within it. Common time series analyses include time series anomaly detection, time series prediction, clustering, and association analysis. For time series data that is large in volume and complex and diverse in format, basic time series anomaly detection and prediction scenarios can be provided. For example, the cloud vendor AWS can provide low-computation intelligent analysis based on metadata, visualizing field attributes and data dimensions. As another example, with the high-performance RCF algorithm, a single algorithm can meet simple user needs. However, as the types of data become richer and databases grow, the complexity of algorithm recommendation based on the database rises markedly, and the efficiency of algorithm recommendation for time series data is low.
Therefore, how to recommend algorithms for complex databases, improve the efficiency of algorithm recommendation, and enhance the user experience has become a technical problem that urgently needs to be solved.
Summary
The present application provides a data analysis system, method and device that use a step feature extraction framework to extract features from the data a user is interested in. The framework can quickly extract the data features of a data source and then quickly recommend suitable analysis operators for that data source, improving both the efficiency of data algorithm recommendation in complex data scenarios and the user experience.
In a first aspect, a data analysis system is provided, comprising: a feature extraction module, configured to: receive a first data set determined by a user from a data system; determine, based on the first data set, metadata corresponding to data of a first data subset, the first data subset being the first data set or a subset of the first data set; and determine target features of the metadata based on at least two layers of feature operators, the two layers of feature operators comprising a first layer operator and a second layer operator, the first layer operator being configured to extract the target features based on the metadata, the second layer operator being configured to extract the target features based on the metadata when the first layer operator fails to extract them, and the complexity of the second layer operator being higher than that of the first layer operator; and an operator recommendation module, configured to determine a first operator set based on the target features, the first operator set being used for analyzing the first data set.
In the present application, the first data set is a data source selected by a user from a data system, for example, a data source of interest selected by the user. The first data set serves as the input of the feature extraction module.
In the present application, the first data set may be a first time series, which includes a set of data arranged in time order. That is, the first data set may include a set of time series data.
In the present application, the first data set may also include other data types, such as non-time series data, which is not limited in the embodiments of the present application.
In the above technical solution, the feature extraction module includes multiple layers of operators with different complexities. Calculation proceeds layer by layer, starting from the operators with the lowest complexity, until the data features are captured, and the operator recommendation module then recommends a suitable operator set. This step-by-step feature extraction framework improves the efficiency of data feature extraction and quickly recommends suitable analysis operators for the data source, improving both the efficiency of data algorithm recommendation in complex data scenarios and the user experience.
In a possible implementation, the first-layer operator is specifically configured to compute a first feature value from the metadata; when the first feature value satisfies a first condition, the first-layer operator has extracted the target feature. When the first feature value does not satisfy the first condition, the second-layer operator is specifically configured to compute a second feature value from the metadata; when the second feature value satisfies the first condition, the second-layer operator has extracted the target feature.
Based on the above technical solution, operator layers of different complexity compute feature values from the metadata, and whether a data feature has been extracted is judged by whether the feature value satisfies a preset condition. Because the preset condition can be set according to user needs, the method is more widely applicable.
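The layer-by-layer extraction described above can be illustrated with a minimal sketch. It assumes that the metadata is a plain list of numbers and that the first condition is a simple threshold; the names `extract_target_feature`, `value_range`, and `lag1_autocorrelation` are hypothetical and are not taken from the present application.

```python
from typing import Callable, Optional, Sequence, Tuple

# Hypothetical type: an operator maps metadata (a list of numbers) to a feature value.
Operator = Callable[[Sequence[float]], float]


def extract_target_feature(
    metadata: Sequence[float],
    layers: Sequence[Sequence[Operator]],
    condition: Callable[[float], bool],
) -> Optional[Tuple[int, float]]:
    """Run operator layers from lowest to highest complexity.

    Returns (layer_index, feature_value) for the first feature value that
    satisfies the condition, or None if no layer captures a feature.
    """
    for index, layer in enumerate(layers):
        for operator in layer:
            value = operator(metadata)
            if condition(value):
                return index, value  # stop early: costlier layers are skipped
    return None


def value_range(xs: Sequence[float]) -> float:
    """Cheap first-layer operator (assumed): spread between max and min."""
    return max(xs) - min(xs)


def lag1_autocorrelation(xs: Sequence[float]) -> float:
    """Costlier second-layer operator (assumed): lag-1 autocorrelation."""
    mean = sum(xs) / len(xs)
    denom = sum((x - mean) ** 2 for x in xs)
    if denom == 0:
        return 0.0
    num = sum((xs[i] - mean) * (xs[i + 1] - mean) for i in range(len(xs) - 1))
    return num / denom


# Layers ordered by increasing complexity; the same condition is applied to each.
layers = [[value_range], [lag1_autocorrelation]]
result = extract_target_feature([0, 1, 2, 3, 4], layers, lambda v: v > 0.5)
print(result)  # the first layer already satisfies the condition
```

In this sketch the second layer runs only when no first-layer feature value satisfies the condition, which is where the efficiency gain of the stepped framework comes from.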
在一种可能的实现方式中,算子推荐模块具体用于基于参考特征集合和所述目标特征确定所述第一算子集,所述参考特征为基于训练数据集用预置算子集计算得到的特征值集合,所述预置算子集包括所述两层算子的部分或全部算子。In one possible implementation, the operator recommendation module is specifically used to determine the first operator set based on a reference feature set and the target feature, wherein the reference feature is a set of feature values calculated based on a training data set using a preset operator set, and the preset operator set includes some or all of the operators of the two layers of operators.
基于上述技术方案,算子推荐模块基于预置的训练数据对应的特征集合确定目标特征的算子集,从而能够推荐更适合的算子集。Based on the above technical solution, the operator recommendation module determines the operator set of the target feature based on the feature set corresponding to the preset training data, so as to recommend a more suitable operator set.
在一种可能的实现方式中,算子推荐模块具体用于确定所述目标特征与所述参考特征集合中参考特征的相似度,当所述目标特征与第一参考特征的相似度大于相似度阈值时,确定所述第一参考特征对应的第一算子集为所述目标特征的算子集,所述第一算子集属于所述预置算子集,所述第一参考特征属于所述参考特征集合。In a possible implementation, the operator recommendation module is specifically used to determine the similarity between the target feature and the reference features in the reference feature set. When the similarity between the target feature and the first reference feature is greater than a similarity threshold, the first operator set corresponding to the first reference feature is determined to be the operator set of the target feature, the first operator set belongs to the preset operator set, and the first reference feature belongs to the reference feature set.
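As a hedged illustration of the similarity-based recommendation above, the following sketch assumes that features are numeric vectors and that cosine similarity is the similarity measure; the present application does not fix a particular measure. The function names and operator names are hypothetical.

```python
import math
from typing import List, Optional, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def recommend_operator_set(
    target: Sequence[float],
    references: Sequence[Tuple[Sequence[float], List[str]]],
    threshold: float,
) -> Optional[List[str]]:
    """Return the operator set of the most similar reference feature whose
    similarity to the target exceeds the threshold, or None if none does."""
    best_score, best_ops = threshold, None
    for ref_feature, ops in references:
        score = cosine_similarity(target, ref_feature)
        if score > best_score:
            best_score, best_ops = score, ops
    return best_ops


# Hypothetical reference feature set: (reference feature, operator set) pairs
# that would be computed in advance from a training data set.
references = [
    ([1.0, 0.0], ["moving_average", "zscore_anomaly"]),
    ([0.0, 1.0], ["arima_forecast"]),
]
print(recommend_operator_set([0.9, 0.1], references, threshold=0.8))
```

Choosing the best match above the threshold, rather than the first match, is a design choice of this sketch; either reading is compatible with the text.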
In a possible implementation, the system further includes an operator evaluation module, configured to evaluate the complexity of a custom operator, where the complexity of the custom operator is used to determine into which one of the at least two layers of feature operators in the feature extraction module the custom operator is built.
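One possible way to evaluate a custom operator and place it in a layer is sketched below. It assumes that complexity is approximated by measured running time on a sample and that each layer has an upper cost bound; both the metric and the bounds are assumptions, and `measure_cost` and `assign_layer` are hypothetical names.

```python
import bisect
import time
from typing import Callable, Sequence


def measure_cost(operator: Callable, sample, repeats: int = 5) -> float:
    """Approximate an operator's cost as its best wall-clock time (seconds)
    over several runs on a sample input."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        operator(sample)
        best = min(best, time.perf_counter() - start)
    return best


def assign_layer(cost: float, layer_upper_bounds: Sequence[float]) -> int:
    """Map a cost to a layer index.

    layer_upper_bounds must be sorted ascending; a cost above every bound
    falls into the last (most complex) layer, index len(layer_upper_bounds).
    """
    return bisect.bisect_left(layer_upper_bounds, cost)


# Assumed bounds: layer 0 up to 0.1 ms, layer 1 up to 10 ms, layer 2 beyond that.
bounds = [1e-4, 1e-2]
cost = measure_cost(sum, list(range(1000)))
print("custom operator goes to layer", assign_layer(cost, bounds))
```

Timing is only one proxy for complexity; an implementation could equally use operation counts or memory use, provided the result maps onto the layers in the same monotonic way.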
In a second aspect, a data analysis method in a cloud service system is provided. The method may, for example, be applied in a data analysis system architecture, or may be performed by a component (such as a chip or a circuit) of the cloud service system architecture, which is not limited here.
该方法包括:特征提取模块接收用户从数据系统确定的第一数据集;所述特征提取模块根据所述第一数据集,确定第一数据子集的数据对应的元数据,所述第一数据子集是所述第一数据集或所述第一数据集的子集;基于至少两层特征算子,确定所述元数据的目标特征,所述两层特征算子包括第一层算子和第二层算子,所述第一层算子用于基于所述元数据提取所述目标特征,所述第二层算子用于当所述第一层算子未提取到所述目标特征时基于所述元数据提取所述目标特征,所述第二层算子的复杂度高于所述第一层算子的复杂度;算子推荐模块基于所述目标特征确定第一算子集,所述第一算子集用于所述第一数据集的分析。The method includes: a feature extraction module receives a first data set determined by a user from a data system; the feature extraction module determines metadata corresponding to data of a first data subset based on the first data set, the first data subset being the first data set or a subset of the first data set; based on at least two layers of feature operators, a target feature of the metadata is determined, the two layers of feature operators include a first layer of operators and a second layer of operators, the first layer of operators is used to extract the target feature based on the metadata, the second layer of operators is used to extract the target feature based on the metadata when the first layer of operators fail to extract the target feature, the complexity of the second layer of operators is higher than the complexity of the first layer of operators; an operator recommendation module determines a first operator set based on the target feature, the first operator set is used for analyzing the first data set.
本申请中,第一数据集为用户从数据系统中选定的数据源,例如,用户选定的感兴趣的数据源。该第一数据集作为特征提取模块的输入。In the present application, the first data set is a data source selected by a user from a data system, for example, a data source of interest selected by the user. The first data set is used as an input of the feature extraction module.
本申请中,第一数据集可以是第一时间序列,第一时间序列包括按照时序排列的一组数据的集合。也即,第一数据集可以包括一组时序数据。In the present application, the first data set may be a first time series, and the first time series includes a set of data arranged in time sequence. That is, the first data set may include a set of time series data.
本申请中,第一数据集中可以包括其他数据类型,例如,非时序数据等,本申请实施例对此不作限定。In the present application, the first data set may include other data types, such as non-time series data, etc., which is not limited in the embodiments of the present application.
In a possible implementation, the first-layer operator is specifically configured to compute a first feature value from the metadata; when the first feature value satisfies a first condition, the first-layer operator has extracted the target feature. When the first feature value does not satisfy the first condition, the second-layer operator is specifically configured to compute a second feature value from the metadata; when the second feature value satisfies the first condition, the second-layer operator has extracted the target feature.
在一种可能的实现方式中,算子推荐模块基于参考特征集合和所述目标特征确定所述第一算子集,所述参考特征为基于训练数据集用预置算子集计算得到的特征值集合,所述预置算子集包括所述两层算子的部分或全部算子。In one possible implementation, the operator recommendation module determines the first operator set based on a reference feature set and the target feature, wherein the reference feature is a set of feature values calculated based on a training data set using a preset operator set, and the preset operator set includes some or all of the operators of the two layers of operators.
在一种可能的实现方式中,算子推荐模块确定所述目标特征与所述参考特征集合中参考特征的相似度,当所述目标特征与第一参考特征的相似度大于相似度阈值时,确定所述第一参考特征对应的第一算子集为所述目标特征的算子集,所述第一算子集属于所述预置算子集,所述第一参考特征属于所述参考特征集合。In one possible implementation, the operator recommendation module determines the similarity between the target feature and the reference features in the reference feature set. When the similarity between the target feature and the first reference feature is greater than a similarity threshold, the first operator set corresponding to the first reference feature is determined to be the operator set of the target feature, the first operator set belongs to the preset operator set, and the first reference feature belongs to the reference feature set.
In a possible implementation, the method further includes: an operator evaluation module evaluates the complexity of a custom operator, where the complexity of the custom operator is used to determine into which one of the at least two layers of feature operators in the feature extraction module the custom operator is built.
In a third aspect, a cloud service system is provided. The system includes at least one processor configured to execute a computer program or instructions stored in a memory, so as to perform the method in any possible implementation of the second aspect. Optionally, the system further includes a memory configured to store the computer program or instructions. Optionally, the system further includes a communication interface, through which the processor reads the computer program or instructions stored in the memory.
In a fourth aspect, the present application provides a processor, including an input circuit, an output circuit, and a processing circuit. The processing circuit is configured to receive a signal through the input circuit and transmit a signal through the output circuit, so that the processor performs the method in any possible implementation of the second aspect.
In a specific implementation, the processor may be one or more chips, the input circuit may be an input pin, the output circuit may be an output pin, and the processing circuit may be transistors, gate circuits, flip-flops, and various logic circuits. The input signal received by the input circuit may be, for example but not limited to, received and input by a transceiver; the signal output by the output circuit may be, for example but not limited to, output to a transmitter and transmitted by the transmitter. The input circuit and the output circuit may be the same circuit, used as the input circuit and the output circuit at different times. The embodiments of the present application do not limit the specific implementation of the processor and the various circuits.
对于处理器所涉及的发送和获取/接收等操作,如果没有特殊说明,或者,如果未与其在相关描述中的实际作用或者内在逻辑相抵触,则可以理解为处理器输出和接收、输入等操作,也可以理解为由射频电路和天线所进行的发送和接收操作,本申请对此不做限定。For the operations such as sending and acquiring/receiving involved in the processor, unless otherwise specified, or unless they conflict with their actual function or internal logic in the relevant description, they can be understood as operations such as processor output, reception, input, etc., or as sending and receiving operations performed by the radio frequency circuit and antenna, and this application does not limit this.
第五方面,提供了一种处理设备,包括处理器和存储器。该处理器用于读取存储器中存储的指令,并可通过收发器接收信号,通过发射器发射信号,以执行第二方面任一种可能实现方式中的方法。In a fifth aspect, a processing device is provided, comprising a processor and a memory. The processor is used to read instructions stored in the memory, and can receive signals through a transceiver and transmit signals through a transmitter to execute the method in any possible implementation of the second aspect.
可选地,所述处理器为一个或多个,所述存储器为一个或多个。Optionally, the number of the processors is one or more, and the number of the memories is one or more.
可选地,所述存储器可以与所述处理器集成在一起,或者所述存储器与处理器分离设置。Optionally, the memory may be integrated with the processor, or the memory may be provided separately from the processor.
在具体实现过程中,存储器可以为非瞬时性(non-transitory)存储器,例如只读存储器(read only memory,ROM),其可以与处理器集成在同一块芯片上,也可以分别设置在不同的芯片上,本申请实施例对存储器的类型以及存储器与处理器的设置方式不做限定。In the specific implementation process, the memory can be a non-transitory memory, such as a read-only memory (ROM), which can be integrated with the processor on the same chip or can be separately set on different chips. The embodiments of the present application do not limit the type of memory and the setting method of the memory and the processor.
应理解,相关的数据交互过程例如发送指示信息可以为从处理器输出指示信息的过程,接收能力信息可以为处理器接收输入能力信息的过程。具体地,处理器输出的数据可以输出给发射器,处理器接收的输入数据可以来自收发器。其中,发射器和收发器可以统称为收发器。It should be understood that the related data interaction process, such as sending indication information, can be a process of outputting indication information from the processor, and receiving capability information can be a process of receiving input capability information by the processor. Specifically, the data output by the processor can be output to the transmitter, and the input data received by the processor can come from the transceiver. Among them, the transmitter and the transceiver can be collectively referred to as a transceiver.
上述第五方面中的处理设备可以是一个或多个芯片。该处理设备中的处理器可以通过硬件来实现也可以通过软件来实现。当通过硬件实现时,该处理器可以是逻辑电路、集成电路等;当通过软件来实现时,该处理器可以是一个通用处理器,通过读取存储器中存储的软件代码来实现,该存储器可以集成在处理器中,可以位于该处理器之外,独立存在。The processing device in the fifth aspect may be one or more chips. The processor in the processing device may be implemented by hardware or software. When implemented by hardware, the processor may be a logic circuit, an integrated circuit, etc.; when implemented by software, the processor may be a general-purpose processor implemented by reading software code stored in a memory, which may be integrated in the processor or located outside the processor and exist independently.
第六方面,提供了一种芯片,该芯片获取指令并执行该指令来实现上述第二方面以及第二方面的任意一种实现方式中的方法。In a sixth aspect, a chip is provided, which obtains instructions and executes the instructions to implement the method in the above-mentioned second aspect and any one of the implementation methods of the second aspect.
可选地,作为一种实现方式,该芯片包括处理器与数据接口,该处理器通过该数据接口读取存储器上存储的指令,执行上述第二方面以及第二方面的任意一种实现方式中的方法。Optionally, as an implementation manner, the chip includes a processor and a data interface, and the processor reads instructions stored in the memory through the data interface to execute the method in the above-mentioned second aspect and any one of the implementation manners of the second aspect.
可选地,作为一种实现方式,该芯片还可以包括存储器,该存储器中存储有指令,该处理器用于执行该存储器上存储的指令,当该指令被执行时,该处理器用于执行第二方面以及第二方面中的任意一种实现方式中的方法。Optionally, as an implementation method, the chip may also include a memory, in which instructions are stored, and the processor is used to execute the instructions stored in the memory. When the instructions are executed, the processor is used to execute the method in the second aspect and any one of the implementation methods of the second aspect.
第七方面,提供了一种计算机程序产品,所述计算机程序产品包括:计算机程序代码,当所述计算机程序代码在计算机上运行时,使得计算机执行上述第二方面以及第二方面的任意一种实现方式中的方法。In a seventh aspect, a computer program product is provided, the computer program product comprising: a computer program code, when the computer program code is run on a computer, the computer executes the method in the above-mentioned second aspect and any one of the implementations of the second aspect.
第八方面,提供了一种计算机可读存储介质,包括指令;所述指令用于实现上述第二方面以及第二方面的任意一种实现方式中的方法。In an eighth aspect, a computer-readable storage medium is provided, comprising instructions; the instructions are used to implement the method in the above-mentioned second aspect and any one of the implementation manners of the second aspect.
By way of example, these computer-readable storage media include, but are not limited to, one or more of the following: read-only memory (ROM), programmable ROM (PROM), erasable PROM (EPROM), flash memory, electrically erasable PROM (EEPROM), and a hard drive.
可选地,作为一种实现方式,上述存储介质具体可以是非易失性存储介质。Optionally, as an implementation manner, the above-mentioned storage medium may specifically be a non-volatile storage medium.
第九方面,提供一种计算设备,该计算设备包括处理器和存储器,所述一个计算设备的处理器用于执行所述存储器中存储的指令,以使得所述计算设备执行上述第二方面任一种可能实现方法。In a ninth aspect, a computing device is provided, comprising a processor and a memory, wherein the processor of the computing device is used to execute instructions stored in the memory so that the computing device executes any possible implementation method of the second aspect.
第十方面,提供一种计算节点集群,该计算节点集群包括至少一个计算节点,每个计算节点包括处理器和存储器,所述至少一个计算节点的处理器用于执行所述至少一个计算节点的存储器中存储的指令,以使得所述计算节点集群执行上述第二方面任一种可能实现方法。In a tenth aspect, a computing node cluster is provided, which includes at least one computing node, each computing node includes a processor and a memory, and the processor of the at least one computing node is used to execute instructions stored in the memory of the at least one computing node, so that the computing node cluster executes any possible implementation method of the second aspect above.
BRIEF DESCRIPTION OF THE DRAWINGS
FIG. 1 is a schematic diagram of a system architecture provided in an embodiment of the present application.
FIG. 2 is a schematic block diagram of a data analysis system provided in an embodiment of the present application.
FIG. 3 is a schematic flowchart of an operator layering method provided in an embodiment of the present application.
FIG. 4 is a schematic structural diagram of a feature extraction module provided in an embodiment of the present application.
FIG. 5 is a schematic diagram of a data analysis method provided in an embodiment of the present application.
FIG. 6 is a flowchart of operator recommendation provided in an embodiment of the present application.
FIG. 7 is a schematic block diagram of a computing device 100 provided in the present application.
FIG. 8 is a schematic block diagram of a computing device cluster provided in the present application.
FIG. 9 is a schematic block diagram of another computing device cluster provided in the present application.
DETAILED DESCRIPTION
下面将结合附图,对本申请中的技术方案进行描述。The technical solution in this application will be described below in conjunction with the accompanying drawings.
本申请将围绕包括多个设备、组件、模块等的系统来呈现各个方面、实施例或特征。应当理解和明白的是,各个系统可以包括另外的设备、组件、模块等,并且/或者可以并不包括结合附图讨论的所有设备、组件、模块等。此外,还可以使用这些方案的组合。The present application will present various aspects, embodiments or features around a system including multiple devices, components, modules, etc. It should be understood and appreciated that each system may include additional devices, components, modules, etc., and/or may not include all devices, components, modules, etc. discussed in conjunction with the figures. In addition, combinations of these schemes may also be used.
另外,在本申请实施例中,“示例的”、“例如”等词用于表示作例子、例证或说明。本申请中被描述为“示例”的任何实施例或设计方案不应被解释为比其它实施例或设计方案更优选或更具优势。确切而言,使用示例的一词旨在以具体方式呈现概念。In addition, in the embodiments of the present application, words such as "exemplary" and "for example" are used to indicate examples, illustrations or descriptions. Any embodiment or design described as "exemplary" in the present application should not be interpreted as being more preferred or more advantageous than other embodiments or designs. Specifically, the use of the word "exemplary" is intended to present concepts in a concrete way.
In the embodiments of the present application, the terms "相应的" (corresponding, relevant) and "对应的" (corresponding) may sometimes be used interchangeably. It should be noted that, when the difference between them is not emphasized, they are intended to express the same meaning.
本申请实施例描述的业务场景是为了更加清楚地说明本申请实施例的技术方案,并不构成对于本申请实施例提供的技术方案的限定,本领域普通技术人员可知,随着网络架构的演变和新业务场景的出现,本申请实施例提供的技术方案对于类似的技术问题,同样适用。The business scenarios described in the embodiments of the present application are intended to more clearly illustrate the technical solutions of the embodiments of the present application, and do not constitute a limitation on the technical solutions provided in the embodiments of the present application. A person of ordinary skill in the art can appreciate that, with the evolution of network architecture and the emergence of new business scenarios, the technical solutions provided in the embodiments of the present application are also applicable to similar technical problems.
References in this specification to "one embodiment", "some embodiments", and the like mean that a particular feature, structure, or characteristic described in connection with that embodiment is included in one or more embodiments of the present application. Accordingly, the phrases "in one embodiment", "in some embodiments", "in some other embodiments", "in still other embodiments", and the like, appearing in different places in this specification, do not necessarily all refer to the same embodiment; rather, they mean "one or more but not all embodiments", unless specifically emphasized otherwise. The terms "including", "comprising", "having", and their variants all mean "including but not limited to", unless specifically emphasized otherwise.
本申请中,“至少一个”是指一个或者多个,“多个”是指两个或两个以上。“和/或”,描述关联对象的关联关系,表示可以存在三种关系,例如,A和/或B,可以表示:包括单独存在A,同时存在A和B,以及单独存在B的情况,其中A,B可以是单数或者复数。字符“/”一般表示前后关联对象是一种“或”的关系。“以下至少一项(个)”或其类似表达,是指的这些项中的任意组合,包括单项(个)或复数项(个)的任意组合。例如,a,b,或c中的至少一项(个),可以表示:a,b,c,a-b,a-c,b-c,或a-b-c,其中a,b,c可以是单个,也可以是多个。In this application, "at least one" means one or more, and "plurality" means two or more. "And/or" describes the association relationship of associated objects, indicating that three relationships may exist. For example, A and/or B can mean: including the existence of A alone, the existence of A and B at the same time, and the existence of B alone, where A and B can be singular or plural. The character "/" generally indicates that the previous and next associated objects are in an "or" relationship. "At least one of the following" or similar expressions refers to any combination of these items, including any combination of single or plural items. For example, at least one of a, b, or c can mean: a, b, c, a-b, a-c, b-c, or a-b-c, where a, b, c can be single or multiple.
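The enumeration in the preceding paragraph can be checked mechanically. The following sketch, using a hypothetical helper `at_least_one_of`, lists every non-empty combination of a, b, and c, that is, the 2^3 - 1 = 7 cases named in the text.

```python
from itertools import combinations


def at_least_one_of(items):
    """Return every non-empty combination of the given items, smallest first."""
    return [
        combo
        for size in range(1, len(items) + 1)
        for combo in combinations(items, size)
    ]


combos = at_least_one_of(["a", "b", "c"])
print(["-".join(c) for c in combos])
# prints ['a', 'b', 'c', 'a-b', 'a-c', 'b-c', 'a-b-c']
```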
FIG. 1 shows a schematic diagram of a system architecture provided in an embodiment of the present application. As shown in FIG. 1, a client can access a cloud management platform over the Internet. The cloud management platform is connected to the internal network of a data center. A data center typically contains multiple servers; the data center shown in the figure includes two. Taking server #1 as an example, it includes a software layer and a hardware layer. The software layer may in turn include multiple virtual machines and a host operating system, where the host operating system includes a virtual machine manager and a cloud management platform client. The hardware layer may include, for example, a processor, memory, a hard disk, a network interface card, and a data bus.
The cloud management platform provides an access interface (for example, a user interface or an application programming interface (API)). A tenant can operate the client to remotely connect to the access interface, register a cloud account and password on the cloud management platform, and log in. After the cloud management platform successfully authenticates the cloud account and password, the tenant can select and pay for a virtual machine of a specific specification (processor, memory, disk). After the purchase succeeds, the cloud management platform provides the remote login account and password of the purchased virtual machine, and the client can remotely log in to the virtual machine and install and run the tenant's applications in it. The cloud management platform client can receive control plane commands sent by the cloud management platform and, according to those commands, create virtual machines on the server and manage them throughout their life cycle. The tenant can therefore create, manage, log in to, and operate virtual machines in the cloud data center through the cloud management platform. A virtual machine may also be called a "cloud server (elastic compute service, ECS)", an "elastic instance", and so on; different cloud service providers use different names.
为便于理解本申请实施例,对本申请实施例中出现的一些术语进行说明。To facilitate understanding of the embodiments of the present application, some terms appearing in the embodiments of the present application are explained.
1. Operator set
A combination of operators. In some scenarios, a single operator cannot meet the requirements, and at least two operators need to be combined to compute on the data.
2. Time series database
A database that stores time series data. It can hold large-scale time series data and supports querying and compressing time series data, as well as basic data analysis in time series scenarios, such as aggregation, downsampling, and statistics.
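To make the terms aggregation and downsampling concrete, the following is a minimal sketch of downsampling by fixed time buckets. It assumes that the data points are (timestamp in seconds, value) pairs held in memory; it does not represent the interface of any particular time series database, and `downsample` is a hypothetical name.

```python
from collections import defaultdict
from statistics import mean


def downsample(points, bucket_seconds, aggregate=mean):
    """Group (timestamp, value) points into fixed-width time buckets and
    reduce each bucket to one value with the given aggregate function."""
    buckets = defaultdict(list)
    for timestamp, value in points:
        bucket_start = timestamp - timestamp % bucket_seconds
        buckets[bucket_start].append(value)
    return [(start, aggregate(values)) for start, values in sorted(buckets.items())]


points = [(0, 1.0), (30, 3.0), (60, 5.0), (90, 7.0), (120, 9.0)]
print(downsample(points, 60))       # one mean value per minute
print(downsample(points, 60, max))  # one maximum value per minute
```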
3. Time series analysis
Time series analysis refers to higher-order analysis methods for time series data, beyond the basic analysis provided by a time series database. It includes time series anomaly detection, time series forecasting, clustering, association analysis, and so on, and uses different analysis methods to mine deeper value from time series data. Common approaches include statistics-based methods, Bayesian-analysis-based methods, deep-learning-based methods, and machine-learning-based methods.
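As one example of a statistics-based method, the sketch below flags anomalies with a z-score test. The technique and the default threshold of 3.0 are illustrative assumptions, not something prescribed by the present application, and `zscore_anomalies` is a hypothetical name.

```python
from statistics import mean, pstdev


def zscore_anomalies(values, threshold=3.0):
    """Return the indices of values lying more than `threshold` population
    standard deviations away from the mean of the series."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return []  # a constant series has no outliers under this test
    return [i for i, v in enumerate(values) if abs(v - mu) / sigma > threshold]


series = [10.0] * 20 + [100.0]
print(zscore_anomalies(series))  # only the last point is flagged
```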
4. Feature extraction
特征提取(英语:Feature extraction)在机器学习、模式识别和图像处理中有很多的应用。特征提取是从一个初始测量的资料集合中开始做,然后建构出富含资讯性而且不冗余的导出值,称为特征值(feature)。它可以帮助接续的学习过程和归纳的步骤,在某些情况下可以让人更容易对资料做出较好的诠释。特征提取是一个降低维度的步骤,初始的资料集合被降到更容易管理的族群(特征)以便于学习,同时保持描述原始资料集的精准性与完整性。Feature extraction has many applications in machine learning, pattern recognition and image processing. Feature extraction starts from an initial measured data set and then constructs informative and non-redundant derived values, called features. It can help subsequent learning and inductive steps, and in some cases make it easier for people to make better interpretations of the data. Feature extraction is a dimensionality reduction step, where the initial data set is reduced to more manageable groups (features) for learning, while maintaining the accuracy and completeness of the description of the original data set.
5. Recommendation system
推荐系统(RS)主要是指应用协同智能(collaborative intelligence)做推荐的技术。个性化推荐系统能够有效的解决信息过载问题,推荐系统根据用户的历史偏好和约束为用户提供排序的个性化物品(item)推荐列表,更精准的推荐系统可以提升和改善用户体验。通常可以根据用户偏好、商品特征、用户-商品交易和其他环境因素(如时间、季节、位置等)生成推荐结果。所推荐的物品可以包括电影、书籍、餐厅、新闻条目等等。Recommendation system (RS) mainly refers to the technology of using collaborative intelligence for recommendation. Personalized recommendation system can effectively solve the problem of information overload. Recommendation system provides users with a sorted personalized list of item recommendations based on their historical preferences and constraints. A more accurate recommendation system can enhance and improve the user experience. Recommendation results can usually be generated based on user preferences, product features, user-product transactions, and other environmental factors (such as time, season, location, etc.). Recommended items can include movies, books, restaurants, news items, etc.
6. Data features
In the present application, data features include features of the data itself and/or extracted features.

Features of the data itself are intrinsic features of the data in a time series, for example the data arrangement period, the data change trend, or the data fluctuation. Correspondingly, the data describing these features includes data arrangement period data, data change trend data, or data fluctuation data. The data arrangement period is the period involved when the data in a time series is arranged periodically; for example, the data arrangement period data includes the period duration (that is, the time interval between the starts of two periods) and/or the number of periods. The data change trend data reflects the trend in which the data in the time series changes, for example: continuous growth, continuous decline, a rise followed by a fall, a fall followed by a rise, or conformance to a normal distribution. The data fluctuation data reflects the fluctuation state of the data in the time series; for example, it includes a function characterizing the fluctuation curve of the time series, or a specified value of the time series, such as the maximum, minimum, or average.

Extracted features are features obtained in the course of extracting the data in the time series, for example statistical features, fitting features, or frequency domain features. Correspondingly, the data describing these features includes statistical feature data, fitting feature data, or frequency domain feature data.

Statistical features are the statistical characteristics of a time series. They are divided into quantitative features and attribute features, and quantitative features are further divided into measurement features and counting features. Quantitative features can be expressed directly as numerical values: for example, the consumption of resources such as CPU, memory, and IO is a measurement feature, while the number of anomalies and the number of normally working devices are counting features. Attribute features cannot be expressed directly as numerical values, for example whether a device is abnormal or whether a device has gone down. The features among the statistical features are the indicators to be examined when statistics are computed. For example, statistical feature data includes the moving average (Moving_average), the weighted average (Weighted_mv), and so on.

Fitting features are features of the time series during fitting, and fitting feature data reflects the features of the time series that are used for fitting; for example, fitting feature data includes the algorithm used for fitting, such as ARIMA.

Frequency domain features are features of the time series in the frequency domain. For example, frequency domain feature data includes data on the pattern that the time series follows in its frequency domain distribution, such as the proportion of high-frequency components in the time series. Optionally, frequency domain feature data can be obtained by performing wavelet decomposition on the time series.
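A few of the data features listed above can be computed as in the following sketch. The text names a moving average (Moving_average) and a weighted average (Weighted_mv) but does not define them, so the definitions here are assumptions; the trend labels cover only two of the trends mentioned, and all function names are hypothetical.

```python
def moving_average(values, window):
    """Simple moving average over a sliding window of fixed size."""
    return [
        sum(values[i:i + window]) / window
        for i in range(len(values) - window + 1)
    ]


def weighted_moving_average(values, weights):
    """Moving average in which each position in the window has its own weight;
    the last weight applies to the most recent value in the window."""
    window, total = len(weights), sum(weights)
    return [
        sum(v * w for v, w in zip(values[i:i + window], weights)) / total
        for i in range(len(values) - window + 1)
    ]


def trend_label(values):
    """Classify the change trend of a series from its successive differences."""
    if len(values) < 2:
        return "mixed"
    diffs = [b - a for a, b in zip(values, values[1:])]
    if all(d > 0 for d in diffs):
        return "continuous growth"
    if all(d < 0 for d in diffs):
        return "continuous decline"
    return "mixed"


series = [2, 4, 6, 8]
print(moving_average(series, 2))
print(weighted_moving_average(series, [1, 1, 2]))
print(trend_label(series))
```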
In the existing technology, basic time series anomaly detection and prediction scenarios can be provided for time series data with large volumes and complex, diverse formats. For example, the cloud vendor AWS can provide low-computation visualization and intelligent analysis capabilities over field attributes and data dimensions based on metadata. As another example, a single high-performance algorithm such as RCF can meet simple user needs. As a further example, academia would like to take both the metadata and the data itself into account, but the amount of computation then grows linearly with the number of preset algorithms and the amount of data, which makes such approaches difficult to put into practice. Therefore, as data types become richer and databases grow larger, the complexity of database-based algorithm recommendation rises markedly, and the efficiency of algorithm recommendation for time series data is low.
In view of this, the present application proposes a data analysis system for complex databases, which can improve the efficiency of algorithm recommendation and enhance user experience.
FIG. 2 is a schematic diagram of a data analysis system 200 proposed in an embodiment of the present application.
As shown in FIG. 2, the system includes four parts: a front end, a computing engine, a database, and a recommendation engine.
The front end is used by the user to issue operation commands and to display operation results.
For example, the user can select a data source of interest on the UI front end, and the front end can display the recommendation page that the back end generates for the selected data source, in which a recommended operator set is generated based on an automatically defined analysis task. As another example, the user can enter custom operators on the front end to extend the feature operators beyond the preset operators.
The computing engine is used to evaluate custom operators. The computing engine includes an evaluation module, which evaluates the complexity of a custom operator and writes the evaluation result to the database.
The database is used to store the operator library and the data input through the front end. For example, a custom operator that has been evaluated by the computing engine can be added to the operator library. The operator library includes multiple layers of operators of different complexity, and operators of different complexity are used to capture data features. For the data input through the front end, the system automatically loads the corresponding metadata and a partial data sample.
The recommendation engine is used to recommend operators based on the metadata and the data sample. The recommendation engine includes a feature extraction module and an operator recommendation module. The feature extraction module computes on the data to obtain the feature values corresponding to different feature operators. The operator recommendation module determines the operator set corresponding to the features obtained by the feature extraction module, based on the feature values computed from the training data set with the preset operator set, thereby recommending an operator set.
In the present application, the database stores training data and a preset operator set. The training data can be understood as a data set determined by the user according to the importance of historical application requirements. The training data is labeled, that is, the operators corresponding to the training data are known. The preset operator set is a set of feature operators determined by the user based on historical computations.
It should be noted that, in the present application, the feature extraction module uses feature operators to compute on the data and obtain feature values. The feature operators can be organized into layers.
The layering process is shown in FIG. 3.
Step a: compute the labeled training data sets with all preset operator sets to obtain the feature values of each training data set under the corresponding operator set, that is, the relationship matrix between operators and feature values.
Step b: obtain the relationship matrix between operators through convolution between the operators.
Step c: obtain the operator layering by weighting the relationships between operators together with the performance of the operators.
It can be understood that operator layering divides the preset operators into different layers by jointly considering the associations between operators and their performance, and operators in different layers have different complexity.
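The three layering steps can be sketched as follows, under strong assumptions. The text does not specify how the convolution between operators is computed or how the relationship and performance terms are weighted, so this sketch substitutes the absolute Pearson correlation between the operators' feature-value columns for step b and a plain weighted sum for step c. The names `pearson`, `layer_operators`, `costs`, and `relation_weight` are hypothetical.

```python
from statistics import mean, pstdev


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    sx, sy = pstdev(x), pstdev(y)
    if sx == 0 or sy == 0:
        return 0.0
    mx, my = mean(x), mean(y)
    cov = mean([(a - mx) * (b - my) for a, b in zip(x, y)])
    return cov / (sx * sy)


def layer_operators(feature_matrix, costs, num_layers=3, relation_weight=0.5):
    """Return a layer index per operator (0 = lightest layer).

    feature_matrix[d][k]: feature value of operator k on training data set d (step a).
    costs[k]: assumed cost measurement of operator k, for example seconds per run.
    """
    n_ops = len(costs)
    columns = [[row[k] for row in feature_matrix] for k in range(n_ops)]
    # Step b (stand-in): relationship matrix between operators via |correlation|.
    relation = [[abs(pearson(columns[i], columns[j])) for j in range(n_ops)]
                for i in range(n_ops)]
    max_cost = max(costs) or 1.0
    scores = []
    for k in range(n_ops):
        others = [relation[k][j] for j in range(n_ops) if j != k]
        mean_relation = mean(others) if others else 0.0
        # Step c (assumed form): weighted sum of relationship and normalized cost.
        scores.append(relation_weight * mean_relation
                      + (1.0 - relation_weight) * costs[k] / max_cost)
    # Rank operators by score and cut the ranking into num_layers groups.
    order = sorted(range(n_ops), key=lambda k: scores[k])
    layers = [0] * n_ops
    for rank, k in enumerate(order):
        layers[k] = rank * num_layers // n_ops
    return layers
```

With `relation_weight=0.0` the layering reduces to ranking operators by cost alone, which is a convenient sanity check. Because the layering runs offline on the training data, its cost does not affect online feature extraction.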
For example, FIG. 4 is a schematic structural diagram of a feature extraction module proposed in an embodiment of the present application.
FIG. 4 shows an operator layering structure in which the first-layer operators are ultra-lightweight operators, the second-layer operators are lightweight operators, and the third-layer operators are heavy operators. The complexity of the ultra-lightweight operators is lower than that of the lightweight operators, and the complexity of the lightweight operators is lower than that of the heavy operators.
It should be understood that the association between the operators above means that data that a lower-layer operator cannot handle can be handled by an upper-layer operator. That is, if the ultra-lightweight operators cannot handle the data, the lightweight operators can handle it; if the lightweight operators cannot handle it, the heavy operators can handle it.
The three-layer operator structure shown in FIG. 4 is only an example. In practice, the operator layering may include at least two layers of feature operators. For example, the feature extraction module includes first-layer operators and second-layer operators; as another example, the feature extraction module includes first-layer, second-layer, third-layer, and fourth-layer operators. The embodiments of the present application do not limit this.
It should be understood that, after the operators are layered, data that the lower-layer operators can handle does not need to be handled by the upper-layer operators, that is, no further computation is performed. Because most data is stable and its features can be extracted by the lower-layer operators, the efficiency of data feature extraction can be significantly improved.
It should be understood that the result of operator layering can be configured on the analysis platform in advance and does not affect online performance.
It should be noted that, in the present application, the user can add custom operators to the original preset operator set. A custom operator needs to be computed once on the training data set, and then its relationship matrix with the other operators is computed. Because the built-in operators have already been grouped into clusters, the relationship matrix does not need to be computed against every built-in operator one by one; it only needs to be computed against the clusters.
It should be noted that the data analysis system proposed in the present application can be used to analyze time series data and can also be used to analyze other types of data. The embodiments of the present application do not limit this.
The data analysis method provided by the system architecture of FIG. 2 is described in detail below with reference to FIG. 5.
FIG. 5 is a flowchart of a data analysis method 500 proposed in an embodiment of the present application. The method includes steps S510 to S540.
S510: The feature extraction module receives a first data set determined by the user from a data system.
The first data set is a data source selected by the user from the data system, for example, a data source of interest to the user.
The first data set serves as the input of the feature extraction module.
In the present application, the data system may be a database or the like.
In the present application, the first data set may be a first time series, which is a set of data arranged in time order. That is, the first data set may include a set of time series data.
In the present application, the first data set is described by taking the first time series as an example.
It can be understood that the first data set may also include other data types, such as non-time-series data. The embodiments of the present application do not limit this.
S520: The feature extraction module determines, according to the first data set, the metadata corresponding to the data of a first data subset.
The first data subset may be the first data set itself or a subset of the first data set.
In the present application, the feature extraction module may determine the metadata corresponding to the data of the first data set according to the first data set. That is, after the user selects a data source of interest, the system can automatically load the metadata corresponding to that data source.
Alternatively, the feature extraction module may determine the metadata corresponding to a subset of the first data set according to that subset. That is, after the user selects a data source of interest, the system can automatically load sampled data from the data source together with the metadata corresponding to the sampled data.
It should be understood that, for time series data with a large volume and a complex format, the recommended operators will be biased if only the metadata is analyzed. Considering the metadata together with a partial data sample reduces the amount of computation compared with considering all the data, and improves the quality of the recommended operators compared with considering only the metadata.
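A minimal sketch of loading metadata together with a partial data sample is shown below. It assumes that the selected data source is available as a list of records with uniform fields; the function name `load_metadata_and_sample`, the metadata layout, and the use of uniform random sampling are illustrative assumptions, since the text does not specify the sampling strategy.

```python
import random


def load_metadata_and_sample(rows, sample_size, seed=0):
    """Return (metadata, sample) for a list of dict records.

    The metadata here is just the row count and the field names with their
    Python type names (an assumed, simplified notion of metadata).
    """
    fields = {name: type(value).__name__ for name, value in rows[0].items()} if rows else {}
    metadata = {"row_count": len(rows), "fields": fields}
    rng = random.Random(seed)
    k = min(sample_size, len(rows))
    # Sample row indices, then sort them so the sample keeps the original time order.
    indices = sorted(rng.sample(range(len(rows)), k))
    return metadata, [rows[i] for i in indices]
```

Sorting the sampled indices keeps the sample in time order, which matters for operators such as a moving average that depend on the ordering of the series.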
S530: The feature extraction module determines the target feature of the metadata based on at least two layers of feature operators.
In the present application, the target feature includes the data features of the first time series.
Exemplarily, the target feature may include a target feature vector.
The feature extraction module computes on the input metadata to determine the target feature.
In the present application, the feature extraction module includes at least two layers of feature operators, namely first-layer operators and second-layer operators. The first-layer operators are used to extract the target feature based on the metadata. When the first-layer operators fail to extract the target feature, the second-layer operators extract the target feature based on the metadata. The complexity of the second-layer operators is higher than that of the first-layer operators.
It should be understood that the feature extraction module may further include third-layer operators, fourth-layer operators, and so on. The embodiments of the present application take the first-layer and second-layer operators among the at least two layers as an example and do not limit this.
The process by which the multi-layer feature operators in the feature extraction module extract the target feature is described in detail below.
After the metadata is input into the feature extraction module, it first passes through the first-layer operators, whose computation yields a first feature value. If the first feature value satisfies a first condition, the first-layer operators have extracted the target feature and the feature extraction process is complete. If the first feature value does not satisfy the first condition, the metadata enters the second-layer operators, whose computation yields a second feature value. If the second feature value satisfies the first condition, the second-layer operators have extracted the target feature and the feature extraction process is complete.
It can be understood that, if the second feature value does not satisfy the first condition, the third-layer operators compute a third feature value, and whether the third feature value satisfies the first condition determines whether the target feature has been extracted. This continues layer by layer until the first condition is satisfied and the feature extraction process ends.
The feature value may specifically be a feature value vector that includes multiple feature values. The feature value vector is defined as follows:
$$C = (c_1 \quad c_2 \quad c_3 \quad \cdots \quad c_n)$$
Here $C$ is the feature value vector and $n$ is the number of feature operators; each feature operator corresponds to one feature value. The feature value corresponding to a feature operator that has not been computed is 0. For example, when the first feature value vector satisfies the first condition, the second-layer operators do not need to be computed, and the feature values corresponding to the feature operators in the second layer are 0.
In the present application, extracting the target feature can be understood as the feature value computed by the feature operators satisfying the first condition.
The first condition may be a threshold.
Exemplarily, when all feature values in the feature value vector are greater than the threshold, the first condition is considered satisfied and the data feature is considered captured.
The first condition may also be an interval.
Exemplarily, when all feature values in the feature value vector fall within the interval, the first condition is considered satisfied and the data feature is considered captured.
The first condition may be any preset condition. The embodiments of the present application do not limit this.
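The two example forms of the first condition can be sketched as small predicate factories. The names `all_above` and `all_within` are hypothetical, and treating the interval as closed and an empty feature value vector as not satisfying the condition are assumptions of this sketch.

```python
def all_above(threshold):
    """First condition as a threshold: every feature value exceeds it."""
    def check(values):
        return bool(values) and all(v > threshold for v in values)
    return check


def all_within(low, high):
    """First condition as an interval: every feature value lies in [low, high]."""
    def check(values):
        return bool(values) and all(low <= v <= high for v in values)
    return check
```

Expressing the condition as a callable makes it straightforward to substitute any other preset condition without changing the extraction logic that consumes it.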
It should be understood that the first-layer operators form the operator layer with the lowest complexity, the second-layer operators form an operator layer with higher complexity than the first layer, and, if there is a third or fourth layer, the complexity increases layer by layer. Feature extraction starts from the operator layer with the lowest complexity. If a lower-layer operator captures the target feature, feature extraction ends and no further computation is needed; otherwise, computation proceeds upward layer by layer until the target feature is captured.
Taking the feature extraction module shown in FIG. 4 as an example, the following describes in detail how the stepwise feature extraction framework determines the target feature vector.
The metadata is first computed by the ultra-lightweight operators in the module, and it is evaluated whether the data features can be captured. If the target feature is captured, the feature extraction process is complete and the feature value vector is returned. If the target feature is not captured, computation continues with the lightweight operators; if they capture the target feature, the feature extraction process is complete and the feature value vector is returned. If the target feature is still not captured, computation continues with the heavy operators to capture the target feature.
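The stepwise extraction described above can be sketched as a cascade over operator layers. This is an illustration under stated assumptions rather than the implementation of the present application: operators are modeled as callables that map the input data to one number, the first condition is a callable applied to the feature values of the layer just computed, and the names `extract_features`, `layers`, and `first_condition` are hypothetical.

```python
def extract_features(data, layers, first_condition):
    """Run operator layers from lightest to heaviest until the condition holds.

    layers: list of layers, each a list of callables data -> float.
    Returns (feature_vector, layer_index). The vector has one slot per operator
    across all layers; slots of operators that were never run stay 0.0.
    layer_index is None if no layer satisfied the condition.
    """
    total = sum(len(layer) for layer in layers)
    vector = [0.0] * total
    offset = 0
    for index, layer in enumerate(layers):
        values = [op(data) for op in layer]
        vector[offset:offset + len(layer)] = values
        if first_condition(values):
            # Target feature captured: heavier layers are skipped entirely.
            return vector, index
        offset += len(layer)
    return vector, None
```

Because heavier layers run only when every lighter layer has failed the condition, the cost for most inputs is bounded by the cost of the lightest layer.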
S540: The operator recommendation module determines a first operator set based on the target feature, where the first operator set is used to perform data analysis on the first data set.
The operator recommendation module is specifically configured to determine the first operator set based on a reference feature set and the target feature.
The reference feature set is the set of feature values computed from the training data sets with the preset operator sets.
FIG. 6 is a flowchart of operator recommendation proposed in an embodiment of the present application. The operator recommendation process is described in detail below with reference to FIG. 6.
Step a: compute all training data sets with all preset operator sets T to obtain the feature values corresponding to each operator set. The feature values corresponding to all operator sets constitute the reference feature set.
In the present application, the preset operator sets are denoted $T$, where $m$ is the number of operator sets:
$$T = (t_1 \quad t_2 \quad t_3 \quad \cdots \quad t_m)$$
The preset operator sets include some or all of the operators in the at least two layers of operators in the feature extraction module.
It should be understood that the data in the training data sets is labeled, that is, each training data set has a corresponding operator set, so a mapping matrix between operator sets and feature values can be constructed.
Here $R$ is the mapping matrix between operator sets and feature values, and
$$R_i = (r_{c_1 t_i} \quad r_{c_2 t_i} \quad r_{c_3 t_i} \quad \cdots \quad r_{c_n t_i})$$
where $R_i$ is the feature vector corresponding to operator set $i$.
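One way to build the reference feature vectors from labeled training data is sketched below. The text does not state how the feature values of several training data sets that share the same operator set label are combined, so the component-wise mean used here is an assumption, as are the names `build_reference_features`, `training_sets`, and `feature_operators`.

```python
from collections import defaultdict


def build_reference_features(training_sets, feature_operators):
    """Return {operator_set_label: reference feature vector}.

    training_sets: iterable of (data, operator_set_label) pairs.
    feature_operators: list of callables data -> float, one per feature value.
    """
    grouped = defaultdict(list)
    for data, label in training_sets:
        grouped[label].append([op(data) for op in feature_operators])
    reference = {}
    for label, vectors in grouped.items():
        # Assumed aggregation: component-wise mean over data sets with the same label.
        reference[label] = [sum(column) / len(vectors) for column in zip(*vectors)]
    return reference
```

Each entry of the returned mapping plays the role of one feature vector per operator set. This step runs once over the training data and its result can be stored for use at recommendation time.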
Step b: determine the operator recommendation result based on the mapping matrix between operator sets and feature values, in combination with the metadata.
Specifically, the operator recommendation module can determine the similarity between the target feature and the reference features in the reference feature set. When the similarity between the target feature and a first reference feature is greater than a similarity threshold, the first operator set corresponding to the first reference feature is determined to be the operator set for the target feature. The first operator set belongs to the preset operator sets, and the first reference feature belongs to the reference feature set.
It should be understood that the similarity between the target feature and the first reference feature is negatively correlated with the distance between them. That is, the greater the similarity of two features, the smaller their distance; the smaller the similarity, the greater the distance.
The features in the present application may take the form of feature vectors, and the following description uses feature vectors as an example.
That is, the distance between the target feature vector and the first reference feature vector can be determined first, and the similarity can then be determined from that distance.
The distance between the target feature vector and the first reference feature vector can be obtained in a variety of ways. The embodiments of the present application do not limit this.
In a possible implementation, after stepwise feature extraction is performed on the newly selected data (metadata), a target feature vector $C_u$ is obtained:
$$U_i = DT(R_i, C_u)$$
where $U_i$ is the distance between the target feature vector $C_u$ and the reference feature vector $R_i$.
In one optional implementation, the first operator set corresponding to the reference feature vector with the shortest distance may be determined to be the operator set for the target feature vector.
In another optional implementation, a similarity threshold is preset; when the similarity reaches the threshold, the first operator set corresponding to the first reference feature vector is considered to be the operator set for the target feature vector.
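Both optional implementations can be sketched in one function. The distance function DT is not specified in the text, so this sketch assumes the Euclidean distance, and it assumes the mapping similarity = 1 / (1 + distance) as one simple choice that is negatively correlated with distance. The name `recommend_operator_set` and its parameters are hypothetical.

```python
import math


def recommend_operator_set(target, reference, similarity_threshold=None):
    """Return the label of the closest reference vector, or None.

    target: target feature vector.
    reference: {operator_set_label: reference feature vector}.
    similarity_threshold: if given, the best match must exceed it.
    """
    best_label, best_distance = None, math.inf
    for label, vector in reference.items():
        distance = math.dist(vector, target)  # assumed DT: Euclidean distance
        if distance < best_distance:
            best_label, best_distance = label, distance
    if best_label is None:
        return None
    similarity = 1.0 / (1.0 + best_distance)  # assumed similarity mapping
    if similarity_threshold is not None and similarity <= similarity_threshold:
        return None
    return best_label
```

Calling the function without a threshold gives the shortest-distance variant; passing a threshold gives the similarity-threshold variant, which may return no recommendation when nothing is close enough.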
In the present application, the feature extraction module and the operator recommendation module can each be implemented in software or in hardware. As an example, the implementation of the feature extraction module is described next; the operator recommendation module can be implemented in a similar way.
In the present application, taking a module as an example of a software functional unit, the feature extraction module and the operator recommendation module may include code running on a computing instance. The computing instance may be at least one of a physical host (computing device), a virtual machine, or a container, and there may be one or more such computing instances. For example, the feature extraction module may include code running on multiple hosts, virtual machines, or containers. It should be noted that the multiple hosts, virtual machines, or containers used to run the code may be distributed in the same region or in different regions. They may also be distributed in the same availability zone (AZ) or in different AZs, where each AZ includes one data center or multiple geographically close data centers. A region usually includes multiple AZs.
Similarly, the multiple hosts, virtual machines, or containers used to run the code may be distributed in the same virtual private cloud (VPC) or in multiple VPCs. A VPC is usually set up within one region. For communication between two VPCs in the same region, and for cross-region communication between VPCs in different regions, a communication gateway needs to be set up in each VPC, and the VPCs are interconnected through the communication gateways.
Taking a module as an example of a hardware functional unit, the feature extraction module may include at least one computing device, such as a server. Alternatively, the feature extraction module may be a device implemented with an application-specific integrated circuit (ASIC) or a programmable logic device (PLD). The PLD may be a complex programmable logic device (CPLD), a field-programmable gate array (FPGA), generic array logic (GAL), or any combination thereof.
The multiple computing devices included in the feature extraction module may be distributed in the same region or in different regions, in the same AZ or in different AZs, and in the same VPC or in multiple VPCs. The multiple computing devices may be any combination of computing devices such as servers, ASICs, PLDs, CPLDs, FPGAs, and GALs.
The present application also provides a computing device 100. As shown in FIG. 7, the computing device 100 includes a bus 102, a processor 104, a memory 106, and a communication interface 108. The processor 104, the memory 106, and the communication interface 108 communicate with one another through the bus 102. The computing device 100 may be a server or a terminal device. It should be understood that the present application does not limit the number of processors or memories in the computing device 100.
The bus 102 may be a peripheral component interconnect (PCI) bus, an extended industry standard architecture (EISA) bus, or the like. Buses can be divided into address buses, data buses, control buses, and so on. For ease of representation, the bus is drawn as a single line in FIG. 8, but this does not mean that there is only one bus or one type of bus. The bus 102 may include a path for transferring information between the components of the computing device 100 (for example, the memory 106, the processor 104, and the communication interface 108).
The processor 104 may include any one or more of a central processing unit (CPU), a graphics processing unit (GPU), a microprocessor (MP), a digital signal processor (DSP), or other processors.
The memory 106 may include volatile memory, such as random access memory (RAM). The memory 106 may also include non-volatile memory, such as read-only memory (ROM), flash memory, a hard disk drive (HDD), or a solid state drive (SSD).
The memory 106 stores executable program code, and the processor 104 executes the executable program code to implement the functions of the aforementioned computing engine, database, feature extraction module, and operator recommendation module, respectively, thereby implementing the data analysis method. That is, the memory 106 stores instructions for performing data analysis.
The communication interface 108 uses a transceiver module, such as but not limited to a network interface card or a transceiver, to implement communication between the computing device 100 and other devices or a communication network.
本申请实施例还提供了一种计算设备集群。该计算设备集群包括至少一台计算设备。该计算设备可以是服务器,例如是中心服务器、边缘服务器,或者是本地数据中心中的本地服务器。在一些实施例中,计算设备也可以是台式机、笔记本电脑或者智能手机等终端设备。The embodiment of the present application also provides a computing device cluster. The computing device cluster includes at least one computing device. The computing device can be a server, such as a central server, an edge server, or a local server in a local data center. In some embodiments, the computing device can also be a terminal device such as a desktop computer, a laptop computer, or a smart phone.
如图8所示,所述计算设备集群包括至少一个计算设备100。计算设备集群中的一个或多个计算设备100中的存储器106中可以存有相同的用于执行数据分析的指令。As shown in Fig. 8, the computing device cluster includes at least one computing device 100. The memory 106 in one or more computing devices 100 in the computing device cluster may store the same instructions for performing data analysis.
在一些可能的实现方式中,该计算设备集群中的一个或多个计算设备100的存储器106中也可以分别存有用于执行数据分析的部分指令。换言之,一个或多个计算设备100的组合可以共同执行用于执行数据分析的指令。In some possible implementations, the memory 106 of one or more computing devices 100 in the computing device cluster may also store partial instructions for performing data analysis. In other words, the combination of one or more computing devices 100 may jointly execute instructions for performing data analysis.
需要说明的是,计算设备集群中的不同的计算设备100中的存储器106可以存储不同的指令,可以实现计算引擎、数据库、特征提取模块以及算子推荐模块中的一个或多个模块的功能。It should be noted that the memory 106 in different computing devices 100 in the computing device cluster can store different instructions, and can implement the functions of one or more modules among the computing engine, database, feature extraction module and operator recommendation module.
In some possible implementations, one or more computing devices in the computing device cluster may be connected via a network. The network may be a wide area network, a local area network, or the like. FIG. 9 shows one possible implementation. As shown in FIG. 9, two computing devices 100A and 100B are connected via a network. Specifically, each computing device connects to the network through its communication interface. In this type of implementation, the memory 106 in the computing device 100A stores instructions for performing the functions of the computing engine and the database, while the memory 106 in the computing device 100B stores instructions for performing the functions of the feature extraction module and the operator recommendation module.
It should be understood that the functions of the computing device 100A shown in FIG. 9 may also be performed by multiple computing devices 100. Similarly, the functions of the computing device 100B may also be performed by multiple computing devices 100.
An embodiment of the present application further provides another computing device cluster. For the connection relationship between the computing devices in this cluster, refer to the connection manner of the computing device cluster described with reference to FIG. 8 and FIG. 9. The difference is that the memories 106 of one or more computing devices 100 in this computing device cluster may store the same instructions for executing the control transmission control scheme.
In some possible implementations, the memories 106 of one or more computing devices 100 in the computing device cluster may each store part of the instructions for performing data analysis. In other words, a combination of one or more computing devices 100 may jointly execute the instructions for performing data analysis.
It should be noted that the memories 106 of different computing devices 100 in the computing device cluster may store different instructions for performing part of the functions of the data analysis system. That is, the instructions stored in the memories 106 of different computing devices 100 may implement the functions of one or more of the computing engine, the database, the feature extraction module, and the operator recommendation module.
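The split of modules across devices described above can be pictured with a small sketch. This is only an illustration under stated assumptions: the dictionary layout, the module identifiers, and the helper names `missing_modules` and `devices_for` are hypothetical and do not come from the application. The device labels 100A and 100B mirror the FIG. 9 example.

```python
# Hypothetical module identifiers for the four modules named in the description.
MODULES = {
    "computing_engine",
    "database",
    "feature_extraction",
    "operator_recommendation",
}

# Hypothetical deployment map mirroring FIG. 9: each device's memory holds
# the instructions for a subset of the modules.
deployment = {
    "100A": {"computing_engine", "database"},
    "100B": {"feature_extraction", "operator_recommendation"},
}


def missing_modules(deployment, required=MODULES):
    """Return the modules that no device in the cluster hosts."""
    hosted = set().union(*deployment.values())
    return required - hosted


def devices_for(module, deployment):
    """Return the sorted names of devices whose memory holds `module`."""
    return sorted(name for name, mods in deployment.items() if module in mods)
```

A cluster jointly implements the data analysis system only when `missing_modules(deployment)` is empty; the same module may appear on several devices, which corresponds to the case where multiple computing devices perform the functions of 100A or 100B.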
An embodiment of the present application further provides a computer program product containing instructions. The computer program product may be software or a program product that contains instructions and can run on a computing device or be stored in any available medium. When the computer program product runs on at least one computing device, the at least one computing device is caused to execute method 500.
An embodiment of the present application further provides a computer-readable storage medium. The computer-readable storage medium may be any available medium that a computing device can store, or a data storage device, such as a data center, that contains one or more available media. The available medium may be a magnetic medium (e.g., a floppy disk, a hard disk, or a magnetic tape), an optical medium (e.g., a DVD), or a semiconductor medium (e.g., a solid-state drive). The computer-readable storage medium includes instructions that instruct a computing device to execute method 500.
Those skilled in the art can clearly understand that, for convenience and brevity of description, the specific working processes of the systems, apparatuses, and units described above may be found in the corresponding processes of the foregoing method embodiments and are not repeated here.
Finally, it should be noted that the above embodiments are intended only to illustrate the technical solutions of the present invention, not to limit them. Although the present invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that they may still modify the technical solutions described in the foregoing embodiments, or make equivalent replacements for some of the technical features therein, and that such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the protection scope of the technical solutions of the embodiments of the present invention.
The execution order of the processes should be determined by their functions and internal logic, and should not constitute any limitation on the implementation of the embodiments of the present application.
In the several embodiments provided in the present application, it should be understood that the disclosed systems, apparatuses, and methods may be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative. The division into units is merely a division by logical function, and other divisions are possible in actual implementation; for example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not executed. In addition, the mutual couplings, direct couplings, or communication connections shown or discussed may be indirect couplings or communication connections through some interfaces, apparatuses, or units, and may be electrical, mechanical, or in other forms.
The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, the functional units in the embodiments of the present application may be integrated into one processing unit, each unit may exist physically on its own, or two or more units may be integrated into one unit.
If the functions are implemented in the form of software functional units and sold or used as an independent product, they may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present application in essence, or the part that contributes to the prior art, or a part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to execute all or some of the steps of the methods described in the embodiments of the present application. The aforementioned storage medium includes various media that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.
The above are merely specific implementations of the present application, but the protection scope of the present application is not limited thereto. Any change or substitution that a person skilled in the art could readily conceive of within the technical scope disclosed in the present application shall fall within the protection scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.
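The claims that follow recite a feature extraction module in which a cheaper first-layer operator is tried first and a more complex second-layer operator is used only when the first fails to extract the target feature. A minimal sketch of that fallback is given here, with several assumptions: the metadata is represented as a plain numeric sequence, the two operators (`layer1_mean_shift`, `layer2_lag1_autocorr`) are hypothetical stand-ins chosen only because one is cheaper than the other, and the "first condition" is modeled as a caller-supplied predicate.

```python
from statistics import mean


def layer1_mean_shift(values):
    """Hypothetical cheap first-layer operator: relative gap between half means."""
    if len(values) < 2:
        return 0.0
    half = len(values) // 2
    first, second = mean(values[:half]), mean(values[half:])
    scale = max(abs(first), abs(second), 1e-12)  # avoid division by zero
    return abs(second - first) / scale


def layer2_lag1_autocorr(values):
    """Hypothetical costlier second-layer operator: lag-1 autocorrelation."""
    if len(values) < 2:
        return 0.0
    mu = mean(values)
    var = sum((v - mu) ** 2 for v in values)
    if var == 0:
        return 0.0
    cov = sum((a - mu) * (b - mu) for a, b in zip(values, values[1:]))
    return cov / var


def extract_target_feature(values, layers, condition):
    """Try each layer in order of increasing complexity.

    Returns (layer_index, feature_value) for the first layer whose feature
    value satisfies `condition`, or None if no layer extracts the feature.
    """
    for layer_index, operator in enumerate(layers, start=1):
        feature_value = operator(values)
        if condition(feature_value):
            return layer_index, feature_value
    return None
```

With `layers = [layer1_mean_shift, layer2_lag1_autocorr]` and a condition such as `lambda v: v > 0.4`, a steadily rising sequence is resolved by the first layer, while a sequence whose two halves have equal means falls through to the second layer. The threshold 0.4 is an arbitrary illustrative value.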

Claims (14)

  1. A data analysis system, characterized in that the system comprises:
    a feature extraction module, configured to: receive a first data set determined by a user from a data system;
    determine, according to the first data set, metadata corresponding to data of a first data subset, wherein the first data subset is the first data set or a subset of the first data set; and
    determine a target feature of the metadata based on at least two layers of feature operators, wherein the two layers of feature operators comprise a first-layer operator and a second-layer operator, the first-layer operator is used to extract the target feature based on the metadata, the second-layer operator is used to extract the target feature based on the metadata when the first-layer operator fails to extract the target feature, and the complexity of the second-layer operator is higher than that of the first-layer operator; and
    an operator recommendation module, configured to determine a first operator set based on the target feature, wherein the first operator set is used for analyzing the first data set.
  2. The system according to claim 1, characterized in that the first-layer operator is specifically used to perform a calculation on the metadata to obtain a first feature value, and when the first feature value satisfies a first condition, the first-layer operator extracts the target feature; and when the first feature value does not satisfy the first condition, the second-layer operator is specifically used to perform a calculation on the metadata to obtain a second feature value, and when the second feature value satisfies the first condition, the second-layer operator extracts the target feature.
  3. The system according to claim 1 or 2, characterized in that the operator recommendation module is specifically configured to determine the first operator set based on a reference feature set and the target feature, wherein the reference feature set is a set of feature values calculated from a training data set using a preset operator set, and the preset operator set comprises some or all of the operators of the at least two layers of operators.
  4. The system according to claim 3, characterized in that the operator recommendation module is specifically configured to determine the similarity between the target feature and the reference features in the reference feature set, and, when the similarity between the target feature and a first reference feature is greater than a similarity threshold, determine that the first operator set corresponding to the first reference feature is the operator set of the target feature, wherein the first operator set belongs to the preset operator set and the first reference feature belongs to the reference feature set.
  5. The system according to any one of claims 1 to 4, characterized in that the system further comprises: an operator evaluation module, configured to evaluate the complexity of a custom operator, wherein the complexity of the custom operator is used to determine the layer, among the at least two layers of feature operators in the feature extraction module, into which the custom operator is built.
  6. A data analysis method in a cloud service system, characterized by comprising:
    receiving, by a feature extraction module, a first data set determined by a user from a data system;
    determining, by the feature extraction module according to the first data set, metadata corresponding to data of a first data subset, wherein the first data subset is the first data set or a subset of the first data set;
    determining a target feature of the metadata based on at least two layers of feature operators, wherein the two layers of feature operators comprise a first-layer operator and a second-layer operator, the first-layer operator is used to extract the target feature based on the metadata, the second-layer operator is used to extract the target feature based on the metadata when the first-layer operator fails to extract the target feature, and the complexity of the second-layer operator is higher than that of the first-layer operator; and
    determining, by an operator recommendation module, a first operator set based on the target feature, wherein the first operator set is used for analyzing the first data set.
  7. The method according to claim 6, characterized in that the first-layer operator performs a calculation on the metadata to obtain a first feature value, and when the first feature value satisfies a first condition, the first-layer operator extracts the target feature; and when the first feature value does not satisfy the first condition, the second-layer operator performs a calculation on the metadata to obtain a second feature value, and when the second feature value satisfies the first condition, the second-layer operator extracts the target feature.
  8. The method according to claim 6 or 7, characterized in that the determining, by the operator recommendation module, of the first operator set based on the target feature comprises:
    determining, by the operator recommendation module, the first operator set based on a reference feature set and the target feature, wherein the reference feature is a set of feature values calculated from a training data set using a preset operator set, and the preset operator set comprises some or all of the operators of the two layers of operators.
  9. The method according to claim 8, characterized in that the determining, by the operator recommendation module, of the first operator set based on the reference feature set and the target feature comprises:
    determining, by the operator recommendation module, the similarity between the target feature and the reference features in the reference feature set, and, when the similarity between the target feature and a first reference feature is greater than a similarity threshold, determining that the first operator set corresponding to the first reference feature is the operator set of the target feature, wherein the first operator set belongs to the preset operator set and the first reference feature belongs to the reference feature set.
  10. The method according to any one of claims 6 to 9, characterized in that the method further comprises: evaluating, by an operator evaluation module, the complexity of a custom operator, wherein the complexity of the custom operator is used to determine the layer, among the at least two layers of feature operators in the feature extraction module, into which the custom operator is built.
  11. A computing device, characterized by comprising a processor and a memory, wherein the processor runs instructions in the memory so that the computing device executes the method according to any one of claims 6 to 10.
  12. A computing device cluster, characterized by comprising at least one computing device, each computing device comprising a processor and a memory;
    wherein the processor of the at least one computing device is configured to execute instructions stored in the memory of the at least one computing device, so that the computing device cluster executes the method according to any one of claims 6 to 10.
  13. A computer program product containing instructions, characterized in that, when the instructions are run by a computer device cluster, the computer device cluster is caused to execute the method according to any one of claims 6 to 10.
  14. A computer-readable storage medium, characterized by comprising computer program instructions, wherein, when the computer program instructions are executed by a computer device cluster, the computer device cluster executes the method according to any one of claims 6 to 10.
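Claims 3, 4, 8, and 9 describe recommending an operator set by comparing the target feature against reference features and accepting a match whose similarity exceeds a threshold. The sketch below illustrates that idea under explicit assumptions: features are represented as numeric vectors, cosine similarity is used as the similarity measure (the claims do not name one), the best match above the threshold is chosen when several qualify, and the operator names in the usage note are invented placeholders.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (assumed measure)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def recommend_operator_set(target, references, threshold):
    """Return the operator set of the most similar reference feature.

    `references` is a list of (reference_feature_vector, operator_set) pairs.
    Only references whose similarity to `target` is strictly greater than
    `threshold` qualify; None is returned when none do.
    """
    best_score, best_ops = threshold, None
    for ref_vector, ops in references:
        score = cosine_similarity(target, ref_vector)
        if score > best_score:
            best_score, best_ops = score, ops
    return best_ops
```

For example, with `references = [([1.0, 0.0], ["moving_average"]), ([0.0, 1.0], ["fft"])]` and a threshold of 0.8, a target feature of `[0.9, 0.1]` is matched to the first reference, whereas `[1.0, 1.0]` is equally distant from both and yields no recommendation.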
PCT/CN2023/135554 2022-12-22 2023-11-30 Data analysis system, method and device WO2024131499A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202211654936.7 2022-12-22
CN202211654936.7A CN118277443A (en) 2022-12-22 2022-12-22 Data analysis system, method and device

Publications (1)

Publication Number Publication Date
WO2024131499A1 true WO2024131499A1 (en) 2024-06-27

Family

ID=91587652

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2023/135554 WO2024131499A1 (en) 2022-12-22 2023-11-30 Data analysis system, method and device

Country Status (2)

Country Link
CN (1) CN118277443A (en)
WO (1) WO2024131499A1 (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180181641A1 (en) * 2015-06-23 2018-06-28 Entit Software Llc Recommending analytic tasks based on similarity of datasets
CN109784395A (en) * 2019-01-07 2019-05-21 西安交通大学 A kind of algorithm recommended method for unbalanced data
CN110490238A (en) * 2019-08-06 2019-11-22 腾讯科技(深圳)有限公司 A kind of image processing method, device and storage medium
KR20220132804A (en) * 2021-03-24 2022-10-04 경희대학교 산학협력단 Apparatus and method of recommending sampling method and classification algorithm by using metadata set
CN115187549A (en) * 2022-07-11 2022-10-14 广州小鹏自动驾驶科技有限公司 Image gray processing method, device, equipment and storage medium

Also Published As

Publication number Publication date
CN118277443A (en) 2024-07-02

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 23905666

Country of ref document: EP

Kind code of ref document: A1