WO2024131499A1

WO2024131499A1 - Data analysis system, method and device

Info

Publication number: WO2024131499A1
Application number: PCT/CN2023/135554
Authority: WO
Inventors: 李伟琪; 黄飞腾; 黄永强
Original assignee: 华为云计算技术有限公司
Priority date: 2022-12-22
Filing date: 2023-11-30
Publication date: 2024-06-27
Also published as: CN118277443A

Abstract

A data analysis system, comprising: a feature extraction module which comprises at least two layers of feature operators: a first layer of operators and a second layer of operators, wherein the first layer of operators extracts a target feature on the basis of metadata, when the target feature is not extracted by the first layer of operators, the second layer of operators extracts the target feature, the complexity of the second layer of operators is higher than that of the first layer of operators, the metadata comprises a data subset determined by a first user for a first data set, and the first data set comprises a set of data; and an operator recommendation module, used for determining a first operator set on the basis of the target feature, the first operator set being used for analyzing the first data set.

Description

A data analysis system, method and device

This application claims priority to the Chinese patent application filed with the State Intellectual Property Office of China on December 22, 2022, with application number 202211654936.7 and the invention title “A data analysis system, method and device”, the entire contents of which are incorporated by reference in this application.

Technical Field

The present application relates to the field of information technology, and more specifically, to a data analysis system, method and device.

Background

With the development of information technology, the demand for databases is increasing. Currently, there are many types of databases, such as relational databases and time series databases. Among them, the demand for time series databases has increased significantly.

The data stored in a time series database is usually called time series data. Time series analysis based on time series data can uncover the deeper value in that data. Common time series analysis tasks include time series anomaly detection, time series prediction, clustering, and association analysis. For time series data with large data volumes and complex, diverse formats, existing solutions can serve basic time series anomaly detection and prediction scenarios. For example, the cloud vendor AWS can provide low-computation intelligent analysis capabilities based on metadata such as field attributes and data dimensions. As another example, with a high-performance RCF algorithm, a single algorithm can meet simple user needs. However, as the types of data become richer and databases grow larger, the complexity of database-based algorithm recommendation increases significantly, and the efficiency of algorithm recommendation for time series data is very low.

Therefore, how to recommend algorithms for complex databases, improve the efficiency of algorithm recommendations, and enhance user experience has become a technical problem that needs to be solved urgently.

Summary of the invention

The present application provides a data analysis system, method and device that adopt a stepwise feature extraction framework to extract features of the data that interests the user. The framework can quickly extract the data features of the data source and then quickly recommend suitable analysis operators for that data source, thereby improving both the efficiency of algorithm recommendation in complex data scenarios and the user experience.

In a first aspect, a data analysis system is provided, comprising: a feature extraction module, configured to: receive a first data set determined by a user from a data system; determine, based on the first data set, metadata corresponding to data of a first data subset, the first data subset being the first data set or a subset of the first data set; and determine target features of the metadata based on at least two layers of feature operators, the at least two layers of feature operators comprising a first layer of operators and a second layer of operators, the first layer of operators being configured to extract the target features based on the metadata, the second layer of operators being configured to extract the target features based on the metadata when the first layer of operators fails to extract the target features, the complexity of the second layer of operators being higher than the complexity of the first layer of operators; and an operator recommendation module, configured to determine a first operator set based on the target features, the first operator set being used for analyzing the first data set.

In the present application, the first data set is a data source selected by a user from a data system, for example, a data source of interest selected by the user. The first data set is used as an input of the feature extraction module.

In the present application, the first data set may be a first time series, and the first time series includes a set of data arranged in time sequence. That is, the first data set may include a set of time series data.

In the present application, the first data set may include other data types, such as non-time series data, etc., which is not limited in the embodiments of the present application.

In the above technical solution, the feature extraction module includes multiple layers of operators with different complexities. Calculation proceeds layer by layer, starting from the operators with the lowest complexity, until the data features are captured. The operator recommendation module then recommends a suitable set of operators. This stepwise feature extraction framework can improve the efficiency of data feature extraction and then quickly recommend suitable analysis operators for the data source, thereby improving the efficiency of algorithm recommendation and the user experience in complex data scenarios.

In a possible implementation, the first-layer operator is specifically used to calculate the metadata to obtain a first feature value, and when the first feature value satisfies a first condition, the first-layer operator extracts the target feature; when the first feature value does not satisfy the first condition, the second-layer operator is specifically used to calculate the metadata to obtain a second feature value, and when the second feature value satisfies the first condition, the second-layer operator extracts the target feature.

Based on the above technical solution, operator layers of different complexity calculate the metadata to obtain feature values, and determine whether the data features are extracted based on whether the feature values meet preset conditions. The preset conditions can be set according to user needs, and the method is more applicable.
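The layered flow described above can be sketched as follows. This is only an illustration under stated assumptions: the two operators (`spread_ratio`, `lag1_autocorrelation`), the helper `extract_target_feature`, and the first condition "feature value greater than 0.5" are hypothetical and are not taken from the application.

```python
from statistics import mean, pstdev
from typing import Callable, List, Optional, Sequence, Tuple

# Hypothetical operator type: maps sampled data to one feature value.
Operator = Callable[[Sequence[float]], float]


def spread_ratio(values: Sequence[float]) -> float:
    """Cheap (assumed layer-1) operator: population std dev over |mean|."""
    m = mean(values)
    return pstdev(values) / abs(m) if m else 0.0


def lag1_autocorrelation(values: Sequence[float]) -> float:
    """Costlier (assumed layer-2) operator: lag-1 autocorrelation."""
    m = mean(values)
    denom = sum((v - m) ** 2 for v in values)
    if denom == 0:
        return 0.0
    num = sum((a - m) * (b - m) for a, b in zip(values, values[1:]))
    return num / denom


def extract_target_feature(
    values: Sequence[float],
    layers: List[List[Operator]],
    condition: Callable[[float], bool],
) -> Optional[Tuple[int, str, float]]:
    """Try layers from lowest to highest complexity.

    Returns (layer number, operator name, feature value) for the first
    feature value that satisfies the condition, or None if no layer does.
    """
    for index, operators in enumerate(layers, start=1):
        for op in operators:
            value = op(values)
            if condition(value):
                return index, op.__name__, value
    return None


# Assumed first condition: the feature value must exceed 0.5.
layers = [[spread_ratio], [lag1_autocorrelation]]
result = extract_target_feature(
    [100, 101, 102, 103, 104, 105, 106, 107, 108, 109],
    layers,
    lambda value: value > 0.5,
)
```

In this sketch the cheap operator is evaluated first, and the costlier operator runs only when the cheap feature value does not satisfy the condition, which is where the efficiency gain of the stepwise framework would come from.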

In one possible implementation, the operator recommendation module is specifically used to determine the first operator set based on a reference feature set and the target feature, wherein the reference feature set is a set of feature values calculated based on a training data set using a preset operator set, and the preset operator set includes some or all of the operators of the two layers of operators.

Based on the above technical solution, the operator recommendation module determines the operator set of the target feature based on the feature set corresponding to the preset training data, so as to recommend a more suitable operator set.

In a possible implementation, the operator recommendation module is specifically used to determine the similarity between the target feature and the reference features in the reference feature set. When the similarity between the target feature and the first reference feature is greater than a similarity threshold, the first operator set corresponding to the first reference feature is determined to be the operator set of the target feature, the first operator set belongs to the preset operator set, and the first reference feature belongs to the reference feature set.
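A minimal sketch of this similarity-based selection is shown below. The application does not name a similarity measure, so cosine similarity is an assumption here, as are the function names, the example feature vectors, the operator names, and the threshold of 0.9.

```python
import math
from typing import List, Optional, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two feature vectors (assumed similarity measure)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def recommend_operator_set(
    target: Sequence[float],
    references: List[Tuple[Sequence[float], List[str]]],
    threshold: float = 0.9,
) -> Optional[List[str]]:
    """Return the operator set of the most similar reference feature whose
    similarity to the target exceeds the threshold, or None if none does."""
    best_set, best_similarity = None, threshold
    for reference_feature, operator_set in references:
        similarity = cosine_similarity(target, reference_feature)
        if similarity > best_similarity:
            best_set, best_similarity = operator_set, similarity
    return best_set


# Hypothetical reference feature set, each entry paired with an operator set.
references = [
    ((1.0, 0.0), ["moving_average", "threshold_detector"]),
    ((0.0, 1.0), ["wavelet_decomposition", "arima"]),
]
recommended = recommend_operator_set((0.9, 0.1), references)
```

When several reference features exceed the threshold, this sketch picks the most similar one; the application only requires that the similarity exceed the threshold, so that tie-breaking rule is a design choice of the example.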

In a possible implementation, the system further includes: an operator evaluation module, configured to evaluate the complexity of a custom operator, wherein the complexity of the custom operator is used to determine whether to embed the custom operator into one of at least two layers of feature operators in the feature extraction module.
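One way such an evaluation could work is sketched below, assuming that complexity is approximated by measured run time on a data sample and that each layer has a time budget. The functions `measure_cost` and `assign_layer` and the budget values are hypothetical; the application does not specify how complexity is measured.

```python
import time
from typing import Callable, Optional, Sequence


def measure_cost(
    operator: Callable[[Sequence[float]], object],
    sample: Sequence[float],
    repeats: int = 5,
) -> float:
    """Approximate an operator's complexity as its best run time in seconds."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        operator(sample)
        best = min(best, time.perf_counter() - start)
    return best


def assign_layer(cost: float, layer_budgets: Sequence[float]) -> Optional[int]:
    """Return the first layer (1-based) whose budget covers the cost.

    layer_budgets must be sorted in ascending order. None means the
    operator is too expensive to embed in any layer.
    """
    for layer, budget in enumerate(layer_budgets, start=1):
        if cost <= budget:
            return layer
    return None


# Assumed budgets: layer 1 allows up to 1 ms, layer 2 up to 10 ms.
budgets = [0.001, 0.01]
custom_cost = measure_cost(sum, [1.0, 2.0, 3.0])
layer = assign_layer(custom_cost, budgets)
```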

In a second aspect, a data analysis method in a cloud service system is provided. The method can be applied to a data analysis system architecture, or can be executed by components (such as chips or circuits) in the cloud service system architecture, without limitation.

The method includes: a feature extraction module receives a first data set determined by a user from a data system; the feature extraction module determines metadata corresponding to data of a first data subset based on the first data set, the first data subset being the first data set or a subset of the first data set; based on at least two layers of feature operators, a target feature of the metadata is determined, the two layers of feature operators include a first layer of operators and a second layer of operators, the first layer of operators is used to extract the target feature based on the metadata, the second layer of operators is used to extract the target feature based on the metadata when the first layer of operators fail to extract the target feature, the complexity of the second layer of operators is higher than the complexity of the first layer of operators; an operator recommendation module determines a first operator set based on the target feature, the first operator set is used for analyzing the first data set.

In a possible implementation, the first-layer operator is specifically used to calculate the metadata to obtain a first feature value, and when the first feature value satisfies a first condition, the first-layer operator extracts the target feature; when the first feature value does not satisfy the first condition, the second-layer operator is specifically used to calculate the metadata to obtain a second feature value, and when the second feature value satisfies the first condition, the second-layer operator extracts the target feature.

In one possible implementation, the operator recommendation module determines the first operator set based on a reference feature set and the target feature, wherein the reference feature set is a set of feature values calculated based on a training data set using a preset operator set, and the preset operator set includes some or all of the operators of the two layers of operators.

In one possible implementation, the operator recommendation module determines the similarity between the target feature and the reference features in the reference feature set. When the similarity between the target feature and the first reference feature is greater than a similarity threshold, the first operator set corresponding to the first reference feature is determined to be the operator set of the target feature, the first operator set belongs to the preset operator set, and the first reference feature belongs to the reference feature set.

In a possible implementation, the method further includes: an operator evaluation module evaluates the complexity of a custom operator, where the complexity of the custom operator is used to determine whether to embed the custom operator into one of the at least two layers of feature operators in the feature extraction module.

In a third aspect, a cloud service system is provided, the system comprising: at least one processor, configured to execute a computer program or instruction stored in a memory, so as to execute the method in any possible implementation of the second aspect. Optionally, the system further comprises a memory, configured to store the computer program or instruction. Optionally, the system further comprises a communication interface, and the processor reads the computer program or instruction stored in the memory through the communication interface.

In a fourth aspect, the present application provides a processor, comprising: an input circuit, an output circuit, and a processing circuit. The processing circuit is used to receive a signal through the input circuit and transmit a signal through the output circuit, so that the processor executes the method in any possible implementation of the second aspect.

In a specific implementation, the processor may be one or more chips, the input circuit may be an input pin, the output circuit may be an output pin, and the processing circuit may be a transistor, a gate circuit, a flip-flop, or various other logic circuits. The input signal received by the input circuit may be received and input by, for example, but not limited to, a transceiver, and the signal output by the output circuit may be, for example, but not limited to, output to a transmitter and transmitted by the transmitter. The input circuit and the output circuit may be the same circuit, used as an input circuit and as an output circuit at different times. The embodiments of the present application do not limit the specific implementation of the processor and the various circuits.

For the operations such as sending and acquiring/receiving involved in the processor, unless otherwise specified, or unless they conflict with their actual function or internal logic in the relevant description, they can be understood as operations such as processor output, reception, input, etc., or as sending and receiving operations performed by the radio frequency circuit and antenna, and this application does not limit this.

In a fifth aspect, a processing device is provided, comprising a processor and a memory. The processor is used to read instructions stored in the memory, and can receive signals through a receiver and transmit signals through a transmitter to execute the method in any possible implementation of the second aspect.

Optionally, the number of the processors is one or more, and the number of the memories is one or more.

Optionally, the memory may be integrated with the processor, or the memory may be provided separately from the processor.

In the specific implementation process, the memory can be a non-transitory memory, such as a read-only memory (ROM), which can be integrated with the processor on the same chip or can be separately set on different chips. The embodiments of the present application do not limit the type of memory and the setting method of the memory and the processor.

It should be understood that in the related data interaction process, sending indication information, for example, can be a process in which the processor outputs the indication information, and receiving capability information can be a process in which the processor receives the input capability information. Specifically, the data output by the processor can be output to the transmitter, and the input data received by the processor can come from the receiver. The transmitter and the receiver can be collectively referred to as a transceiver.

The processing device in the fifth aspect may be one or more chips. The processor in the processing device may be implemented by hardware or software. When implemented by hardware, the processor may be a logic circuit, an integrated circuit, etc.; when implemented by software, the processor may be a general-purpose processor implemented by reading software code stored in a memory, which may be integrated in the processor or located outside the processor and exist independently.

In a sixth aspect, a chip is provided, which obtains instructions and executes the instructions to implement the method in the above-mentioned second aspect and any one of the implementation methods of the second aspect.

Optionally, as an implementation manner, the chip includes a processor and a data interface, and the processor reads instructions stored in the memory through the data interface to execute the method in the above-mentioned second aspect and any one of the implementation manners of the second aspect.

Optionally, as an implementation method, the chip may also include a memory, in which instructions are stored, and the processor is used to execute the instructions stored in the memory. When the instructions are executed, the processor is used to execute the method in the second aspect and any one of the implementation methods of the second aspect.

In a seventh aspect, a computer program product is provided, the computer program product comprising: a computer program code, when the computer program code is run on a computer, the computer executes the method in the above-mentioned second aspect and any one of the implementations of the second aspect.

In an eighth aspect, a computer-readable storage medium is provided, comprising instructions; the instructions are used to implement the method in the above-mentioned second aspect and any one of the implementation manners of the second aspect.

By way of example, these computer-readable storage media include, but are not limited to, one or more of the following: read-only memory (ROM), programmable ROM (PROM), erasable PROM (EPROM), flash memory, electrically erasable PROM (EEPROM), and hard drive.

Optionally, as an implementation manner, the above-mentioned storage medium may specifically be a non-volatile storage medium.

In a ninth aspect, a computing device is provided, comprising a processor and a memory, wherein the processor of the computing device is used to execute instructions stored in the memory so that the computing device executes any possible implementation method of the second aspect.

In a tenth aspect, a computing node cluster is provided, which includes at least one computing node, each computing node includes a processor and a memory, and the processor of the at least one computing node is used to execute instructions stored in the memory of the at least one computing node, so that the computing node cluster executes any possible implementation method of the second aspect above.

Brief description of the drawings

FIG. 1 is a schematic diagram of a system architecture provided in an embodiment of the present application.

FIG. 2 is a schematic block diagram of a data analysis system provided in an embodiment of the present application.

FIG. 3 is a schematic flow chart of an operator layering method provided in an embodiment of the present application.

FIG. 4 is a schematic diagram of the structure of a feature extraction module provided in an embodiment of the present application.

FIG. 5 is a schematic diagram of a data analysis method provided in an embodiment of the present application.

FIG. 6 is a flowchart of an operator recommendation provided in an embodiment of the present application.

FIG. 7 is a schematic block diagram of a computing device 100 provided in the present application.

FIG. 8 is a schematic block diagram of a computing device cluster provided by the present application.

FIG. 9 is a schematic block diagram of another computing device cluster provided by the present application.

Detailed description

The technical solution in this application will be described below in conjunction with the accompanying drawings.

The present application will present various aspects, embodiments or features around a system including multiple devices, components, modules, etc. It should be understood and appreciated that each system may include additional devices, components, modules, etc., and/or may not include all devices, components, modules, etc. discussed in conjunction with the figures. In addition, combinations of these schemes may also be used.

In addition, in the embodiments of the present application, words such as "exemplary" and "for example" are used to indicate examples, illustrations or descriptions. Any embodiment or design described as "exemplary" in the present application should not be interpreted as being more preferred or more advantageous than other embodiments or designs. Specifically, the use of the word "exemplary" is intended to present concepts in a concrete way.

In the embodiments of the present application, “corresponding” and “relevant” may sometimes be used interchangeably. It should be noted that when the distinction between them is not emphasized, the meanings they intend to express are consistent.

The business scenarios described in the embodiments of the present application are intended to more clearly illustrate the technical solutions of the embodiments of the present application, and do not constitute a limitation on the technical solutions provided in the embodiments of the present application. A person of ordinary skill in the art can appreciate that, with the evolution of network architecture and the emergence of new business scenarios, the technical solutions provided in the embodiments of the present application are also applicable to similar technical problems.

References to "one embodiment" or "some embodiments" etc. described in this specification mean that a particular feature, structure or characteristic described in conjunction with the embodiment is included in one or more embodiments of the present application. Thus, the phrases "in one embodiment", "in some embodiments", "in some other embodiments", etc. appearing in different places in this specification do not necessarily all refer to the same embodiment, but mean "one or more but not all embodiments", unless otherwise specifically emphasized. The terms "including", "comprising", "having" and their variations all mean "including but not limited to", unless otherwise specifically emphasized.

In this application, "at least one" means one or more, and "plurality" means two or more. "And/or" describes the association relationship of associated objects, indicating that three relationships may exist. For example, A and/or B can mean: including the existence of A alone, the existence of A and B at the same time, and the existence of B alone, where A and B can be singular or plural. The character "/" generally indicates that the previous and next associated objects are in an "or" relationship. "At least one of the following" or similar expressions refers to any combination of these items, including any combination of single or plural items. For example, at least one of a, b, or c can mean: a, b, c, a-b, a-c, b-c, or a-b-c, where a, b, c can be single or multiple.
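The enumeration in the last example can be reproduced mechanically. The short sketch below lists every non-empty combination of the items; the function name `at_least_one_of` is hypothetical and serves only to illustrate the definition.

```python
from itertools import combinations
from typing import List, Sequence


def at_least_one_of(items: Sequence[str]) -> List[str]:
    """List every non-empty combination of items, smallest combinations first."""
    return [
        "-".join(combo)
        for size in range(1, len(items) + 1)
        for combo in combinations(items, size)
    ]


options = at_least_one_of(["a", "b", "c"])
# options == ["a", "b", "c", "a-b", "a-c", "b-c", "a-b-c"]
```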

FIG. 1 shows a schematic diagram of a system architecture provided by an embodiment of the present application. As shown in FIG. 1, a client can access a cloud management platform via the Internet. The cloud management platform is connected to the internal network of a data center. Typically, a data center includes multiple servers; the data center shown in the figure includes two. Taking server #1 as an example, the server includes a software layer and a hardware layer. For example, the software layer can include multiple virtual machines and a host operating system, where the host operating system includes a virtual machine manager and a cloud management platform client; the hardware layer can include a processor, memory, a hard disk, a network card, a data bus, and so on.

The cloud management platform provides an access interface (for example, an interface or an application programming interface (API)). A tenant can operate the client to remotely access the access interface, register a cloud account and password on the cloud management platform, and log in to the cloud management platform. After the cloud management platform successfully authenticates the cloud account and password, the tenant can select and pay for a virtual machine of specific specifications (processor, memory, disk) on the cloud management platform. After the payment succeeds, the cloud management platform provides the remote login account and password of the purchased virtual machine, and the client can remotely log in to the virtual machine and install and run the tenant's application in it. The cloud management platform client can receive control plane commands sent by the cloud management platform, create and manage virtual machines on the server according to those commands, and manage the full life cycle of the virtual machines. Therefore, the tenant can create, manage, log in to, and operate virtual machines in the cloud data center through the cloud management platform. A virtual machine can also be called a "cloud server (elastic compute service, ECS)", an "elastic instance", and so on; the name varies between cloud service providers.

To facilitate understanding of the embodiments of the present application, some terms appearing in the embodiments of the present application are explained.

1. Operator Set

A combination of operators. In some scenarios, a single operator cannot meet the requirements, and at least two operators need to be combined to perform calculations on the data.

2. Time Series Database

A database that stores time series data. It can accommodate large-scale time series data and supports basic data analysis, such as querying and compressing time series data, as well as aggregation, downsampling, and statistics in time series scenarios.
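As an illustration of the downsampling and aggregation mentioned above, the following sketch groups (timestamp, value) points into fixed-width time buckets and averages each bucket. The function `downsample`, the 60-second bucket width, and the sample points are assumptions made for the example; a real time series database would expose this through its own query interface.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, List, Sequence, Tuple


def downsample(
    points: Sequence[Tuple[int, float]], bucket_seconds: int
) -> List[Tuple[int, float]]:
    """Average the values of all points that fall in the same time bucket.

    Each point is (timestamp in seconds, value). The result holds one
    (bucket start, mean value) pair per non-empty bucket, in time order.
    """
    buckets: Dict[int, List[float]] = defaultdict(list)
    for timestamp, value in points:
        bucket_start = timestamp - timestamp % bucket_seconds
        buckets[bucket_start].append(value)
    return [(start, mean(values)) for start, values in sorted(buckets.items())]


raw = [(0, 1.0), (30, 3.0), (60, 5.0), (90, 7.0), (120, 9.0)]
per_minute = downsample(raw, 60)
```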

3. Time Series Analysis

Time series analysis is a high-level analysis method for time series data that is independent of the basic analysis of time series databases. It includes time series anomaly detection, time series prediction, clustering, association analysis, etc. It uses different analysis methods to mine the deeper value of time series data. Common methods include statistical methods, Bayesian analysis methods, deep learning methods, and machine learning methods.

4. Feature extraction

Feature extraction has many applications in machine learning, pattern recognition and image processing. Feature extraction starts from an initial measured data set and then constructs informative and non-redundant derived values, called features. It can help subsequent learning and inductive steps, and in some cases make it easier for people to make better interpretations of the data. Feature extraction is a dimensionality reduction step, where the initial data set is reduced to more manageable groups (features) for learning, while maintaining the accuracy and completeness of the description of the original data set.

5. Recommendation System

A recommendation system (RS) mainly refers to technology that uses collaborative intelligence to make recommendations. A personalized recommendation system can effectively address the problem of information overload. A recommendation system provides users with a sorted, personalized list of recommended items based on their historical preferences and constraints. A more accurate recommendation system can improve the user experience. Recommendation results can usually be generated based on user preferences, product features, user-product transactions, and other environmental factors (such as time, season, and location). Recommended items can include movies, books, restaurants, news items, etc.

6. Data features

In the present application, data features include features of the data itself and/or extracted data features.

The features of the data itself are the features that the data exhibits in the time series, for example, the data arrangement period, the data change trend, or the data fluctuation. Correspondingly, the data describing these features includes data arrangement period data, data change trend data, or data fluctuation data. The data arrangement period is the period with which the data is arranged in the time series, if the data is arranged periodically; for example, the data arrangement period data includes the period duration (that is, the time interval between two periods) and/or the number of periods. The data change trend data reflects the trend with which the data in the time series changes; for example, it includes continuous growth, continuous decline, a rise followed by a fall, a fall followed by a rise, or conformance to a normal distribution. The data fluctuation data reflects the fluctuation state of the data in the time series; for example, it includes a function that characterizes the fluctuation curve of the time series, or a specified value of the time series, such as the maximum, minimum, or average value.

The extracted data features are features obtained in the process of extracting the data in the time series, for example, statistical features, fitting features, or frequency domain features. Correspondingly, the extracted feature data includes statistical feature data, fitting feature data, or frequency domain feature data.

Statistical features are the statistical features of the time series. They are divided into quantitative features and attribute features, and quantitative features are further divided into measurement features and counting features. Quantitative features can be expressed directly as numerical values: for example, the consumption values of resources such as CPU, memory, and IO are measurement features, while the number of anomalies and the number of devices working normally are counting features. Attribute features cannot be expressed directly as numerical values, for example, whether a device is abnormal or whether a device is down. The features in the statistical features are the indicators to be examined when statistics are computed. For example, the statistical feature data includes the moving average (Moving_average) and the weighted average (Weighted_mv).

Fitting features are the features of the time series during fitting, and the fitting feature data reflects the features of the time series that are used for fitting; for example, it includes the algorithm used for fitting, such as ARIMA.

Frequency domain features are the features of the time series in the frequency domain; for example, the frequency domain feature data includes data on the regularity that the distribution of the time series follows in the frequency domain, such as the proportion of high-frequency components in the time series. Optionally, the frequency domain feature data can be obtained by performing wavelet decomposition on the time series.

In the existing technology, basic time series anomaly detection and prediction scenarios can be served for time series data with large data volumes and complex, diverse formats. For example, the cloud vendor AWS can provide low-computation visualization and intelligent analysis capabilities based on metadata such as field attributes and data dimensions. As another example, with a high-performance RCF algorithm, a single algorithm can meet simple user needs. As yet another example, the academic community hopes to consider both the metadata and the data itself, but the amount of calculation then increases linearly with the number of preset algorithms and the amount of data, which makes this approach difficult to put into practice. Therefore, as the types of data become richer and databases grow larger, the complexity of database-based algorithm recommendation increases significantly, and the efficiency of algorithm recommendation for time series data is very low.

In view of this, in this application, a data analysis system is proposed for complex databases, which can improve the efficiency of algorithm recommendation and enhance user experience.

FIG. 2 is a schematic diagram of a data analysis system 200 proposed in an embodiment of the present application.

As shown in Figure 2, the system consists of four parts: front end, computing engine, database and recommendation engine.

The front end is used for users to execute operation commands and display operation results.

For example, users can select the data source of interest on the UI front end, and the front end can display the recommendation page generated by the back end based on the selected data source, and generate a recommended operator set based on the automatically defined analysis task. For another example, users can enter custom operators on the front end to expand feature operators based on preset operators.

The computing engine is used to evaluate user-defined operators. The computing engine includes an evaluation module, which evaluates the complexity of a user-defined operator and inputs the evaluation result into the database.

The database is used to store the operator library and the data input by the front end. For example, the custom operator evaluated by the computing engine can be added to the operator library, which includes multiple layers of operators of different complexity, and operators of different complexity are used to capture data features. The system automatically loads the corresponding metadata and partial data sampling for the data input by the front end.

The recommendation engine is used to recommend operators for the metadata and the data sampling. The recommendation engine includes a feature extraction module and an operator recommendation module. The feature extraction module is used to calculate the feature values corresponding to different feature operators. The operator recommendation module is used to determine the operator set corresponding to the features obtained by the feature extraction module, based on the feature values obtained from the training data set and the preset operator set, thereby realizing the recommendation of the operator set.

In the present application, training data and preset operator sets are stored in the database. The training data can be understood as a data set determined by the user based on the importance of historical application needs. The training data has been labeled, that is, the corresponding operators of the training data are known; the preset operator set is a set of feature operators determined by the user based on historical calculations.

It should be noted that, in the present application, the feature extraction module uses a feature operator to calculate the data to obtain a feature value, wherein the feature operator can be processed in layers.

The specific hierarchical processing flow is shown in FIG. 3.

Step a: Calculate the labeled training data set using all preset operator sets to obtain the eigenvalues of each training data set and the corresponding operator set, that is, the relationship matrix between the operator and the eigenvalue.

Step b: Obtain the relationship matrix between operators through convolution between operators.

Step c: Weight the relationship between operators and the performance of operators to obtain the operator stratification.

It can be understood that operator stratification divides the preset operators into different layers by comprehensively considering the associations between operators and their performance factors, and operators in different layers have different complexities.
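Steps a to c can be illustrated with the following minimal sketch. It is not the stratification method of the application: the scoring rule, the weights w_relation and w_cost, the use of Pearson correlation as the relationship between operators (in place of the convolution mentioned in step b), and the equal-size split into layers are all assumptions chosen only to show how relationship and performance could be weighted together.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equally long columns (0.0 if one is constant)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / math.sqrt(vx * vy)


def stratify(columns, costs, num_layers=3, w_relation=0.5, w_cost=0.5):
    """Split operators into layers ordered from low to high score.

    columns: operator name -> feature values over the training sets (step a).
    costs:   operator name -> measured cost (hypothetical performance factor).
    """
    names = sorted(columns)
    max_cost = max(costs[name] for name in names)
    if max_cost <= 0:
        raise ValueError("costs must be positive")
    scores = {}
    for name in names:
        others = [o for o in names if o != name]
        # Step b (assumed form): mean absolute correlation with other operators.
        relation = (sum(abs(pearson(columns[name], columns[o])) for o in others)
                    / len(others)) if others else 0.0
        # Step c (assumed form): cheap, well-connected operators score low.
        scores[name] = (w_cost * costs[name] / max_cost
                        + w_relation * (1.0 - relation))
    ranked = sorted(names, key=lambda name: scores[name])
    size = math.ceil(len(ranked) / num_layers)
    return [ranked[i:i + size] for i in range(0, len(ranked), size)]
```

Because the stratification depends only on the training data and the preset operators, a routine of this kind can be run offline and its result stored with the operator library.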

For example, FIG. 4 is a schematic diagram of the structure of a feature extraction module proposed in an embodiment of the present application.

FIG. 4 shows an operator hierarchical structure, where the first layer of operators are ultra-lightweight operators, the second layer of operators are lightweight operators, and the third layer of operators are heavy operators. The complexity of ultra-lightweight operators is lower than that of lightweight operators, and the complexity of lightweight operators is lower than that of heavy operators.

It should be understood that the relationship between the above operators is as follows: data that cannot be processed by a lower-layer operator can be processed by an upper-layer operator. That is, if the ultra-lightweight operator cannot process the data, the lightweight operator can process it; if the lightweight operator cannot process the data, the heavy operator can process it.

The three levels of operator stratification shown in FIG. 4 are only exemplary. In fact, the operator stratification may include at least two layers of feature operators. For example, the feature extraction module includes a first layer of operators and a second layer of operators. For another example, the feature extraction module includes a first layer of operators, a second layer of operators, a third layer of operators, and a fourth layer of operators. This embodiment of the application is not limited to this.

It should be understood that after the operators are layered, data that can be processed by the lower-layer operators does not need to be processed by the upper-layer operators, that is, no further calculation is performed. Since most data is stable and its features can be extracted by the lower-layer operators, the efficiency of data feature extraction is significantly improved.

It should be understood that the results of operator stratification can be pre-configured on the analysis platform without affecting online performance.

It should be noted that in this application, users can add custom operators to the original preset operator set. Each custom operator needs to be calculated once with the training data set, after which its relationship matrix with the other operators is calculated. Since the built-in operators have already formed clusters, the relationship does not need to be calculated against every built-in operator one by one; it only needs to be calculated against the clusters.
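The saving described above can be sketched as follows: a custom operator is compared with one representative per cluster rather than with every built-in operator. Representing each cluster by a centroid column, and using Euclidean distance as the relationship measure, are assumptions made for this illustration.

```python
import math


def nearest_cluster(custom_column, centroids):
    """Return (cluster name, distance) of the centroid closest to the custom operator.

    custom_column: feature values of the custom operator over the training sets.
    centroids:     cluster name -> representative column (hypothetical structure).
    """
    if not centroids:
        raise ValueError("at least one cluster is required")
    name = min(centroids,
               key=lambda c: math.dist(custom_column, centroids[c]))
    return name, math.dist(custom_column, centroids[name])
```

With k clusters and n built-in operators, this performs k comparisons instead of n, which is where the reduction in calculation comes from when k is much smaller than n.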

It should be noted that the data analysis system proposed in the present application can be used for the analysis of time series data as well as for the analysis of other types of data, and the embodiments of the present application are not limited to this.

The following describes in detail the method for providing data analysis in the system architecture of FIG. 2 in conjunction with FIG. 5 .

FIG. 5 is a flow chart of a method 500 for providing data analysis proposed in an embodiment of the present application, which specifically includes steps S510 to S540.

S510: The feature extraction module receives a first data set determined by a user from a data system.

The first data set is a data source selected by a user from a data system, for example, a data source of interest selected by the user.

The first data set serves as input to the feature extraction module.

In this application, the data system may be a database, etc.

In this application, the first data set is described by taking the first time series as an example.

It can be understood that in the present application, the first data set may include other data types, such as non-time series data, etc., and the embodiments of the present application are not limited to this.

S520: The feature extraction module determines metadata corresponding to the data of the first data subset according to the first data set.

The first data subset may be the first data set, or a subset of the first data set.

In the present application, the feature extraction module can determine metadata corresponding to the data of the first data set based on the first data set, that is, after the user selects a data source of interest, the system can automatically load the metadata corresponding to the data source.

Alternatively, the feature extraction module may determine metadata corresponding to the subset of the first data set based on the subset of the first data set, that is, after the user selects a data source of interest, the system may automatically load sampled data of the data source and metadata corresponding to the sampled data.

It should be understood that for a set of time series data with a large amount of data and complex format, if only metadata is considered for analysis, the recommended operators will be biased. Comprehensive consideration of metadata and partial data sampling can reduce the amount of calculation compared to considering all data, and can improve the effect of recommended operators compared to considering only metadata.
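As a minimal sketch of loading metadata together with partial data sampling, the following example draws a fixed-size random sample from a time series and records a few descriptive fields. The metadata fields (count, start, end, sampled), the function name, and the uniform random sampling strategy are assumptions for illustration only.

```python
import random


def load_metadata_and_sample(points, sample_size, seed=0):
    """Return (metadata, sample) for a list of (timestamp, value) tuples.

    points is assumed to be sorted by timestamp; the sample keeps that order.
    """
    rng = random.Random(seed)
    k = min(sample_size, len(points))
    indices = sorted(rng.sample(range(len(points)), k))
    sample = [points[i] for i in indices]
    metadata = {
        "count": len(points),
        "start": points[0][0] if points else None,
        "end": points[-1][0] if points else None,
        "sampled": k,
    }
    return metadata, sample
```

The later feature operators then run on the metadata and the k sampled points rather than on all points, so the amount of calculation no longer grows with the full size of the data source.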

S530: The feature extraction module determines target features of the metadata based on at least two layers of feature operators.

In the present application, the target features include data features of the first time series.

Exemplarily, the target feature may include a target feature vector.

The feature extraction module calculates the input metadata to determine the target features.

In the present application, the feature extraction module includes at least two layers of feature operators, namely, a first layer operator and a second layer operator. The first layer operator is used to extract target features based on metadata. When the first layer operator fails to extract the target features, the second layer operator extracts the target features based on metadata. The complexity of the second layer operator is higher than that of the first layer operator.

It should be understood that the feature extraction module may also include a third layer operator, a fourth layer operator, etc. The embodiment of the present application takes the first layer operator and the second layer operator of at least two layers of operators as an example, and the embodiment of the present application does not limit this.

The following is a detailed description of the process of extracting target features by the multi-layer feature extraction operator in the feature extraction module.

After the metadata is input into the feature extraction module, it first passes through the first layer operator and obtains the first eigenvalue through calculation by the first layer operator. When the first eigenvalue meets the first condition, the first layer operator extracts the target feature and completes the feature extraction process. If the first eigenvalue does not meet the first condition, the metadata enters the second layer operator and obtains the second eigenvalue through calculation by the second layer operator. When the second eigenvalue meets the first condition, the second layer operator extracts the target feature and completes the feature extraction process.

It can be understood that if the second eigenvalue does not meet the first condition, the third layer operator calculates the third eigenvalue to determine whether the third eigenvalue meets the first condition, thereby determining whether the target feature is extracted, and so on, until the first condition is met and the feature extraction process is ended.

The eigenvalue may specifically be an eigenvalue vector, which includes multiple eigenvalues. The following is the definition of an eigenvalue vector:
$$C = (c_1, c_2, c_3, \ldots, c_n)$$

C is the eigenvalue vector, n is the number of feature operators, and each feature operator corresponds to one eigenvalue. The eigenvalue corresponding to an uncalculated feature operator is 0. An uncalculated feature operator can be understood as follows: when the first eigenvalue vector satisfies the first condition, the second-layer operators do not need to be calculated, and the eigenvalues corresponding to the feature operators in the second layer are 0.
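The layer-by-layer extraction and the zero entries for uncalculated operators can be sketched as follows. The function name, the representation of operators as callables, and the returned layer index are assumptions for illustration. The first condition is passed in as a callable because the application leaves its form open.

```python
def extract_features(metadata, layers, condition):
    """Run operator layers from lowest to highest complexity.

    layers:    list of layers, each a list of callables taking the metadata.
    condition: callable applied to the values of the current layer.
    Returns (eigenvalue vector, index of the layer that satisfied the
    condition, or None if no layer did).
    """
    total = sum(len(layer) for layer in layers)
    vector = [0.0] * total          # uncalculated operators stay at 0
    offset = 0
    for depth, layer in enumerate(layers):
        values = [op(metadata) for op in layer]
        vector[offset:offset + len(layer)] = values
        offset += len(layer)
        if condition(values):
            return vector, depth    # stop: upper layers are never run
    return vector, None
```

When the first layer already satisfies the condition, the entries reserved for the upper layers remain 0 and their operators are never invoked, which is the source of the efficiency gain.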

In the present application, extracting the target feature can be understood as the feature value calculated by the feature operator satisfies the first condition.

The first condition may be a threshold value.

Exemplarily, when the eigenvalues in the eigenvalue vector are all greater than a threshold, the first condition is considered satisfied and the data feature is captured.

The first condition may also be an interval.

Exemplarily, when the eigenvalues in the eigenvalue vector are all within the interval, the first condition is considered satisfied and the data feature is captured.

The first condition may be any preset condition, and the embodiment of the present application does not limit this.
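The two exemplary forms of the first condition can be written as small predicates, as sketched below. The function names are hypothetical, and treating the interval as closed at both ends is an assumption, since the text does not specify whether the bounds are inclusive.

```python
def threshold_condition(threshold):
    """First condition as a threshold: every eigenvalue must exceed it."""
    def check(values):
        return all(v > threshold for v in values)
    return check


def interval_condition(low, high):
    """First condition as an interval: every eigenvalue must lie in [low, high]."""
    def check(values):
        return all(low <= v <= high for v in values)
    return check
```

Any other preset condition can be supplied in the same way, as a callable that takes the eigenvalues of a layer and returns whether the target feature is considered captured.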

It should be understood that the first layer of operators is the lowest complexity operator layer, the second layer of operators is the operator layer with higher complexity than the first layer of operators, and if there is a third or fourth layer, the complexity increases layer by layer. Feature extraction starts from the lowest complexity operator layer. If the lower layer operator can capture the target feature, the feature extraction ends and no further calculation is required. Otherwise, it is necessary to go up layer by layer until the target feature is captured.

The following takes the feature extraction module shown in FIG. 4 as an example to explain in detail how the step-by-step feature extraction framework determines the target feature vector.

The metadata is first calculated by the ultra-lightweight operator in this module to evaluate whether the data features can be captured. If the target features can be captured, the feature extraction process is completed and the eigenvalue vector is returned. If the target features are not captured, the lightweight operator continues to calculate. If the target features are captured, the feature extraction process is completed and the eigenvalue vector is returned. If the target features are not captured, the heavy operator continues to calculate and capture the target features.

S540: The operator recommendation module determines a first operator set based on the target feature, where the first operator set is used to perform data analysis on the first data set.

The operator recommendation module is specifically configured to determine a first operator set based on a reference feature set and the target feature.

The reference feature set is a set of feature values calculated based on the training data set using a preset operator set.

Fig. 6 is a flowchart of an operator recommendation proposed in an embodiment of the present application. The operator recommendation process is described in detail in conjunction with Fig. 6.

Step a: All training data sets are calculated using all preset operator sets T to obtain the eigenvalues corresponding to each operator set. The eigenvalues corresponding to all operator sets constitute the reference feature set.

In this application, the preset operator set is denoted T, where m is the number of operator sets:
$$T = (t_1, t_2, t_3, \ldots, t_m)$$

The preset operator set includes part or all of the operators in at least two layers of operators in the feature extraction module.

It should be understood that the data in the training data set has been labeled, that is, each training set in the training data set has a corresponding operator set, so a mapping relationship matrix between the operator set and the eigenvalue can be constructed.

Here, R is the mapping relationship matrix between operator sets and eigenvalues, and its i-th row is:
$$R_i = (r_{c_1 t_i}, r_{c_2 t_i}, r_{c_3 t_i}, \ldots, r_{c_n t_i})$$

where R_i is the eigenvector corresponding to operator set t_i.
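The construction of the reference feature set from labeled training data can be sketched as follows. Averaging the eigenvalue vectors of all training sets that share the same operator-set label is an assumption of this example; the application only states that a mapping between operator sets and eigenvalues is constructed. The function and parameter names are hypothetical.

```python
def build_reference_set(training_sets, operators):
    """Build a mapping from operator-set label to reference eigenvalue vector.

    training_sets: list of (data, label) pairs, label naming the operator set.
    operators:     list of callables, one per feature operator c_1..c_n.
    """
    sums, counts = {}, {}
    for data, label in training_sets:
        vector = [op(data) for op in operators]
        if label not in sums:
            sums[label] = [0.0] * len(operators)
            counts[label] = 0
        sums[label] = [s + v for s, v in zip(sums[label], vector)]
        counts[label] += 1
    # Assumed aggregation: one averaged row R_i per operator set t_i.
    return {label: [s / counts[label] for s in total]
            for label, total in sums.items()}
```

Each value of the returned mapping plays the role of one row R_i, and the mapping as a whole plays the role of the reference feature set.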

Step b: Determine the recommended operator result based on the mapping relationship matrix between the operator set and the eigenvalues in combination with the metadata.

Specifically, the operator recommendation module can determine the similarity between the target feature and the reference feature in the reference feature set. When the similarity between the target feature and the first reference feature is greater than the similarity threshold, the first operator set corresponding to the first reference feature is determined to be the operator set of the target feature, the first operator set belongs to the preset operator set, and the first reference feature belongs to the reference feature set.

It should be understood that the similarity between the target feature and the first reference feature is negatively correlated with the distance between the target feature and the first reference feature. That is, the greater the similarity of the two features, the smaller the distance, and the smaller the similarity, the greater the distance.

The features in the present application may be in the form of feature vectors, and the following description will be given using feature vectors as an example.

That is, the distance between the target feature vector and the first reference feature vector may be determined first, and the similarity may be determined based on the obtained distance.

The distance between the target feature vector and the first reference feature vector can be obtained in a variety of ways, which is not limited in this embodiment of the present application.

In a possible implementation, the newly selected data (metadata) undergoes step-by-step feature extraction to obtain a target feature vector C_u, and the distance is calculated as:
$$u_i = DT(R_i, C_u)$$

where u_i is the distance between the target feature vector C_u and the reference feature vector R_i, and DT is the distance function used.

In an optional manner, the first operator set corresponding to the reference feature vector with the shortest distance may be determined as the operator set of the target feature vector.

In another optional implementation, a similarity threshold is preset, and when the similarity reaches the threshold, the first operator set corresponding to the first reference feature vector can be considered as the operator set of the target feature vector.
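Both optional manners can be combined in one sketch: the reference vector with the shortest distance is selected, and, if a similarity threshold is given, the match is accepted only when the similarity exceeds it. Euclidean distance for DT and the conversion similarity = 1 / (1 + distance) are assumptions; the application only requires that similarity be negatively correlated with distance.

```python
import math


def recommend(target, reference, similarity_threshold=None):
    """Return the operator-set label whose reference vector is closest to target.

    reference: label -> reference feature vector R_i.
    Returns None if there is no reference, or if a threshold is given and
    the best similarity does not exceed it.
    """
    best_label, best_distance = None, math.inf
    for label, vector in reference.items():
        distance = math.dist(target, vector)     # u_i = DT(R_i, C_u)
        if distance < best_distance:
            best_label, best_distance = label, distance
    if best_label is None:
        return None
    similarity = 1.0 / (1.0 + best_distance)     # assumed mapping
    if similarity_threshold is not None and similarity <= similarity_threshold:
        return None
    return best_label
```

Calling the function without a threshold corresponds to the shortest-distance manner, and passing similarity_threshold corresponds to the threshold manner.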

In the present application, the feature extraction module and the operator recommendation module can be implemented by software or by hardware. As an example, the implementation of the feature extraction module is described below. Similarly, the implementation of the operator recommendation module can refer to the implementation of the feature extraction module.

In this application, "module" is used as an example of a software functional unit. The feature extraction module and the operator recommendation module may include code running on a computing instance. The computing instance may be at least one of a physical host (computing device), a virtual machine, a container, and other computing devices. Furthermore, there may be one or more such computing devices. For example, the feature extraction module may include code running on multiple hosts/virtual machines/containers. It should be noted that the multiple hosts/virtual machines/containers used to run the code can be distributed in the same region or in different regions. The multiple hosts/virtual machines/containers used to run the code can be distributed in the same availability zone (AZ) or in different AZs, and each AZ includes one data center or multiple data centers with close geographical locations. Usually, one region can include multiple AZs.

Similarly, the multiple hosts/virtual machines/containers used to run the code can be distributed in the same virtual private cloud (VPC) or in multiple VPCs. Usually, a VPC is set up in a region. For communication between two VPCs in the same region, and for cross-region communication between VPCs in different regions, a communication gateway must be set up in each VPC, and interconnection between the VPCs is achieved through the communication gateways.

As an example of a hardware functional unit, the feature extraction module may include at least one computing device, such as a server, etc. Alternatively, the feature extraction module may also be a device implemented using ASIC or PLD, etc. The PLD may be implemented using CPLD, FPGA, GAL or any combination thereof.

The multiple computing devices included in the feature extraction module can be distributed in the same region or in different regions. The multiple computing devices included in the feature extraction module can be distributed in the same AZ or in different AZs. Similarly, the multiple computing devices included in the feature extraction module can be distributed in the same VPC or in multiple VPCs. The multiple computing devices can be any combination of computing devices such as servers, ASICs, PLDs, CPLDs, FPGAs, and GALs.

The present application also provides a computing device 100. As shown in FIG. 7 , the computing device 100 includes: a bus 102, a processor 104, a memory 106, and a communication interface 108. The processor 104, the memory 106, and the communication interface 108 communicate with each other through the bus 102. The computing device 100 may be a server or a terminal device. It should be understood that the present application does not limit the number of processors and memories in the computing device 100.

The bus 102 may be a peripheral component interconnect (PCI) bus or an extended industry standard architecture (EISA) bus, etc. The bus may be divided into an address bus, a data bus, a control bus, etc. For ease of representation, the bus is represented by only one line in FIG. 7, but this does not mean that there is only one bus or one type of bus. The bus 102 may include a path for transmitting information between various components of the computing device 100 (e.g., the memory 106, the processor 104, the communication interface 108).

The processor 104 may include any one or more processors such as a central processing unit (CPU), a graphics processing unit (GPU), a microprocessor (MP) or a digital signal processor (DSP).

The memory 106 may include a volatile memory, such as a random access memory (RAM). The memory 106 may also include a non-volatile memory, such as a read-only memory (ROM), a flash memory, a hard disk drive (HDD), or a solid state drive (SSD).

The memory 106 stores executable program codes, and the processor 104 executes the executable program codes to respectively implement the functions of the aforementioned computing engine, database, feature extraction module, and operator recommendation module, thereby implementing the method of providing data analysis. That is, the memory 106 stores instructions for executing data analysis.

The communication interface 108 uses a transceiver module such as, but not limited to, a network interface card or a transceiver to implement communication between the computing device 100 and other devices or a communication network.

The embodiment of the present application also provides a computing device cluster. The computing device cluster includes at least one computing device. The computing device can be a server, such as a central server, an edge server, or a local server in a local data center. In some embodiments, the computing device can also be a terminal device such as a desktop computer, a laptop computer, or a smart phone.

As shown in Fig. 8, the computing device cluster includes at least one computing device 100. The memory 106 in one or more computing devices 100 in the computing device cluster may store the same instructions for performing data analysis.

In some possible implementations, the memory 106 of one or more computing devices 100 in the computing device cluster may also store partial instructions for performing data analysis. In other words, the combination of one or more computing devices 100 may jointly execute instructions for performing data analysis.

It should be noted that the memory 106 in different computing devices 100 in the computing device cluster can store different instructions, and can implement the functions of one or more modules among the computing engine, database, feature extraction module and operator recommendation module.

In some possible implementations, one or more computing devices in a computing device cluster may be connected via a network. The network may be a wide area network or a local area network, etc. FIG. 9 shows a possible implementation. As shown in FIG. 9, two computing devices 100A and 100B are connected via a network. Specifically, each computing device is connected to the network via its communication interface. In this type of possible implementation, the memory 106 in the computing device 100A stores instructions for executing the functions of the computing engine and the database. At the same time, the memory 106 in the computing device 100B stores instructions for executing the functions of the feature extraction module and the operator recommendation module.

It should be understood that the functions of the computing device 100A shown in FIG. 9 may also be performed by multiple computing devices 100. The functions of the computing device 100B may also be performed by multiple computing devices 100.

The embodiment of the present application also provides another computing device cluster. The connection relationship between the computing devices in the computing device cluster can be similar to the connection mode of the computing device cluster described in FIG. 8 and FIG. 9. The difference is that the memory 106 in one or more computing devices 100 in the computing device cluster can store the same instructions for performing data analysis.

It should be noted that the memory 106 in different computing devices 100 in the computing device cluster may store different instructions for executing part of the functions of the data analysis system. That is, the instructions stored in the memory 106 in different computing devices 100 may implement the functions of one or more modules among the computing engine, database, feature extraction module, and operator recommendation module.

The embodiment of the present application also provides a computer program product including instructions. The computer program product may be software or a program product including instructions that can be run on a computing device or stored in any available medium. When the computer program product is run on at least one computing device, the at least one computing device executes method 500.

The embodiment of the present application also provides a computer-readable storage medium. The computer-readable storage medium can be any available medium on which a computing device can store data, or a data storage device, such as a data center, that contains one or more available media. The available medium can be a magnetic medium (e.g., a floppy disk, a hard disk, a tape), an optical medium (e.g., a DVD), or a semiconductor medium (e.g., a solid state drive). The computer-readable storage medium includes instructions that instruct the computing device to execute method 500.

Those skilled in the art can clearly understand that, for the convenience and brevity of description, the specific working processes of the systems, devices and units described above can refer to the corresponding processes in the aforementioned method embodiments and will not be repeated here.

Finally, it should be noted that the above embodiments are only used to illustrate the technical solutions of the present invention, rather than to limit it. Although the present invention has been described in detail with reference to the aforementioned embodiments, those skilled in the art should understand that they can still modify the technical solutions described in the aforementioned embodiments, or make equivalent replacements for some of the technical features therein. However, these modifications or replacements do not cause the essence of the corresponding technical solutions to deviate from the protection scope of the technical solutions of the embodiments of the present invention.

The execution order of each process should be determined by its function and internal logic, and should not constitute any limitation on the implementation process of the embodiments of the present application.

In the several embodiments provided in the present application, it should be understood that the disclosed systems, devices and methods can be implemented in other ways. For example, the device embodiments described above are only schematic. For example, the division of the units is only a logical function division, and there may be other division methods in actual implementation; for instance, multiple units or components can be combined or integrated into another system, or some features can be ignored or not executed. In addition, the mutual coupling, direct coupling, or communication connection shown or discussed may be indirect coupling or communication connection through some interfaces, devices, or units, and may be electrical, mechanical, or in other forms.

The units described as separate components may or may not be physically separated, and the components shown as units may or may not be physical units, that is, they may be located in one place or distributed on multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.

In addition, each functional unit in each embodiment of the present application may be integrated into one processing unit, or each unit may exist physically separately, or two or more units may be integrated into one unit.

If the functions are implemented in the form of software functional units and sold or used as independent products, they can be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present application in essence, or the part that contributes to the prior art, or a part of the technical solution, can be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions for causing a computer device (which can be a personal computer, a server, or a network device, etc.) to perform all or part of the steps of the methods described in each embodiment of the present application. The aforementioned storage media include: a USB flash drive, a mobile hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk or an optical disk, and other media that can store program codes.

The above is only a specific implementation of the present application, but the protection scope of the present application is not limited thereto. Any person skilled in the art who is familiar with the present technical field can easily think of changes or substitutions within the technical scope disclosed in the present application, which should be included in the protection scope of the present application. Therefore, the protection scope of the present application should be based on the protection scope of the claims.

Claims

A data analysis system, characterized in that the system comprises:

a feature extraction module, configured to: receive a first data set determined by a user from a data system;

determine metadata corresponding to data of a first data subset according to the first data set, where the first data subset is the first data set or a subset of the first data set; and

determine a target feature of the metadata based on at least two layers of feature operators, where the at least two layers of feature operators include a first layer of operators and a second layer of operators, the first layer of operators is used to extract the target feature based on the metadata, the second layer of operators is used to extract the target feature based on the metadata when the first layer of operators fails to extract the target feature, and the complexity of the second layer of operators is higher than the complexity of the first layer of operators; and

an operator recommendation module, configured to determine a first operator set based on the target feature, where the first operator set is used for analyzing the first data set.
2. The system according to claim 1, characterized in that the first layer of operators is specifically used to perform a calculation on the metadata to obtain a first feature value, and when the first feature value satisfies a first condition, the first layer of operators extracts the target feature; and when the first feature value does not satisfy the first condition, the second layer of operators is specifically used to perform a calculation on the metadata to obtain a second feature value, and when the second feature value satisfies the first condition, the second layer of operators extracts the target feature.
3. The system according to claim 1 or 2, characterized in that the operator recommendation module is specifically used to determine the first operator set based on a reference feature set and the target feature, where the reference feature set is a set of feature values calculated from a training data set using a preset operator set, and the preset operator set includes some or all of the operators of the at least two layers of feature operators.
4. The system according to claim 3, characterized in that the operator recommendation module is specifically used to determine the similarity between the target feature and reference features in the reference feature set, and when the similarity between the target feature and a first reference feature is greater than a similarity threshold, determine that the first operator set corresponding to the first reference feature is the operator set of the target feature, where the first operator set belongs to the preset operator set, and the first reference feature belongs to the reference feature set.
5. The system according to any one of claims 1 to 4, characterized in that the system further comprises an operator evaluation module, used to evaluate the complexity of a custom operator, where the complexity of the custom operator is used to determine the layer, among the at least two layers of feature operators in the feature extraction module, into which the custom operator is embedded.
6. A data analysis method in a cloud service system, characterized by comprising:

a feature extraction module receives a first data set determined by a user from a data system;

the feature extraction module determines, based on the first data set, metadata corresponding to data of a first data subset, where the first data subset is the first data set or a subset of the first data set;

the feature extraction module determines a target feature of the metadata based on at least two layers of feature operators, where the at least two layers of feature operators include a first layer of operators and a second layer of operators, the first layer of operators is used to extract the target feature based on the metadata, the second layer of operators is used to extract the target feature based on the metadata when the first layer of operators fails to extract the target feature, and the complexity of the second layer of operators is higher than the complexity of the first layer of operators; and

an operator recommendation module determines a first operator set based on the target feature, where the first operator set is used for analyzing the first data set.
7. The method according to claim 6, characterized in that the first layer of operators performs a calculation on the metadata to obtain a first feature value, and when the first feature value satisfies a first condition, the first layer of operators extracts the target feature; and when the first feature value does not satisfy the first condition, the second layer of operators performs a calculation on the metadata to obtain a second feature value, and when the second feature value satisfies the first condition, the second layer of operators extracts the target feature.
8. The method according to claim 6 or 7, characterized in that the operator recommendation module determining the first operator set based on the target feature comprises:

the operator recommendation module determines the first operator set based on a reference feature set and the target feature, where the reference feature set is a set of feature values calculated from a training data set using a preset operator set, and the preset operator set includes some or all of the operators of the at least two layers of feature operators.
9. The method according to claim 8, characterized in that the operator recommendation module determining the first operator set based on the reference feature set and the target feature comprises:

the operator recommendation module determines the similarity between the target feature and reference features in the reference feature set, and when the similarity between the target feature and a first reference feature is greater than a similarity threshold, determines that the first operator set corresponding to the first reference feature is the operator set of the target feature, where the first operator set belongs to the preset operator set, and the first reference feature belongs to the reference feature set.
10. The method according to any one of claims 6 to 9, characterized in that the method further comprises: an operator evaluation module evaluates the complexity of a custom operator, where the complexity of the custom operator is used to determine the layer, among the at least two layers of feature operators in the feature extraction module, into which the custom operator is embedded.
11. A computing device, comprising a processor and a memory, where the processor runs instructions in the memory, so that the computing device executes the method according to any one of claims 6 to 10.
12. A computing device cluster, characterized in that it includes at least one computing device, and each computing device includes a processor and a memory;

the processor of the at least one computing device is configured to execute instructions stored in the memory of the at least one computing device, so that the computing device cluster executes the method according to any one of claims 6 to 10.
13. A computer program product comprising instructions, wherein when the instructions are executed by a computer device cluster, the computer device cluster executes the method according to any one of claims 6 to 10.
14. A computer-readable storage medium, characterized in that it includes computer program instructions, wherein when the computer program instructions are executed by a computer device cluster, the computer device cluster executes the method according to any one of claims 6 to 10.
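
As a non-limiting illustration only, the layered feature extraction, the similarity-based operator recommendation, and the complexity-based placement of a custom operator recited in the claims above might be sketched in Python as follows. Every name, operator, threshold, and the similarity formula in this sketch is a hypothetical assumption made for illustration; none of them is specified by the present application, and the sketch does not limit the claims.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Sequence, Tuple

Metadata = Sequence[float]


@dataclass(frozen=True)
class FeatureOperator:
    """Hypothetical feature operator: maps metadata to one feature value."""
    name: str
    complexity: int  # assumed relative cost score, not defined by the application
    compute: Callable[[Metadata], float]


def extract_target_feature(
    metadata: Metadata,
    layers: List[List[FeatureOperator]],
    condition: Callable[[float], bool],
) -> Optional[float]:
    """Try layers in order of increasing complexity; a later layer runs only
    when no operator of the earlier layers produced a value satisfying the condition."""
    for layer in layers:
        for op in layer:
            value = op.compute(metadata)
            if condition(value):
                return value
    return None


def recommend_operators(
    target: float,
    references: List[Tuple[float, List[str]]],
    threshold: float,
) -> Optional[List[str]]:
    """Return the operator set of the most similar reference feature whose
    similarity exceeds the threshold (similarity formula is an assumption)."""
    best_ops, best_sim = None, threshold
    for ref_value, ops in references:
        sim = 1.0 / (1.0 + abs(target - ref_value))
        if sim > best_sim:
            best_ops, best_sim = ops, sim
    return best_ops


def place_custom_operator(
    op: FeatureOperator, layers: List[List[FeatureOperator]], boundary: int
) -> int:
    """Embed a custom operator into layer 0 or layer 1 by its complexity score."""
    index = 0 if op.complexity <= boundary else 1
    layers[index].append(op)
    return index


def value_range(m: Metadata) -> float:
    return max(m) - min(m)


def variance(m: Metadata) -> float:
    mean = sum(m) / len(m)
    return sum((x - mean) ** 2 for x in m) / len(m)


# Hypothetical two-layer setup: a cheap operator first, a costlier one second.
layers = [
    [FeatureOperator("range", 1, value_range)],
    [FeatureOperator("variance", 3, variance)],
]
metadata = [1.0, 2.0, 3.0, 4.0]


def condition(v: float) -> bool:
    return v < 2.0  # assumed "first condition"


# The range (3.0) fails the condition, so the second layer supplies the feature.
target = extract_target_feature(metadata, layers, condition)  # 1.25

# Hypothetical reference feature set: (feature value, associated operator set).
references = [(1.0, ["mean", "stddev"]), (10.0, ["fft"])]
recommended = recommend_operators(target, references, threshold=0.5)

# A costly custom operator lands in the second (higher-complexity) layer.
custom = FeatureOperator("median", 5, lambda m: sorted(m)[len(m) // 2])
layer_index = place_custom_operator(custom, layers, boundary=2)
```

In this sketch the second layer is evaluated only after the first layer fails the condition, which reflects the intent that costlier operators run only when needed. The scalar similarity measure is a placeholder for illustration, and any other similarity measure over feature values could be substituted.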