WO2023029275A1 - Data association analysis method, device, computer equipment and storage medium - Google Patents

Data association analysis method, device, computer equipment and storage medium

Info

Publication number
WO2023029275A1
WO2023029275A1 PCT/CN2021/136435 CN2021136435W WO2023029275A1 WO 2023029275 A1 WO2023029275 A1 WO 2023029275A1 CN 2021136435 W CN2021136435 W CN 2021136435W WO 2023029275 A1 WO2023029275 A1 WO 2023029275A1
Authority
WO
WIPO (PCT)
Prior art keywords
data
target
heterogeneous data
metadata information
heterogeneous
Prior art date
Application number
PCT/CN2021/136435
Other languages
English (en)
French (fr)
Inventor
李成森
王诗琦
王广林
Original Assignee
广州广电运通金融电子股份有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 广州广电运通金融电子股份有限公司
Publication of WO2023029275A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/901 Indexing; Data structures therefor; Storage structures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/25 Integrating or interfacing systems involving database management systems
    • G06F16/254 Extract, transform and load [ETL] procedures, e.g. ETL data flows in data warehouses
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/28 Databases characterised by their database models, e.g. relational or object models
    • G06F16/283 Multi-dimensional databases or data warehouses, e.g. MOLAP or ROLAP
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/28 Databases characterised by their database models, e.g. relational or object models
    • G06F16/284 Relational databases
    • G06F16/288 Entity relationship models
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Definitions

  • the present application relates to the technical field of data analysis, in particular to a data association analysis method, device, computer equipment and storage medium.
  • Data analysis is usually handled by building a unified data warehouse to collect the data and then analyzing the data on top of that warehouse.
  • Unstructured data such as Excel files and reports is difficult to associate with the data in the database, so data association analysis is inefficient.
  • This approach requires analysts to master SQL syntax, which is a barrier for those who only know how to analyze spreadsheets such as Excel.
  • a data association analysis method comprising:
  • extracting metadata information of the heterogeneous data in each heterogeneous data source, forming an intermediate table containing the metadata information of the various heterogeneous data, and storing the intermediate table in the memory database;
  • the target heterogeneous data corresponding to the target metadata information is obtained from the corresponding heterogeneous data source and stored in the memory database;
  • the method also includes:
  • displaying a user interface; the user interface is used by the user to deploy the association analysis process
  • the components include at least the data component
  • the corresponding components displayed in the user interface are connected to form the association analysis process.
  • there are multiple intermediate tables; using the metadata information in the memory database that is associated with the data components in the association analysis process as the target metadata information includes:
  • the metadata information selected by the user in the target intermediate table is used as the target metadata information.
  • the target heterogeneous data corresponding to the target metadata information is obtained from the corresponding heterogeneous data source and stored in the memory database, including:
  • the target heterogeneous data is acquired and stored in the memory database.
  • the method also includes:
  • the at least two types of heterogeneous data are respectively stored in different hard disk databases to obtain the at least two types of heterogeneous data sources.
  • the at least two types of heterogeneous data sources include structured data sources and unstructured data sources; the at least two types of heterogeneous data include structured data and unstructured data.
  • the method also includes:
  • a data association analysis device comprising:
  • a data determination module configured to determine at least two types of heterogeneous data sources
  • the data extraction module is used to extract metadata information of heterogeneous data in various heterogeneous data sources, form an intermediate table containing metadata information of various heterogeneous data, and store the intermediate table in the memory database;
  • the data association module is used to use the metadata information associated with the data components in the association analysis process in the memory database as the target metadata information;
  • the data storage module is used to obtain target heterogeneous data corresponding to the target metadata information from corresponding heterogeneous data sources according to the target metadata information and store them in the memory database;
  • the data acquisition module is used to run the association analysis process to analyze the target heterogeneous data in the memory database, and obtain the data association analysis results.
  • a computer device comprising a memory and a processor, the memory stores a computer program, and the processor implements the following steps when executing the computer program:
  • a computer-readable storage medium on which a computer program is stored, and when the computer program is executed by a processor, the following steps are implemented:
  • The above data association analysis method, device, computer equipment, and storage medium determine at least two types of heterogeneous data sources; extract metadata information of the heterogeneous data in each heterogeneous data source to form an intermediate table containing the metadata information of the various types of heterogeneous data, and store the intermediate table in the memory database; use the metadata information in the memory database that is associated with the data components in the association analysis process as the target metadata information; obtain, according to the target metadata information, the target heterogeneous data corresponding to the target metadata information from the corresponding heterogeneous data source and store it in the memory database; and run the association analysis process to analyze the target heterogeneous data in the memory database to obtain the data association analysis result.
  • This scheme determines multiple heterogeneous data sources that store the corresponding heterogeneous data, extracts the metadata information of the heterogeneous data stored in each heterogeneous data source, stores the metadata information in the corresponding intermediate tables, and stores the resulting intermediate tables in the memory database. The metadata information in the memory database is then associated with the data components in the association analysis process and used as the target metadata information, and the heterogeneous data corresponding to the target metadata information is used as the target heterogeneous data. The target heterogeneous data is obtained from the corresponding heterogeneous data source and stored in the memory database, and the association analysis process is run to analyze the target heterogeneous data in memory and obtain the data association analysis result, which enables fast association analysis of heterogeneous data and improves the efficiency of data association analysis.
  • FIG. 1 is an application environment diagram of the data association analysis method in an embodiment
  • FIG. 2 is a schematic flow chart of a data association analysis method in an embodiment
  • FIG. 3 is a schematic diagram of an interface for forming an association analysis process in an embodiment
  • FIG. 4 is a schematic flow diagram of an association analysis process in another embodiment
  • FIG. 5 is a structural block diagram of a data association analysis device in an embodiment
  • FIG. 6 is an internal block diagram of a computer device in one embodiment.
  • the user-related information including but not limited to user equipment information, user personal information, etc.
  • data and its processing including but not limited to data for display, analysis, etc.
  • this application also provides a corresponding user authorization entry for the user to choose to authorize or choose to refuse.
  • The data association analysis method provided in this application can be applied to the application scenario shown in FIG. 1. The application scenario can include the terminal 100 and multiple heterogeneous data sources, and the terminal 100 can be communicatively connected to each heterogeneous data source.
  • Specifically, the terminal 100 determines at least two types of heterogeneous data sources. The terminal 100 then extracts the metadata information of the heterogeneous data in each heterogeneous data source, forms an intermediate table containing the metadata information of the various types of heterogeneous data, and stores the intermediate table in the in-memory database. Next, the terminal 100 uses the metadata information in the in-memory database that is associated with the data components in the association analysis process as the target metadata information, obtains the target heterogeneous data corresponding to the target metadata information from the corresponding heterogeneous data source, and stores it in the in-memory database. Finally, the terminal 100 runs the association analysis process to analyze the target heterogeneous data in the in-memory database and obtain a data association analysis result.
  • the terminal 100 may be, but not limited to, various personal computers, notebook computers,
  • a data association analysis method is provided.
  • the method is applied to the terminal 100 shown in FIG. 1 as an example.
  • the method may include the following steps:
  • Step S201: determine at least two types of heterogeneous data sources.
  • Heterogeneous data refers to data whose structures differ from one another. At least two types of heterogeneous data sources refers to two or more types of data sources used to store such data; for example, the at least two types of heterogeneous data sources can include a database that stores structured data and a database that stores unstructured data.
  • the terminal 100 may first determine at least two types of heterogeneous data sources for storing corresponding heterogeneous data that the user needs to analyze.
  • The at least two types of heterogeneous data sources can be obtained through the following steps: obtaining at least two types of heterogeneous data uploaded by the user, and storing the at least two types of heterogeneous data in different hard disk databases to obtain the at least two types of heterogeneous data sources.
  • The hard disk database refers to a database stored on a hard disk.
  • Specifically, the terminal 100 acquires the at least two types of heterogeneous data uploaded by the user and stores them in different hard disk databases to obtain the at least two types of heterogeneous data sources.
  • For example, the user can upload structured data and unstructured data. The terminal 100 receives the structured data uploaded by the user in an ETL (Extract-Transform-Load) manner and stores it in the structured data source on the hard disk; it also receives the unstructured data uploaded by the user, in which the field names and data content of the data columns are arranged according to the system's template file, and stores it in the unstructured data source on the hard disk.
  • By storing the at least two types of heterogeneous data uploaded by the user in different hard disk databases to obtain at least two types of heterogeneous data sources, the technical solution of this embodiment lets the terminal 100 determine the corresponding heterogeneous data sources according to the type of the heterogeneous data. The terminal 100 can therefore handle a wider variety of heterogeneous data, which helps it perform data association analysis on more types of heterogeneous data.
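  • As an illustration of storing the two kinds of uploads in separate on-disk databases, the following is a minimal sketch rather than the applicant's implementation. It assumes that SQLite files stand in for the "hard disk databases"; the function name `store_uploads`, the file names, and the table layouts are all hypothetical.

```python
import sqlite3
import tempfile
from contextlib import closing
from pathlib import Path


def store_uploads(base_dir, structured_rows, documents):
    """Store structured rows and raw unstructured files in separate on-disk databases.

    Assumption: one SQLite file per heterogeneous data source.
    """
    paths = {
        "structured": Path(base_dir) / "structured.db",
        "unstructured": Path(base_dir) / "unstructured.db",
    }

    # Structured data arrives as typed rows (the ETL path described in the text).
    with closing(sqlite3.connect(str(paths["structured"]))) as conn:
        conn.execute("CREATE TABLE IF NOT EXISTS orders (id INTEGER, amount REAL)")
        conn.executemany("INSERT INTO orders VALUES (?, ?)", structured_rows)
        conn.commit()

    # Unstructured uploads are kept as named raw text until they are parsed.
    with closing(sqlite3.connect(str(paths["unstructured"]))) as conn:
        conn.execute("CREATE TABLE IF NOT EXISTS documents (name TEXT, content TEXT)")
        conn.executemany("INSERT INTO documents VALUES (?, ?)", documents)
        conn.commit()

    return paths


with tempfile.TemporaryDirectory() as tmp:
    sources = store_uploads(
        tmp,
        structured_rows=[(1, 9.5), (2, 20.0)],
        documents=[("sales.csv", "name,city\nAnn,Guangzhou\n")],
    )
    source_kinds = sorted(sources)
    with closing(sqlite3.connect(str(sources["structured"]))) as conn:
        order_count = conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
```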
  • Step S202: extract metadata information of the heterogeneous data in each heterogeneous data source, form an intermediate table containing the metadata information of the various heterogeneous data, and store the intermediate table in the memory database.
  • the metadata information refers to the data stored in the intermediate table containing the key information of the corresponding heterogeneous data, which can be used to associate with the corresponding heterogeneous data.
  • For example, if the metadata information is likened to the table of contents of a book, the terminal 100 can use the table of contents (that is, the metadata information) to find the corresponding body text in the book (that is, to associate with the corresponding heterogeneous data).
  • The intermediate table is a data table stored in the memory database and is mainly used to store the metadata information corresponding to the heterogeneous data; the memory database refers to a database held in memory.
  • For example, an intermediate table may be a data table in memory that stores a book's table of contents (that is, the metadata information).
  • Specifically, the terminal 100 extracts the metadata information corresponding to the heterogeneous data stored in each heterogeneous data source, stores the extracted metadata information in the corresponding intermediate tables, and stores the resulting intermediate tables of metadata information in the in-memory database.
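  • Step S202 can be sketched as follows, under the assumption that the metadata information consists of table and column descriptions and that SQLite serves as both the source and the in-memory database. The `intermediate` table layout and the function name `build_intermediate_table` are hypothetical; the source connection is also in-memory here only to keep the example self-contained.

```python
import sqlite3


def build_intermediate_table(source, source_name, mem):
    """Copy table and column descriptions (not row data) into the in-memory database."""
    mem.execute(
        "CREATE TABLE IF NOT EXISTS intermediate "
        "(source TEXT, table_name TEXT, column_name TEXT, column_type TEXT)"
    )
    tables = [
        row[0]
        for row in source.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        )
    ]
    for table in tables:
        # PRAGMA table_info yields (cid, name, type, notnull, dflt_value, pk) per column.
        columns = source.execute(f'PRAGMA table_info("{table}")').fetchall()
        for _cid, column, column_type, *_rest in columns:
            mem.execute(
                "INSERT INTO intermediate VALUES (?, ?, ?, ?)",
                (source_name, table, column, column_type),
            )


# Stand-in for a heterogeneous data source that would normally live on disk.
source = sqlite3.connect(":memory:")
source.execute("CREATE TABLE orders (id INTEGER, amount REAL)")
source.execute("INSERT INTO orders VALUES (1, 9.5)")

mem = sqlite3.connect(":memory:")
build_intermediate_table(source, "structured", mem)
metadata = mem.execute(
    "SELECT table_name, column_name, column_type FROM intermediate ORDER BY column_name"
).fetchall()
```

  • Only the descriptions of `orders` reach the in-memory database; the row `(1, 9.5)` stays in the source until it is needed.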
  • Step S203: use the metadata information in the memory database that is associated with the data components in the association analysis process as the target metadata information.
  • the association analysis process refers to the process of performing association analysis on heterogeneous data, and the association analysis process may be composed of one or more data components and one or more non-data components connected to each other.
  • Data components are components that represent data in the association analysis process, such as input data set components and output data set components.
  • Non-data components are components that perform operations on data in the association analysis process, such as insert formula components and merge column components.
  • Target metadata information refers to the metadata information associated with the data components in the association analysis process.
  • the terminal 100 can associate the metadata information in the memory database with the data components in the association analysis process, and use the metadata information associated with the data components as the target metadata information.
  • For example, if the input data set component needs to be associated with the table-of-contents entry for the third chapter of a certain book, the terminal 100 may use the metadata information in the memory database corresponding to that entry as the target metadata information.
  • The above step S203 specifically includes: receiving an intermediate table configuration instruction triggered by the user for a data component in the association analysis process; in response to the intermediate table configuration instruction, displaying multiple intermediate tables; taking the intermediate table selected by the user from the multiple intermediate tables as the target intermediate table and configuring the target intermediate table for the data component; and using the metadata information selected by the user in the target intermediate table as the target metadata information.
  • The intermediate table configuration instruction is an instruction that directs the terminal 100 to display multiple intermediate tables so that the user can select one as the target intermediate table to configure for the data component. The target intermediate table is the intermediate table that the user selects from the multiple intermediate tables to configure for the data component.
  • Specifically, the user can trigger an intermediate table configuration instruction for one or more data components in the association analysis process. The terminal 100 receives and responds to the instruction by displaying multiple intermediate tables, from which the user selects the required one. The terminal 100 takes the intermediate table selected by the user as the target intermediate table and configures it for the data component. The user can then select one or more items of metadata information in the target intermediate table, and the terminal 100 takes the selected metadata information as the target metadata information.
  • The technical solution of this embodiment enables the terminal 100 to accurately record and visually present the target intermediate tables and target metadata information configured for the data components during data association analysis. This makes it faster to input or replace the heterogeneous data to be analyzed, which facilitates process management of the data association analysis and improves analysis efficiency.
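  • The configuration of a data component in step S203 might be modeled as below. This is a sketch under stated assumptions: the `DataComponent` class, the `configure_component` function, and the dictionary of intermediate tables are hypothetical names introduced only for illustration, and the user's selections are passed in as plain arguments instead of arriving through a user interface.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class DataComponent:
    """A data component in the association analysis process (hypothetical model)."""
    name: str
    target_table: Optional[str] = None
    target_metadata: List[str] = field(default_factory=list)


def configure_component(component, intermediate_tables, table_name, selected):
    """Bind the chosen intermediate table and metadata items to the component."""
    if table_name not in intermediate_tables:
        raise KeyError(f"unknown intermediate table: {table_name}")
    available = intermediate_tables[table_name]
    missing = [item for item in selected if item not in available]
    if missing:
        raise ValueError(f"not in {table_name}: {missing}")
    component.target_table = table_name
    component.target_metadata = list(selected)
    return component


# Intermediate tables as they might be listed to the user: name -> metadata items.
intermediate_tables = {"orders": ["id", "amount"], "customers": ["id", "name"]}
input_dataset = configure_component(
    DataComponent("input_dataset"), intermediate_tables, "orders", ["amount"]
)
```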
  • Step S204: according to the target metadata information, obtain the target heterogeneous data corresponding to the target metadata information from the corresponding heterogeneous data source and store it in the memory database.
  • the target heterogeneous data refers to the heterogeneous data originally stored in the heterogeneous data source corresponding to the target metadata information.
  • For example, if the metadata information is a book's table of contents, the target heterogeneous data can be the content of the third chapter of the book.
  • The above step S204 specifically includes: determining, from the at least two types of heterogeneous data sources, the heterogeneous data source to which the target metadata information belongs as the target heterogeneous data source; taking the heterogeneous data in the target heterogeneous data source that corresponds to the target metadata information as the target heterogeneous data; and obtaining the target heterogeneous data and storing it in the memory database.
  • the target heterogeneous data source refers to the heterogeneous data source storing the target heterogeneous data among the aforementioned at least two types of heterogeneous data sources, to which the target metadata information belongs.
  • Specifically, the terminal 100 searches the multiple types of heterogeneous data sources for the heterogeneous data source to which the target metadata information belongs and takes it as the target heterogeneous data source. The terminal 100 then searches the target heterogeneous data source for the heterogeneous data corresponding to the target metadata information and takes it as the target heterogeneous data. Finally, the terminal 100 acquires the target heterogeneous data and stores it in the memory database.
  • The technical solution of this embodiment determines the target heterogeneous data source from the target metadata information, then determines the target heterogeneous data and stores it in the memory database. The terminal 100 can thus accurately obtain the target heterogeneous data required by the user, which helps improve the accuracy of the data association analysis results obtained after running the association analysis process.
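  • Step S204 can be sketched as an on-demand copy from the source that owns the target metadata into the in-memory database. The sketch assumes SQLite connections on both sides and a hypothetical `target` dictionary naming the source, table, and columns; `load_target_data` is not a name from the application.

```python
import sqlite3


def load_target_data(sources, mem, target):
    """Copy only the columns named by the target metadata into the in-memory database."""
    source = sources[target["source"]]  # the target heterogeneous data source
    table = target["table"]
    columns = ", ".join(f'"{name}"' for name in target["columns"])
    rows = source.execute(f'SELECT {columns} FROM "{table}"').fetchall()

    mem.execute(f'CREATE TABLE "{table}" ({columns})')
    placeholders = ", ".join("?" for _ in target["columns"])
    mem.executemany(f'INSERT INTO "{table}" VALUES ({placeholders})', rows)
    return len(rows)


# Stand-in for an on-disk structured data source.
structured = sqlite3.connect(":memory:")
structured.execute("CREATE TABLE orders (id INTEGER, amount REAL, note TEXT)")
structured.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)", [(1, 9.5, "a"), (2, 20.0, "b")]
)

mem = sqlite3.connect(":memory:")
target = {"source": "structured", "table": "orders", "columns": ["id", "amount"]}
loaded = load_target_data({"structured": structured}, mem, target)
in_memory_rows = mem.execute("SELECT id, amount FROM orders ORDER BY id").fetchall()
```

  • The `note` column is never copied, which reflects the idea that only data referenced by the target metadata information is brought into memory.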
  • Step S205: run the association analysis process to analyze the target heterogeneous data in the memory database and obtain the data association analysis result.
  • the data association analysis result refers to the data association analysis result obtained after performing association analysis on the target heterogeneous data through running the association analysis process.
  • Specifically, the terminal 100 can run the association analysis process under the user's instruction to analyze the target heterogeneous data in the memory database that is associated with the data components in the process; once the process finishes, the data association analysis result is obtained.
  • In the above data association analysis method, at least two types of heterogeneous data sources are determined; metadata information of the heterogeneous data in each heterogeneous data source is extracted to form an intermediate table containing the metadata information of the various heterogeneous data, and the intermediate table is stored in the memory database; the metadata information in the memory database that is associated with the data components in the association analysis process is used as the target metadata information; according to the target metadata information, the target heterogeneous data corresponding to the target metadata information is obtained from the corresponding heterogeneous data source and stored in the memory database; and the association analysis process is run to analyze the target heterogeneous data in the memory database to obtain the data association analysis result.
  • This scheme determines multiple heterogeneous data sources that store the corresponding heterogeneous data, extracts the metadata information of the heterogeneous data stored in each heterogeneous data source, stores the metadata information in the corresponding intermediate tables, and stores the resulting intermediate tables in the memory database. The metadata information in the memory database is then associated with the data components in the association analysis process and used as the target metadata information, and the heterogeneous data corresponding to the target metadata information is used as the target heterogeneous data. The association analysis process is then run to analyze the target heterogeneous data stored in the memory database and obtain the data association analysis result, which realizes rapid association analysis of heterogeneous data.
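  • Once the target heterogeneous data is in the memory database, one simple form an association analysis could take is a join across tables that originated in different sources. The sketch below assumes an in-memory SQLite database and hypothetical `orders` and `customers` tables; the application does not prescribe SQL joins as the analysis.

```python
import sqlite3

mem = sqlite3.connect(":memory:")

# Tables that would have been loaded from two different heterogeneous data sources.
mem.execute("CREATE TABLE orders (customer_id INTEGER, amount REAL)")
mem.execute("CREATE TABLE customers (customer_id INTEGER, name TEXT)")
mem.executemany("INSERT INTO orders VALUES (?, ?)", [(1, 10.0), (1, 5.0), (2, 7.5)])
mem.executemany("INSERT INTO customers VALUES (?, ?)", [(1, "Ann"), (2, "Bo")])

# Associate the two tables on their shared key and aggregate per customer.
analysis_result = mem.execute(
    """
    SELECT c.name, SUM(o.amount) AS total
    FROM orders AS o
    JOIN customers AS c ON c.customer_id = o.customer_id
    GROUP BY c.name
    ORDER BY c.name
    """
).fetchall()
```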
  • The above method can also form the association analysis process through the following steps: displaying the user interface; in response to the user's component selection instruction, placing the multiple components that the user selected from the component library on the user interface; and in response to the user's component connection instruction, connecting the corresponding components displayed in the user interface to form the association analysis process.
  • The user interface is used by the user to deploy the association analysis process. The component selection instruction is an instruction sent by the user to direct the terminal 100 to place the multiple components that the user selected from the component library on the user interface. The component library includes multiple components, which include at least data components. The component connection instruction is an instruction sent by the user to direct the terminal 100 to connect the corresponding components displayed in the user interface to form the association analysis process.
  • Specifically, the terminal 100 displays a user interface for the user to deploy the association analysis process and receives the component selection instruction sent by the user. In response to the component selection instruction, the terminal 100 places the multiple components that the user selected from the component library, including at least data components, on the user interface. The terminal 100 then receives the component connection instruction sent by the user and, in response, connects the corresponding components displayed in the user interface to form the association analysis process.
  • The step of forming the association analysis process may further include the following. The terminal 100 displays the user interface and, in response to the user's component selection instruction, places the multiple components that the user selected from the component library on the user interface. In response to the user's component connection instruction, the terminal 100 connects one or more data components displayed in the user interface to one or more non-data components with a line bearing a unidirectional arrow, forming an association analysis sub-process. In response to a sub-process run instruction triggered by the user for the non-data component or data component at the end of the arrow, the terminal 100 runs the association analysis sub-process, displays the analysis process data, forms a new intermediate table from the analysis process data, and stores it in the memory database. In response to further component connection instructions from the user, the terminal 100 connects one or more data components displayed in the user interface to one or more non-data components with lines bearing unidirectional arrows to form multiple association analysis sub-processes, and connects the multiple association analysis sub-processes.
  • The technical solution of this embodiment forms the association analysis process by selecting and connecting components on the user interface, so that the terminal 100 can accurately record and intuitively present the specific analysis steps applied to the target heterogeneous data during data association analysis. This facilitates process management of the data association analysis and improves analysis efficiency.
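  • Components joined by unidirectional arrows form a directed graph, so one plausible way to derive a run order is a topological sort. The sketch below uses `graphlib.TopologicalSorter` from the Python standard library (Python 3.9 or later); the component names and the `execution_order` helper are hypothetical, and the application does not state that a topological sort is used.

```python
from graphlib import TopologicalSorter

# Each pair is one arrow drawn by the user: (component at the tail, component at the head).
connections = [
    ("input_dataset", "merge_columns"),
    ("merge_columns", "insert_formula"),
    ("insert_formula", "output_dataset"),
]


def execution_order(connections):
    """Return the components in an order that respects every arrow."""
    predecessors = {}
    for tail, head in connections:
        predecessors.setdefault(tail, set())
        predecessors.setdefault(head, set()).add(tail)
    # static_order raises graphlib.CycleError if the arrows form a loop.
    return list(TopologicalSorter(predecessors).static_order())


order = execution_order(connections)
```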
  • Detection can also be performed while the association analysis process is running. The specific steps include: while the association analysis process is running, detecting its running status and obtaining the analysis process data generated during the run, and storing the running status and the analysis process data in the memory database.
  • The running status of the association analysis process describes how the process is running, for example whether its operation is abnormal. The analysis process data may include the data that each component in the association analysis process produces by analyzing the output data of the preceding connected component.
  • Specifically, while running the association analysis process to analyze the target heterogeneous data, the terminal 100 detects whether the running status of the process is abnormal and obtains the analysis process data generated during the run; the terminal 100 then stores the running status and the analysis process data together in the memory database.
  • FIG. 4 is a schematic flow diagram of the association analysis process. While the terminal 100 runs the association analysis process at the user's instruction to analyze the target heterogeneous data, it detects whether the running status of the process is abnormal. If the process is running abnormally, the terminal 100 stops running it, displays the node where the error occurred, obtains the analysis process data generated so far, stores the running status and the analysis process data in the memory database, and displays them to the user. If the process runs without exception, the terminal 100 runs the complete association analysis process, obtains the analysis process data generated during the run, stores the running status and the analysis process data in the memory database and displays them to the user, and obtains and displays the data association analysis result.
  • The technical solution of this embodiment detects the running status of the association analysis process and stores the running status and analysis process data in the memory database. The terminal 100 can thus detect whether the process is running abnormally, and the user can trace back the analysis process data and judge whether it meets the requirements, which helps improve the accuracy of the data association analysis results obtained after running the association analysis process.
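  • The run-time detection described above can be sketched as a loop that executes each component in turn, records its status and intermediate output, and stops at the first failing node. The names `run_process` and `steps` are hypothetical, and a plain Python list stands in for the records that the text says are stored in the memory database.

```python
def run_process(steps, data):
    """Run components in order; record status and process data, stop on the first error."""
    log = []
    for name, func in steps:
        try:
            data = func(data)
        except Exception as exc:
            # Abnormal run: record the failing node and stop the process.
            log.append({"node": name, "status": "error", "detail": str(exc)})
            return None, log
        # Normal run: keep the intermediate output so the user can trace it back.
        log.append({"node": name, "status": "ok", "data": data})
    return data, log


steps = [
    ("double", lambda rows: [value * 2 for value in rows]),
    ("pick_missing", lambda rows: rows[10]),  # fails: index out of range
    ("never_reached", lambda rows: rows),
]
result, run_log = run_process(steps, [1, 2, 3])
```

  • Here `run_log` holds the successful output of `double` followed by the error entry for `pick_missing`, which mirrors displaying the error node together with the analysis process data produced before it.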
  • a process data analysis method applied to multi-source heterogeneous data such as structured data and unstructured data is provided, as shown in Figure 4, the main steps include:
  • Step 1: Data import.
  • the data format can be unstructured data such as TXT, JSON, Excel, and CSV, or structured data such as database backup files.
  • the upload of structured data follows the traditional ETL method, while unstructured data formats such as Excel need to follow the rules of the system, and the system will provide corresponding template files to help users sort out data formats that comply with the rules.
  • the so-called system rules mean that for unstructured data such as TXT, Excel, and CSV, the field names and data content of their data columns will be placed according to the system's agreed rules, which is convenient for the system to analyze unstructured data in the background.
  • Step 2: Data preprocessing.
  • This step parses the unstructured data uploaded by the user and calls the data parser to recognize it.
  • Following the parsing rules specified in Step 1, the process parses out the field names and field content of the unstructured data in the unstructured data source, creates an intermediate table named after the unstructured data, and stores it in the in-memory database.
  • The in-memory database serves mainly as a temporary storage area: it holds the intermediate tables generated during data analysis and supports associations between tables. In addition, keeping the data the analyst cares about in memory helps speed up execution.
  • Step 3: Form the intermediate table.
  • This step uses the data integration module, which extracts part of the data in the data warehouse and stores it in the in-memory database.
  • Besides recording the intermediate tables created in Step 2, this step also extracts metadata information from the data warehouse through the data integration module to form intermediate tables.
  • Such an intermediate table contains only the meta-information of the table in the data warehouse, not its data content; the data is extracted into the in-memory database only when the table's data is actually used. Extracting the metadata information and creating an intermediate table with the same name in the in-memory database provides the basis for cross-database association queries.
  • Step 4: Intermediate table management.
  • This step runs mainly in the background, and the user is not aware of it.
  • It calls the system's intermediate table management module, which maintains the information of all intermediate tables and presents them to the user as a list.
  • Step 5: Associate components with intermediate tables.
  • In this application, the analysis functions that a user needs during data analysis are each packaged into a functional entity called a component, and the components are stored together in the component library. The components include input data, output data, associate data, row concatenation, formula, merge columns, group aggregation, select columns, rows to columns, columns to rows, filter rows, deduplication, value replacement, Null value conversion, and other functions. The user associates these functional components with the data in the intermediate tables to build the data analysis process.
  • According to the analysis requirements, the user drags analysis components from the component library into the window and clicks a component to configure its intermediate table.
  • The user selects the corresponding intermediate table to bind to the component. After binding, clicking the component previews the table's data.
  • If the intermediate table selected in an associate-data component comes from a field of a table in the data warehouse, the background automatically calls the data integration module to extract the data of the associated field that the user cares about and saves it in the in-memory database for cross-database association queries.
  • Step 6: Form a sub-process.
  • While combining and dragging components to build the analysis process, the user does not need to wait until the entire process is drawn before running it to see results.
  • The user connects two components with a line carrying a one-way arrow to form a sub-process.
  • The direction of the arrow is the direction of data flow. Clicking the component at the end of the arrow runs the sub-process and previews the result. The background saves the preview result at this point, which facilitates later tracing.
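  • Step 7: Merge sub-processes.
  • After running each sub-process, if the preview data meets expectations, the user connects all the sub-processes into a complete analysis process; the data output by one sub-process becomes the input of the next sub-process, as directed by the process arrows. This function relies mainly on the system's visual process engine module, which provides dynamic drawing of the flowchart and the mechanism by which background data flows along the process arrows.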
  • Step 8: Execute the complete process.
  • After the sub-processes are merged, the complete association analysis process can be run, during which the system checks the execution status of every component in the process. If a component fails at some stage because of a data conversion error, the run stops and the failing process node is marked. The user can go back to the previous sub-process and inspect its result; once the problem data is found and the component parameters are adjusted, execution can continue.
  • Step 9: Analysis result output.
  • After the complete association analysis process has run successfully, the system presents the analysis results as a table.
  • The user can also display them with visual charting tools, at which point the process ends.
  • The above application example builds a process-based analysis system that addresses the difficulty of data association analysis in a multi-source heterogeneous data environment. By being process-based, component-based, and visual, it also makes data analysis convenient for non-technical users, lowering the threshold for data analysis and making the analysis process traceable.
  • In one embodiment, as shown in FIG. 5, a data association analysis device is provided. The device 500 may include:
  • a data determination module 501, configured to determine at least two types of heterogeneous data sources;
  • a data extraction module 502, configured to extract metadata information of the heterogeneous data in each heterogeneous data source, form intermediate tables containing the metadata information of each type of heterogeneous data, and store the intermediate tables in the in-memory database;
  • a data association module 503, configured to take the metadata information in the in-memory database that is associated with a data component in the association analysis process as target metadata information;
  • a data storage module 504, configured to obtain, according to the target metadata information, the target heterogeneous data corresponding to the target metadata information from the corresponding heterogeneous data source, and store the target heterogeneous data in the in-memory database;
  • a data obtaining module 505, configured to run the association analysis process to analyze the target heterogeneous data in the in-memory database and obtain a data association analysis result.
  • In one embodiment, the device 500 further includes a data combination module, configured to: display a user interface, the user interface being used by the user to deploy the association analysis process; in response to the user's component selection instruction, place the plurality of components selected by the user from the component library in the user interface, the components including at least the data component; and, in response to the user's component connection instruction, connect the corresponding components displayed in the user interface to form the association analysis process.
  • In one embodiment, the data association module 503 is configured to: receive an intermediate table configuration instruction triggered by the user on a data component in the association analysis process; in response to the intermediate table configuration instruction, display multiple intermediate tables; take the intermediate table selected by the user from the multiple intermediate tables as the target intermediate table and configure the target intermediate table for the data component; and take the metadata information selected by the user in the target intermediate table as the target metadata information.
  • In one embodiment, the data storage module 504 is configured to: determine, from the at least two types of heterogeneous data sources, the heterogeneous data source to which the target metadata information belongs, as the target heterogeneous data source; take the heterogeneous data in the target heterogeneous data source that corresponds to the target metadata information as the target heterogeneous data; and obtain the target heterogeneous data and store it in the in-memory database.
  • In one embodiment, the device 500 further includes a data upload module, configured to: obtain at least two types of heterogeneous data uploaded by the user; and store the at least two types of heterogeneous data in different hard disk databases respectively, to obtain the at least two types of heterogeneous data sources.
  • In one embodiment, the at least two types of heterogeneous data sources include a structured data source and an unstructured data source, and the at least two types of heterogeneous data include structured data and unstructured data.
  • In one embodiment, the device 500 further includes a data detection module, configured to: while the association analysis process is running, detect the running status of the association analysis process and obtain the analysis process data generated in the association analysis process; and store the running status and the analysis process data in the in-memory database.
  • For the specific limitations on the data association analysis device, refer to the limitations on the data association analysis method above; they are not repeated here. Each module in the above data association analysis device can be implemented wholly or partly in software, hardware, or a combination of the two.
  • The above modules can be embedded in, or independent of, the processor of the computer device in hardware form, or stored in the memory of the computer device in software form, so that the processor can invoke them and perform the operations corresponding to each module.
  • In one embodiment, a computer device is provided.
  • The computer device may be a terminal, and its internal structure may be as shown in FIG. 6.
  • The computer device includes a processor, a memory, a communication interface, a display screen, and an input device connected through a system bus.
  • The processor of the computer device provides computing and control capabilities.
  • The memory of the computer device includes a non-volatile storage medium and an internal memory.
  • The non-volatile storage medium stores an operating system and a computer program.
  • The internal memory provides an environment for running the operating system and the computer program stored in the non-volatile storage medium.
  • The communication interface of the computer device is used for wired or wireless communication with an external terminal; wireless communication can be implemented through WIFI, an operator network, NFC (Near Field Communication), or other technologies.
  • When executed by the processor, the computer program implements a data association analysis method.
  • The display screen of the computer device may be a liquid crystal display or an electronic ink display. The input device of the computer device may be a touch layer covering the display screen, a button, trackball, or touchpad provided on the casing of the computer device, or an external keyboard, touchpad, or mouse.
  • Those skilled in the art will understand that the structure shown in FIG. 6 is only a block diagram of the part of the structure related to the solution of this application, and does not limit the computer device to which the solution is applied.
  • A specific computer device may include more or fewer components than shown in the figure, combine some components, or have a different arrangement of components.
  • In one embodiment, a computer device is also provided, including a memory and a processor; a computer program is stored in the memory, and the processor implements the steps of the above method embodiments when executing the computer program.
  • In one embodiment, a computer-readable storage medium is provided, on which a computer program is stored; when the computer program is executed by a processor, the steps of the above method embodiments are implemented.
  • Non-volatile memory may include read-only memory (Read-Only Memory, ROM), magnetic tape, floppy disk, flash memory or optical memory, etc.
  • Volatile memory can include Random Access Memory (RAM) or external cache memory.
  • RAM can take many forms, such as Static Random Access Memory (SRAM) or Dynamic Random Access Memory (DRAM).

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

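A data association analysis method and device, a computer device, and a storage medium, relating to the technical field of data analysis, which enable fast association analysis of heterogeneous data and improve the efficiency of data association analysis. The method includes: determining at least two types of heterogeneous data sources; extracting metadata information of the heterogeneous data in each heterogeneous data source, forming intermediate tables containing the metadata information of each type of heterogeneous data, and storing the intermediate tables in an in-memory database; taking the metadata information in the in-memory database that is associated with a data component in an association analysis process as target metadata information; obtaining, according to the target metadata information, the target heterogeneous data corresponding to the target metadata information from the corresponding heterogeneous data source and storing it in the in-memory database; and running the association analysis process to analyze the target heterogeneous data in the in-memory database to obtain a data association analysis result.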

Description

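Data Association Analysis Method and Device, Computer Device, and Storage Medium
Technical Field
The present application relates to the technical field of data analysis, and in particular to a data association analysis method and device, a computer device, and a storage medium.
Background
With the widespread adoption and rapid development of computer technology, the data generated by information systems is growing explosively. Large volume and heterogeneity are characteristic of how today's information systems store data. These characteristics, however, aggravate the formation of data silos, so that the data rarely delivers the value it should, for example to an enterprise's business decisions. Finding effective data association analysis techniques to mine the potential value of heterogeneous data has therefore become a pressing real-world need.
In current technology, data analysis usually works by building a unified data warehouse to collect data and then performing analysis on top of the data warehouse. Under this approach, however, unstructured data such as Excel files and reports is difficult to analyze in association with the data in a database, which leads to the technical problem of low data association analysis efficiency. In addition, this approach requires analysts to know SQL syntax, which is an obstacle for people who only know spreadsheet-style analysis such as Excel.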
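Summary
In view of this, it is necessary to provide a data association analysis method and device, a computer device, and a storage medium that address the above technical problem.
A data association analysis method, the method including:
determining at least two types of heterogeneous data sources;
extracting metadata information of the heterogeneous data in each heterogeneous data source, forming intermediate tables containing the metadata information of each type of heterogeneous data, and storing the intermediate tables in an in-memory database;
taking the metadata information in the in-memory database that is associated with a data component in an association analysis process as target metadata information;
obtaining, according to the target metadata information, the target heterogeneous data corresponding to the target metadata information from the corresponding heterogeneous data source, and storing it in the in-memory database;
running the association analysis process to analyze the target heterogeneous data in the in-memory database to obtain a data association analysis result.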
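In one embodiment, the method further includes:
displaying a user interface, the user interface being used by a user to deploy the association analysis process;
in response to a component selection instruction from the user, placing the plurality of components selected by the user from a component library in the user interface, the components including at least the data component;
in response to a component connection instruction from the user, connecting the corresponding components displayed in the user interface to form the association analysis process.
In one embodiment, there are multiple intermediate tables, and taking the metadata information in the in-memory database that is associated with a data component in the association analysis process as target metadata information includes:
receiving an intermediate table configuration instruction triggered by the user on a data component in the association analysis process;
in response to the intermediate table configuration instruction, displaying the multiple intermediate tables;
taking the intermediate table selected by the user from the multiple intermediate tables as a target intermediate table and configuring the target intermediate table for the data component;
taking the metadata information selected by the user in the target intermediate table as the target metadata information.
In one embodiment, obtaining, according to the target metadata information, the target heterogeneous data corresponding to the target metadata information from the corresponding heterogeneous data source and storing it in the in-memory database includes:
determining, from the at least two types of heterogeneous data sources, the heterogeneous data source to which the target metadata information belongs, as a target heterogeneous data source;
taking the heterogeneous data in the target heterogeneous data source that corresponds to the target metadata information as the target heterogeneous data;
obtaining the target heterogeneous data and storing it in the in-memory database.
In one embodiment, the method further includes:
obtaining at least two types of heterogeneous data uploaded by a user;
storing the at least two types of heterogeneous data in different hard disk databases respectively, to obtain the at least two types of heterogeneous data sources.
In one embodiment, the at least two types of heterogeneous data sources include a structured data source and an unstructured data source, and the at least two types of heterogeneous data include structured data and unstructured data.
In one embodiment, the method further includes:
while the association analysis process is running, detecting the running status of the association analysis process and obtaining the analysis process data generated in the association analysis process;
storing the running status and the analysis process data in the in-memory database.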
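A data association analysis device, the device including:
a data determination module, configured to determine at least two types of heterogeneous data sources;
a data extraction module, configured to extract metadata information of the heterogeneous data in each heterogeneous data source, form intermediate tables containing the metadata information of each type of heterogeneous data, and store the intermediate tables in an in-memory database;
a data association module, configured to take the metadata information in the in-memory database that is associated with a data component in an association analysis process as target metadata information;
a data storage module, configured to obtain, according to the target metadata information, the target heterogeneous data corresponding to the target metadata information from the corresponding heterogeneous data source and store it in the in-memory database;
a data obtaining module, configured to run the association analysis process to analyze the target heterogeneous data in the in-memory database and obtain a data association analysis result.
A computer device, including a memory and a processor, the memory storing a computer program, wherein the processor implements the following steps when executing the computer program:
determining at least two types of heterogeneous data sources; extracting metadata information of the heterogeneous data in each heterogeneous data source, forming intermediate tables containing the metadata information of each type of heterogeneous data, and storing the intermediate tables in an in-memory database; taking the metadata information in the in-memory database that is associated with a data component in an association analysis process as target metadata information; obtaining, according to the target metadata information, the target heterogeneous data corresponding to the target metadata information from the corresponding heterogeneous data source and storing it in the in-memory database; and running the association analysis process to analyze the target heterogeneous data in the in-memory database to obtain a data association analysis result.
A computer-readable storage medium storing a computer program, wherein the following steps are implemented when the computer program is executed by a processor:
determining at least two types of heterogeneous data sources; extracting metadata information of the heterogeneous data in each heterogeneous data source, forming intermediate tables containing the metadata information of each type of heterogeneous data, and storing the intermediate tables in an in-memory database; taking the metadata information in the in-memory database that is associated with a data component in an association analysis process as target metadata information; obtaining, according to the target metadata information, the target heterogeneous data corresponding to the target metadata information from the corresponding heterogeneous data source and storing it in the in-memory database; and running the association analysis process to analyze the target heterogeneous data in the in-memory database to obtain a data association analysis result.
With the above data association analysis method and device, computer device, and storage medium, at least two types of heterogeneous data sources are determined; metadata information of the heterogeneous data in each heterogeneous data source is extracted, intermediate tables containing the metadata information of each type of heterogeneous data are formed, and the intermediate tables are stored in an in-memory database; the metadata information in the in-memory database that is associated with a data component in an association analysis process is taken as target metadata information; according to the target metadata information, the corresponding target heterogeneous data is obtained from the corresponding heterogeneous data source and stored in the in-memory database; and the association analysis process is run to analyze the target heterogeneous data in the in-memory database to obtain a data association analysis result. In this solution, multiple heterogeneous data sources that store the respective heterogeneous data are determined; the metadata information corresponding to the heterogeneous data stored in each source is extracted and stored in corresponding intermediate tables; the resulting intermediate tables are stored in the in-memory database; the metadata information in the in-memory database is associated with the data components in the association analysis process and taken as the target metadata information; the heterogeneous data corresponding to the target metadata information is taken as the target heterogeneous data; the target heterogeneous data is obtained from the corresponding heterogeneous data source and stored in the in-memory database; and the association analysis process is run to analyze the target heterogeneous data in memory to obtain the data association analysis result. This enables fast association analysis of heterogeneous data and improves the efficiency of data association analysis.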
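Brief Description of the Drawings
FIG. 1 is a diagram of the application environment of the data association analysis method in one embodiment;
FIG. 2 is a schematic flowchart of the data association analysis method in one embodiment;
FIG. 3 is a schematic diagram of the interface used to form the association analysis process in one embodiment;
FIG. 4 is a schematic flowchart of the association analysis process in another embodiment;
FIG. 5 is a structural block diagram of the data association analysis device in one embodiment;
FIG. 6 is a diagram of the internal structure of the computer device in one embodiment.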
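Detailed Description
To make the purpose, technical solutions, and advantages of the present application clearer, the application is described in further detail below with reference to the drawings and embodiments. It should be understood that the specific embodiments described here are only intended to explain the application and not to limit it.
It should be noted that the user-related information (including but not limited to user device information and user personal information) and data (including but not limited to data for display and data for analysis), and the processing thereof, involved in this application are all information and data authorized by the user or fully authorized by all parties. Correspondingly, this application also provides a user authorization entry through which the user can choose to grant or refuse authorization.
The data association analysis method provided by this application can be applied in the application scenario shown in FIG. 1. The scenario may include a terminal 100 and multiple heterogeneous data sources, and the terminal 100 can establish communication connections with each heterogeneous data source. Specifically, the terminal 100 determines at least two types of heterogeneous data sources; it then extracts metadata information of the heterogeneous data in each heterogeneous data source, forms intermediate tables containing the metadata information of each type of heterogeneous data, and stores the intermediate tables in an in-memory database; it then takes the metadata information in the in-memory database that is associated with a data component in an association analysis process as target metadata information; according to the target metadata information, it obtains the corresponding target heterogeneous data from the corresponding heterogeneous data source and stores it in the in-memory database; finally, it runs the association analysis process to analyze the target heterogeneous data in the in-memory database and obtains a data association analysis result. The terminal 100 may be, but is not limited to, a personal computer, laptop, smartphone, or tablet.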
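In one embodiment, as shown in FIG. 2, a data association analysis method is provided. Taking its application to the terminal 100 shown in FIG. 1 as an example, the method may include the following steps:
Step S201: determine at least two types of heterogeneous data sources.
Heterogeneous data refers to data whose structures differ from one another. At least two types of heterogeneous data sources refers to two or more types of data sources that each store data of a different structure; for example, they may include a database storing structured data and a database storing unstructured data.
Specifically, before the user needs the terminal 100 to perform data association analysis on at least two types of heterogeneous data, the terminal 100 can first determine the at least two types of heterogeneous data sources, each storing the corresponding heterogeneous data, that the user needs to analyze.
In one embodiment, the at least two types of heterogeneous data sources can first be obtained through the following steps: obtaining at least two types of heterogeneous data uploaded by the user, and storing the at least two types of heterogeneous data in different hard disk databases respectively, to obtain the at least two types of heterogeneous data sources.
A hard disk database is a database stored on a hard disk. Specifically, the terminal 100 obtains at least two types of heterogeneous data uploaded by the user and stores them in different hard disk databases respectively, obtaining at least two types of heterogeneous data sources. For example, before the heterogeneous data sources are determined, the user can upload structured and unstructured data. The terminal 100 receives the structured data that the user uploads in the ETL (Extract-Transform-Load) manner and stores it in a structured data source on the hard disk; it receives the unstructured data that the user uploads, in which the field names and the data content of the data columns are placed separately according to the system's template file, and stores it in an unstructured data source on the hard disk; and it determines the structured data source and the unstructured data source as different types of heterogeneous data sources.
In the technical solution of this embodiment, storing the at least two types of heterogeneous data uploaded by the user in different hard disk databases yields at least two types of heterogeneous data sources. The terminal 100 can thus determine the corresponding types of heterogeneous data sources from the types of heterogeneous data the user uploads, which diversifies the types of heterogeneous data the terminal 100 can handle and helps it perform data association analysis on more kinds of heterogeneous data.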
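Step S202: extract metadata information of the heterogeneous data in each heterogeneous data source, form intermediate tables containing the metadata information of each type of heterogeneous data, and store the intermediate tables in the in-memory database.
In this step, metadata information is data stored in an intermediate table that contains the key information of the corresponding heterogeneous data and can be used to link to that heterogeneous data. For example, if a piece of heterogeneous data is the body text of a book, the metadata information can be the book's table of contents, and the terminal 100 can use the table of contents (the metadata information) to find the corresponding body text in the book (that is, link to the corresponding heterogeneous data). An intermediate table is a data table stored in the in-memory database, used mainly to store the metadata information corresponding to heterogeneous data; the in-memory database is a database held in memory. For example, an intermediate table can be a data table in memory that stores the book's table of contents (the metadata information).
Specifically, the terminal 100 extracts the metadata information corresponding to the heterogeneous data stored in each heterogeneous data source, stores the extracted metadata information in corresponding intermediate tables, and stores the resulting intermediate tables, which contain the metadata information of each type of heterogeneous data, in the in-memory database.
Step S203: take the metadata information in the in-memory database that is associated with a data component in the association analysis process as target metadata information.
In this step, as shown in FIG. 3, the association analysis process is a process for performing association analysis on heterogeneous data, and it can consist of one or more data components and one or more non-data components connected to each other. A data component is a component in the association analysis process that represents data, such as an input dataset component or an output dataset component. A non-data component is a component that represents an operation performed on data, such as an insert-formula component or a merge-columns component. Target metadata information is the metadata information associated with a data component in the association analysis process. Specifically, the terminal 100 can associate metadata information in the in-memory database with a data component in the association analysis process and take the metadata information associated with the data component as the target metadata information. For example, if an input dataset component needs to be associated with the table of contents of chapter 3 of a book, then the metadata information in the in-memory database that corresponds to the table of contents of chapter 3 of that book is the target metadata information.
In one embodiment, step S203 specifically includes: receiving an intermediate table configuration instruction triggered by the user on a data component in the association analysis process; in response to the intermediate table configuration instruction, displaying multiple intermediate tables; taking the intermediate table selected by the user from the multiple intermediate tables as the target intermediate table and configuring the target intermediate table for the data component; and taking the metadata information selected by the user in the target intermediate table as the target metadata information.
In this embodiment, there are multiple intermediate tables. The intermediate table configuration instruction is an instruction that directs the terminal 100 to display the multiple intermediate tables so that the user can select from them the target intermediate table to configure for the data component. The target intermediate table is the intermediate table that the user selects from the multiple intermediate tables to configure the data component.
Specifically, the user can trigger an intermediate table configuration instruction on one or more data components in the association analysis process. The terminal 100 receives and responds to the instruction and displays multiple intermediate tables for the user to choose from. The user selects the intermediate table needed; the terminal 100 takes the selected intermediate table as the target intermediate table and configures it for the data component. The user can then select one or more pieces of metadata information in the target intermediate table, and the terminal 100 takes the selected metadata information as the target metadata information.
In the technical solution of this embodiment, the user configures a target intermediate table for a data component in the association analysis process and selects one or more pieces of metadata information in it as the target metadata information. This allows the terminal 100 to accurately record and intuitively present the target intermediate tables and target metadata information configured for the data components, makes it faster to input or replace the heterogeneous data to be analyzed, and thus supports process-based management of data association analysis and improves analysis efficiency.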
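Step S204: according to the target metadata information, obtain the target heterogeneous data corresponding to the target metadata information from the corresponding heterogeneous data source and store it in the in-memory database.
The target heterogeneous data is the heterogeneous data originally stored in the heterogeneous data source that corresponds to the target metadata information. For example, when a piece of heterogeneous data is the body text of a book and the metadata information is the book's table of contents, the target heterogeneous data can be the body text of chapter 3 of that book.
In one embodiment, step S204 specifically includes: determining, from the at least two types of heterogeneous data sources, the heterogeneous data source to which the target metadata information belongs, as the target heterogeneous data source; taking the heterogeneous data in the target heterogeneous data source that corresponds to the target metadata information as the target heterogeneous data; and obtaining the target heterogeneous data and storing it in the in-memory database.
The target heterogeneous data source is, among the at least two types of heterogeneous data sources, the one to which the target metadata information belongs and in which the target heterogeneous data is stored.
Specifically, the terminal 100 searches the multiple types of heterogeneous data sources, determines the heterogeneous data source to which the target metadata information belongs, and takes it as the target heterogeneous data source. The terminal 100 then searches the target heterogeneous data source, determines the heterogeneous data corresponding to the target metadata information, and takes it as the target heterogeneous data. Finally, the terminal 100 obtains the target heterogeneous data and stores it in the in-memory database.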
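The lookup in step S204 can be pictured with a small sketch. It assumes, purely for illustration, that each piece of metadata records which source, table, and column it came from, and it uses SQLite connections as stand-ins for both the on-disk sources and the in-memory database; the names `FieldMeta`, `sources`, and `fetch_target` are hypothetical and do not come from the application.

```python
import sqlite3
from dataclasses import dataclass


@dataclass(frozen=True)
class FieldMeta:
    """Hypothetical metadata entry: where a field lives."""
    source: str
    table: str
    column: str


# Stand-ins for two heterogeneous data sources (assumption: SQLite for both).
sources = {
    "structured": sqlite3.connect(":memory:"),
    "unstructured": sqlite3.connect(":memory:"),
}
sources["structured"].execute("CREATE TABLE accounts (acct TEXT, balance REAL)")
sources["structured"].executemany(
    "INSERT INTO accounts VALUES (?, ?)", [("A1", 10.0), ("A2", 20.0)]
)
sources["unstructured"].execute("CREATE TABLE sheet1 (acct TEXT, note TEXT)")
sources["unstructured"].execute("INSERT INTO sheet1 VALUES ('A1', 'VIP')")

memdb = sqlite3.connect(":memory:")  # stand-in for the in-memory database


def fetch_target(meta):
    """Resolve the owning source, read the target column, copy it into memdb."""
    target_source = sources[meta.source]
    values = target_source.execute(
        f'SELECT "{meta.column}" FROM "{meta.table}"'
    ).fetchall()
    memdb.execute(f'CREATE TABLE IF NOT EXISTS "{meta.table}" ("{meta.column}")')
    memdb.executemany(f'INSERT INTO "{meta.table}" VALUES (?)', values)
    return len(values)


copied = fetch_target(FieldMeta("structured", "accounts", "balance"))
total_balance = memdb.execute("SELECT SUM(balance) FROM accounts").fetchone()[0]
```

Only the column named by the metadata is copied, which mirrors the idea that data is moved into memory only once a component actually refers to it. Identifiers are interpolated into SQL here for brevity; a real system would validate them first.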
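In the technical solution of this embodiment, the target heterogeneous data source is determined from the target metadata information, the target heterogeneous data is determined from it in turn, and the target heterogeneous data is stored in the in-memory database. This allows the terminal 100 to accurately obtain the target heterogeneous data the user needs, which helps improve the accuracy of the data association analysis result obtained after running the association analysis process.
Step S205: run the association analysis process to analyze the target heterogeneous data in the in-memory database and obtain a data association analysis result.
The data association analysis result is the result obtained by running the association analysis process to perform association analysis on the target heterogeneous data.
Specifically, once an association analysis process containing data components has been formed, the terminal 100 can run it at the user's instruction, analyzing in memory the target heterogeneous data that is stored in the in-memory database and associated with the data components of the process, and obtaining the data association analysis result when the process ends.
In the above data association analysis method, at least two types of heterogeneous data sources are determined; metadata information of the heterogeneous data in each heterogeneous data source is extracted, intermediate tables containing the metadata information of each type of heterogeneous data are formed, and the intermediate tables are stored in the in-memory database; the metadata information in the in-memory database that is associated with a data component in the association analysis process is taken as target metadata information; according to the target metadata information, the corresponding target heterogeneous data is obtained from the corresponding heterogeneous data source and stored in the in-memory database; and the association analysis process is run to analyze the target heterogeneous data in the in-memory database to obtain a data association analysis result. In this solution, multiple heterogeneous data sources that store the respective heterogeneous data are determined; the metadata information corresponding to the heterogeneous data stored in each source is extracted and stored in corresponding intermediate tables; the resulting intermediate tables are stored in the in-memory database; the metadata information in the in-memory database is associated with the data components in the association analysis process and taken as the target metadata information; the heterogeneous data corresponding to the target metadata information is taken as the target heterogeneous data; the target heterogeneous data is obtained from the corresponding heterogeneous data source and stored in the in-memory database; and the association analysis process is run to analyze the target heterogeneous data in memory, yielding the data association analysis result when the analysis ends. This enables fast association analysis of heterogeneous data.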
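In one embodiment, the method can also form the association analysis process through the following steps: displaying a user interface; in response to the user's component selection instruction, placing the multiple components selected by the user from the component library in the user interface; and, in response to the user's component connection instruction, connecting the corresponding components displayed in the user interface to form the association analysis process.
In this embodiment, the user interface is used by the user to deploy the association analysis process. The component selection instruction is an instruction sent by the user that directs the terminal 100 to place the multiple components selected by the user from the component library in the user interface. The component library contains multiple components, which include at least the data component. The component connection instruction is an instruction sent by the user that directs the terminal 100 to connect the corresponding components displayed in the user interface to form the association analysis process.
Specifically, the terminal 100 displays the user interface used to deploy the association analysis process and receives the component selection instruction sent by the user, which directs the terminal 100 to place the multiple components selected by the user, including at least the data component, in the user interface. The terminal 100 responds to the component selection instruction and places the components selected by the user from the component library in the user interface. It then receives the component connection instruction sent by the user, which directs the terminal 100 to connect the corresponding components displayed in the user interface, and responds by connecting those components to form the association analysis process.
In some embodiments, forming the association analysis process may further include the following. The terminal 100 displays the user interface and, in response to the user's component selection instruction, places the multiple components selected by the user from the component library in the user interface. In response to the user's component connection instruction, the terminal 100 connects one or more data components and one or more non-data components displayed in the user interface with lines carrying one-way arrows to form an association analysis sub-process. In response to a sub-process run instruction triggered by the user on the non-data component at the end of the arrow, the terminal 100 runs the association analysis sub-process and displays the analysis process data, forms the analysis process data into a new intermediate table, and stores it in the in-memory database. In response to further component connection instructions, the terminal 100 connects one or more data components and one or more non-data components displayed in the user interface with lines carrying one-way arrows to form multiple association analysis sub-processes. Finally, in response to the user's component connection instruction, the terminal 100 connects the multiple association analysis sub-processes displayed in the user interface with lines carrying one-way arrows to form the overall association analysis process.
In the technical solution of this embodiment, the association analysis process is formed by selecting and connecting components in the user interface. This allows the terminal 100 to accurately record and intuitively present the specific steps applied to the target heterogeneous data during data association analysis, which supports process-based management of the analysis and improves analysis efficiency.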
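In one embodiment, the association analysis process can also be monitored while it runs. The specific steps include: while the association analysis process is running, detecting its running status and obtaining the analysis process data generated in it, and storing the running status and the analysis process data in the in-memory database.
In this embodiment, the running status of the association analysis process is its status while it runs, for example whether it is running normally or abnormally. The analysis process data can include the data each component in the process produces after analyzing the output data of the preceding component it is connected to.
Specifically, while the terminal 100 runs the association analysis process at the user's instruction to analyze the target heterogeneous data, the terminal 100 checks whether the running status of the process is abnormal and obtains the analysis process data generated in the process; it then stores the running status together with the analysis process data in the in-memory database.
Further, FIG. 4 is a schematic flowchart of the association analysis process. While the terminal 100, together with the user's actions, runs the association analysis process to analyze the target heterogeneous data, the terminal 100 checks whether the running status of the process is abnormal. For example, when the process runs abnormally, the terminal 100 stops running it, displays the process node where the error occurred, obtains the analysis process data generated so far, stores the running status and the analysis process data in the in-memory database, and displays them to the user. When the process runs without exception, the terminal 100 runs the complete association analysis process, obtains the analysis process data it generates, stores the running status and the analysis process data in the in-memory database and displays them to the user, and then obtains and displays the data association analysis result.
In the technical solution of this embodiment, the running status of the association analysis process is detected, and the running status and the analysis process data are stored in the in-memory database. This lets the terminal 100 detect whether the process is running abnormally and lets the user later trace back the analysis process data and judge whether it meets the user's requirements, which helps improve the accuracy of the data association analysis result obtained after running the process.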
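In one application example, a process-based data analysis method is provided that is compatible with multi-source heterogeneous data such as structured and unstructured data. As shown in FIG. 4, the main steps include:
Step 1: Data import.
The user uploads at least two types of heterogeneous data through the system portal. The data may be unstructured data in formats such as TXT, JSON, Excel, and CSV, or structured data such as database backup files. Structured data is uploaded in the traditional ETL manner, while unstructured data formats such as Excel must follow the system's rules; the system provides corresponding template files to help users arrange their data into a compliant format. The system rules mean that, for unstructured data such as TXT, Excel, and CSV, the field names and the data content of the data columns are placed separately according to the convention agreed with the system, which makes it easier for the system to parse the unstructured data in the background.
Step 2: Data preprocessing.
This step parses the unstructured data uploaded by the user and calls the data parser to recognize it. Following the parsing rules specified in Step 1, the process parses out the field names and field content of the unstructured data in the unstructured data source, creates an intermediate table named after the unstructured data, and stores it in the in-memory database. The in-memory database serves mainly as a temporary storage area: it holds the intermediate tables generated during data analysis and supports associations between tables. In addition, keeping the data the analyst cares about in memory helps speed up execution.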
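As a minimal sketch of the preprocessing in Step 2, the following assumes that the "system rule" is simply "the first row holds the field names and the remaining rows hold the content", and that SQLite's `:memory:` mode stands in for the in-memory database. Neither choice is specified by the application, and `load_upload` is a hypothetical helper name.

```python
import csv
import io
import sqlite3


def load_upload(conn, name, text):
    """Parse CSV text (first row = field names) into an intermediate table `name`."""
    rows = list(csv.reader(io.StringIO(text)))
    if not rows:
        raise ValueError("upload is empty")
    fields, body = rows[0], rows[1:]
    # Every column is stored as TEXT; type inference is out of scope here.
    columns = ", ".join(f'"{field}" TEXT' for field in fields)
    conn.execute(f'CREATE TABLE "{name}" ({columns})')
    placeholders = ", ".join("?" for _ in fields)
    conn.executemany(f'INSERT INTO "{name}" VALUES ({placeholders})', body)
    return fields


memdb = sqlite3.connect(":memory:")  # stand-in for the in-memory database
fields = load_upload(memdb, "branch_report", "branch_id,revenue\nB01,120\nB02,95\n")
row_count = memdb.execute('SELECT COUNT(*) FROM "branch_report"').fetchone()[0]
```

The intermediate table is named after the uploaded data (`branch_report` here), matching the description that the table is created from the name of the unstructured data.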
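Step 3: Form the intermediate table.
This step uses the data integration module, which is the module that extracts part of the data in the data warehouse and stores it in the in-memory database.
Besides recording the intermediate tables created in Step 2, this step also extracts metadata information from the data warehouse through the data integration module to form intermediate tables. Such an intermediate table contains only the meta-information of the table in the data warehouse, not its data content; the data is extracted into the in-memory database only when the table's data is actually used. Extracting the metadata information and creating an intermediate table with the same name in the in-memory database provides the basis for cross-database association queries.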
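The metadata-only intermediate table of Step 3 can be sketched as follows. This is an illustration under assumptions: a second SQLite connection plays the data warehouse, the "metadata" is reduced to column names read with SQLite's `PRAGMA table_info`, and `register_metadata` and `extract_on_use` are hypothetical names for what the data integration module might do.

```python
import sqlite3

# Stand-in for the data warehouse (assumption: SQLite, for illustration only).
warehouse = sqlite3.connect(":memory:")
warehouse.execute("CREATE TABLE customers (id INTEGER, name TEXT, city TEXT)")
warehouse.executemany(
    "INSERT INTO customers VALUES (?, ?, ?)",
    [(1, "Li", "Guangzhou"), (2, "Wang", "Shenzhen")],
)
memdb = sqlite3.connect(":memory:")  # stand-in for the in-memory database


def register_metadata(table):
    """Create a same-named, empty intermediate table that carries only the schema."""
    info = warehouse.execute(f"PRAGMA table_info({table})").fetchall()
    columns = [row[1] for row in info]  # row[1] is the column name
    declared = ", ".join(f'"{c}"' for c in columns)
    memdb.execute(f'CREATE TABLE "{table}" ({declared})')
    return columns


def extract_on_use(table):
    """Copy the rows into the in-memory table only when a component needs them."""
    rows = warehouse.execute(f'SELECT * FROM "{table}"').fetchall()
    if rows:
        placeholders = ", ".join("?" for _ in rows[0])
        memdb.executemany(f'INSERT INTO "{table}" VALUES ({placeholders})', rows)
    return len(rows)


columns = register_metadata("customers")
rows_before = memdb.execute("SELECT COUNT(*) FROM customers").fetchone()[0]
copied = extract_on_use("customers")
rows_after = memdb.execute("SELECT COUNT(*) FROM customers").fetchone()[0]
```

Deferring the copy keeps the in-memory database small until a table is actually bound to a component, which appears to be the motivation for storing metadata first.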
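Step 4: Intermediate table management.
This step runs mainly in the background, and the user is not aware of it. It calls the system's intermediate table management module, which maintains the information of all intermediate tables and presents them to the user as a list.
Step 5: Associate components with intermediate tables.
In this application, the analysis functions that a user needs during data analysis are each packaged into a functional entity called a component, and the components are stored together in the component library. The components include input data, output data, associate data, row concatenation, formula, merge columns, group aggregation, select columns, rows to columns, columns to rows, filter rows, deduplication, value replacement, Null value conversion, and other functions. The user associates these functional components with the data in the intermediate tables to build the data analysis process.
According to the analysis requirements, the user drags analysis components from the component library into the window and clicks a component to configure its intermediate table. The user selects the corresponding intermediate table to bind to the component; after binding, clicking the component previews the table's data. If the intermediate table selected in an associate-data component comes from a field of a table in the data warehouse, the background automatically calls the data integration module to extract the data of the associated field that the user cares about and saves it in the in-memory database for cross-database association queries.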
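To illustrate how a component bound to an intermediate table might behave, here is a rough sketch. The class names `InputData` and `AssociateData` echo the component names in the text, but their fields and methods are assumptions made for this example, and the two tables are created directly in an in-memory SQLite database rather than loaded from real sources.

```python
import sqlite3
from dataclasses import dataclass

memdb = sqlite3.connect(":memory:")  # stand-in for the in-memory database
memdb.execute("CREATE TABLE orders (customer_id INTEGER, amount REAL)")  # e.g. from an uploaded sheet
memdb.execute("CREATE TABLE customers (id INTEGER, name TEXT)")          # e.g. pulled from the warehouse
memdb.executemany("INSERT INTO orders VALUES (?, ?)", [(1, 30.0), (2, 12.5), (1, 7.5)])
memdb.executemany("INSERT INTO customers VALUES (?, ?)", [(1, "Li"), (2, "Wang")])


@dataclass
class InputData:
    """Data component bound to one intermediate table."""
    table: str

    def preview(self, limit=5):
        return memdb.execute(f'SELECT * FROM "{self.table}" LIMIT ?', (limit,)).fetchall()


@dataclass
class AssociateData:
    """Joins two bound tables on a pair of key fields."""
    left: InputData
    right: InputData
    left_key: str
    right_key: str

    def run(self):
        sql = (
            f'SELECT l.*, r.* FROM "{self.left.table}" AS l '
            f'JOIN "{self.right.table}" AS r '
            f'ON l."{self.left_key}" = r."{self.right_key}" '
            "ORDER BY l.rowid"
        )
        return memdb.execute(sql).fetchall()


orders = InputData("orders")
customers = InputData("customers")
joined = AssociateData(orders, customers, "customer_id", "id").run()
```

Because both tables already sit in the same in-memory database, the cross-source association reduces to an ordinary join, so the user configuring the component does not have to write SQL.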
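Step 6: Form a sub-process.
While combining and dragging components to build the analysis process, the user does not need to wait until the entire process is drawn before running it to see results. The user connects two components with a line carrying a one-way arrow to form a sub-process; the direction of the arrow is the direction of data flow, and clicking the component at the end of the arrow runs the sub-process and previews the result. The background saves the preview result at this point, which facilitates later tracing.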
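A sub-process of this kind can be sketched as a single directed edge between two callables, with the preview cached for later tracing. The `previews` dictionary and the function names are hypothetical; the application does not say how preview results are stored.

```python
previews = {}  # hypothetical store for saved preview results, keyed by edge name


def input_rows():
    """Upstream data component: returns rows, including one duplicate."""
    return [
        {"id": 1, "city": "Guangzhou"},
        {"id": 1, "city": "Guangzhou"},
        {"id": 2, "city": None},
    ]


def deduplicate(rows):
    """Downstream component: drop rows that repeat an earlier row exactly."""
    seen, unique = set(), []
    for row in rows:
        key = tuple(sorted(row.items(), key=lambda kv: kv[0]))
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


def run_subflow(edge_name, source, sink):
    """Run source -> sink (the arrow direction) and keep the preview for tracing."""
    result = sink(source())
    previews[edge_name] = result
    return result


result = run_subflow("input->deduplicate", input_rows, deduplicate)
```

Keeping each edge's output means a later failure in the full process can be investigated by looking at the last saved preview instead of rerunning everything.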
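Step 7: Merge sub-processes.
After running each sub-process, if the preview data meets expectations, the user connects all the sub-processes into a complete analysis process; the data output by one sub-process becomes the input of the next sub-process, as directed by the process arrows. This function relies mainly on the system's visual process engine module, which provides dynamic drawing of the flowchart and the mechanism by which background data flows along the process arrows.
Step 8: Execute the complete process.
After the sub-processes are merged, the complete association analysis process can be run, during which the system checks the execution status of every component in the process. If a component fails at some stage because of a data conversion error, the run stops and the failing process node is marked. The user can go back to the previous sub-process and inspect its result; once the problem data is found and the component parameters are adjusted, execution can continue.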
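The stop-on-error behavior of Step 8 can be sketched for the simplest case, a linear chain of components. The trace format and the component names below are assumptions made for illustration; the application only states that the run stops, the failing node is marked, and earlier results remain available.

```python
def run_flow(nodes, data):
    """Run components in order; stop at the first failure and report the node."""
    trace = []  # one (node name, status, output or error text) entry per executed node
    for name, component in nodes:
        try:
            data = component(data)
        except Exception as exc:
            trace.append((name, "error", repr(exc)))
            return None, trace
        trace.append((name, "ok", data))
    return data, trace


def strip_blanks(values):
    return [v.strip() for v in values]


def to_number(values):
    return [float(v) for v in values]  # raises ValueError on unparseable text


def total(values):
    return sum(values)


nodes = [("strip_blanks", strip_blanks), ("to_number", to_number), ("total", total)]
bad_result, bad_trace = run_flow(nodes, [" 12.5", "n/a "])
good_result, good_trace = run_flow(nodes, [" 12.5", "7.5 "])
```

In the failing run, the last trace entry names `to_number` as the node in error, and the entry before it still holds the output of `strip_blanks`, which is the data a user would inspect before adjusting parameters and rerunning.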
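Step 9: Analysis result output.
After the complete association analysis process has run successfully, the system presents the analysis results as a table. The user can also display them with visual charting tools, at which point the process ends.
The above application example builds a process-based analysis system that addresses the difficulty of data association analysis in a multi-source heterogeneous data environment. By being process-based, component-based, and visual, it also makes data analysis convenient for non-technical users, lowering the threshold for data analysis and making the analysis process traceable.
It should be understood that although the steps in the above flowcharts are shown in sequence as indicated by the arrows, they are not necessarily executed in that order. Unless explicitly stated herein, there is no strict order restriction on their execution, and they can be executed in other orders. Moreover, at least some of the steps in the above flowcharts may include multiple sub-steps or stages, which are not necessarily completed at the same moment but can be executed at different times; nor are these sub-steps or stages necessarily executed sequentially, as they can be executed in turn or alternately with other steps or with at least part of the sub-steps or stages of other steps.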
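In one embodiment, as shown in FIG. 5, a data association analysis device is provided. The device 500 may include:
a data determination module 501, configured to determine at least two types of heterogeneous data sources;
a data extraction module 502, configured to extract metadata information of the heterogeneous data in each heterogeneous data source, form intermediate tables containing the metadata information of each type of heterogeneous data, and store the intermediate tables in the in-memory database;
a data association module 503, configured to take the metadata information in the in-memory database that is associated with a data component in the association analysis process as target metadata information;
a data storage module 504, configured to obtain, according to the target metadata information, the target heterogeneous data corresponding to the target metadata information from the corresponding heterogeneous data source and store it in the in-memory database;
a data obtaining module 505, configured to run the association analysis process to analyze the target heterogeneous data in the in-memory database and obtain a data association analysis result.
In one embodiment, the device 500 further includes a data combination module, configured to: display a user interface, the user interface being used by the user to deploy the association analysis process; in response to the user's component selection instruction, place the plurality of components selected by the user from the component library in the user interface, the components including at least the data component; and, in response to the user's component connection instruction, connect the corresponding components displayed in the user interface to form the association analysis process.
In one embodiment, the data association module 503 is configured to: receive an intermediate table configuration instruction triggered by the user on a data component in the association analysis process; in response to the intermediate table configuration instruction, display multiple intermediate tables; take the intermediate table selected by the user from the multiple intermediate tables as the target intermediate table and configure the target intermediate table for the data component; and take the metadata information selected by the user in the target intermediate table as the target metadata information.
In one embodiment, the data storage module 504 is configured to: determine, from the at least two types of heterogeneous data sources, the heterogeneous data source to which the target metadata information belongs, as the target heterogeneous data source; take the heterogeneous data in the target heterogeneous data source that corresponds to the target metadata information as the target heterogeneous data; and obtain the target heterogeneous data and store it in the in-memory database.
In one embodiment, the device 500 further includes a data upload module, configured to: obtain at least two types of heterogeneous data uploaded by the user; and store the at least two types of heterogeneous data in different hard disk databases respectively, to obtain the at least two types of heterogeneous data sources.
In one embodiment, the at least two types of heterogeneous data sources include a structured data source and an unstructured data source, and the at least two types of heterogeneous data include structured data and unstructured data.
In one embodiment, the device 500 further includes a data detection module, configured to: while the association analysis process is running, detect the running status of the association analysis process and obtain the analysis process data generated in the association analysis process; and store the running status and the analysis process data in the in-memory database.
For the specific limitations on the data association analysis device, refer to the limitations on the data association analysis method above; they are not repeated here. Each module in the above data association analysis device can be implemented wholly or partly in software, hardware, or a combination of the two. The above modules can be embedded in, or independent of, the processor of the computer device in hardware form, or stored in the memory of the computer device in software form, so that the processor can invoke them and perform the operations corresponding to each module.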
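In one embodiment, a computer device is provided. The computer device may be a terminal, and its internal structure may be as shown in FIG. 6. The computer device includes a processor, a memory, a communication interface, a display screen, and an input device connected through a system bus. The processor of the computer device provides computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system and a computer program. The internal memory provides an environment for running the operating system and the computer program stored in the non-volatile storage medium. The communication interface of the computer device is used for wired or wireless communication with an external terminal; wireless communication can be implemented through WIFI, an operator network, NFC (Near Field Communication), or other technologies. When executed by the processor, the computer program implements a data association analysis method. The display screen of the computer device may be a liquid crystal display or an electronic ink display. The input device of the computer device may be a touch layer covering the display screen, a button, trackball, or touchpad provided on the casing of the computer device, or an external keyboard, touchpad, or mouse.
Those skilled in the art will understand that the structure shown in FIG. 6 is only a block diagram of the part of the structure related to the solution of this application and does not limit the computer device to which the solution is applied. A specific computer device may include more or fewer components than shown in the figure, combine some components, or have a different arrangement of components.
In one embodiment, a computer device is also provided, including a memory and a processor; a computer program is stored in the memory, and the processor implements the steps of the above method embodiments when executing the computer program.
In one embodiment, a computer-readable storage medium is provided, on which a computer program is stored; when the computer program is executed by a processor, the steps of the above method embodiments are implemented.
Those of ordinary skill in the art will understand that all or part of the processes in the methods of the above embodiments can be completed by a computer program instructing the relevant hardware. The computer program can be stored in a non-volatile computer-readable storage medium and, when executed, may include the processes of the above method embodiments. Any reference to memory, storage, a database, or another medium used in the embodiments provided in this application may include at least one of non-volatile and volatile memory. Non-volatile memory may include read-only memory (ROM), magnetic tape, floppy disk, flash memory, or optical memory. Volatile memory may include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM can take many forms, such as static random access memory (SRAM) or dynamic random access memory (DRAM).
The technical features of the above embodiments can be combined arbitrarily. For brevity, not all possible combinations of the technical features in the above embodiments are described; however, as long as a combination of these technical features involves no contradiction, it should be considered within the scope of this specification.
The above embodiments express only several implementations of this application. Their descriptions are relatively specific and detailed, but they should not be construed as limiting the scope of the invention patent. It should be noted that those of ordinary skill in the art can make several variations and improvements without departing from the concept of this application, and these all fall within the protection scope of this application. The protection scope of this patent application shall therefore be determined by the appended claims.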

Claims (10)

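  1. A data association analysis method, characterized in that the method comprises:
    determining at least two types of heterogeneous data sources;
    extracting metadata information of the heterogeneous data in each heterogeneous data source, forming intermediate tables containing the metadata information of each type of heterogeneous data, and storing the intermediate tables in an in-memory database;
    taking the metadata information in the in-memory database that is associated with a data component in an association analysis process as target metadata information;
    obtaining, according to the target metadata information, the target heterogeneous data corresponding to the target metadata information from the corresponding heterogeneous data source, and storing it in the in-memory database;
    running the association analysis process to analyze the target heterogeneous data in the in-memory database to obtain a data association analysis result.
  2. The method according to claim 1, characterized in that the method further comprises:
    displaying a user interface, the user interface being used by a user to deploy the association analysis process;
    in response to a component selection instruction from the user, placing the plurality of components selected by the user from a component library in the user interface, the components including at least the data component;
    in response to a component connection instruction from the user, connecting the corresponding components displayed in the user interface to form the association analysis process.
  3. The method according to claim 1, characterized in that there are multiple intermediate tables, and taking the metadata information in the in-memory database that is associated with a data component in the association analysis process as target metadata information comprises:
    receiving an intermediate table configuration instruction triggered by a user on a data component in the association analysis process;
    in response to the intermediate table configuration instruction, displaying the multiple intermediate tables;
    taking the intermediate table selected by the user from the multiple intermediate tables as a target intermediate table and configuring the target intermediate table for the data component;
    taking the metadata information selected by the user in the target intermediate table as the target metadata information.
  4. The method according to claim 1, characterized in that obtaining, according to the target metadata information, the target heterogeneous data corresponding to the target metadata information from the corresponding heterogeneous data source and storing it in the in-memory database comprises:
    determining, from the at least two types of heterogeneous data sources, the heterogeneous data source to which the target metadata information belongs, as a target heterogeneous data source;
    taking the heterogeneous data in the target heterogeneous data source that corresponds to the target metadata information as the target heterogeneous data;
    obtaining the target heterogeneous data and storing it in the in-memory database.
  5. The method according to claim 1, characterized in that the method further comprises:
    obtaining at least two types of heterogeneous data uploaded by a user;
    storing the at least two types of heterogeneous data in different hard disk databases respectively, to obtain the at least two types of heterogeneous data sources.
  6. The method according to claim 5, characterized in that the at least two types of heterogeneous data sources include a structured data source and an unstructured data source, and the at least two types of heterogeneous data include structured data and unstructured data.
  7. The method according to claim 1, characterized in that the method further comprises:
    while the association analysis process is running, detecting the running status of the association analysis process and obtaining the analysis process data generated in the association analysis process;
    storing the running status and the analysis process data in the in-memory database.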
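  8. A data association analysis device, characterized in that the device comprises:
    a data determination module, configured to determine at least two types of heterogeneous data sources;
    a data extraction module, configured to extract metadata information of the heterogeneous data in each heterogeneous data source, form intermediate tables containing the metadata information of each type of heterogeneous data, and store the intermediate tables in an in-memory database;
    a data association module, configured to take the metadata information in the in-memory database that is associated with a data component in an association analysis process as target metadata information;
    a data storage module, configured to obtain, according to the target metadata information, the target heterogeneous data corresponding to the target metadata information from the corresponding heterogeneous data source and store it in the in-memory database;
    a data obtaining module, configured to run the association analysis process to analyze the target heterogeneous data in the in-memory database and obtain a data association analysis result.
  9. A computer device, comprising a memory and a processor, the memory storing a computer program, characterized in that the processor implements the steps of the method according to any one of claims 1 to 7 when executing the computer program.
  10. A computer-readable storage medium storing a computer program, characterized in that the steps of the method according to any one of claims 1 to 7 are implemented when the computer program is executed by a processor.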
PCT/CN2021/136435 2021-09-02 2021-12-08 数据关联分析方法、装置、计算机设备和存储介质 WO2023029275A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202111024822.X 2021-09-02
CN202111024822.XA CN113688288B (zh) 2021-09-02 2021-09-02 数据关联分析方法、装置、计算机设备和存储介质

Publications (1)

Publication Number Publication Date
WO2023029275A1 true WO2023029275A1 (zh) 2023-03-09

Family

ID=78584991

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/136435 WO2023029275A1 (zh) 2021-09-02 2021-12-08 数据关联分析方法、装置、计算机设备和存储介质

Country Status (2)

Country Link
CN (1) CN113688288B (zh)
WO (1) WO2023029275A1 (zh)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116166671A (zh) * 2023-04-21 2023-05-26 南方电网数字电网研究院有限公司 一种内存数据库表格预关联的处理方法、系统和介质

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113688288B (zh) * 2021-09-02 2023-09-29 广州广电运通金融电子股份有限公司 数据关联分析方法、装置、计算机设备和存储介质
CN114844562A (zh) * 2022-04-28 2022-08-02 深圳市东晟数据有限公司 一种基于不同区域的光纤信号关联分析方法
CN115562192B (zh) * 2022-09-27 2023-06-27 北京虎蜥信息技术有限公司 一种装配工艺图形化管理方法、系统、终端及存储介质

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20150120747A1 (en) * 2013-10-30 2015-04-30 Netapp, Inc. Techniques for searching data associated with devices in a heterogeneous data center
CN104933095A (zh) * 2015-05-22 2015-09-23 中国电子科技集团公司第十研究所 异构信息通用性关联分析系统及其分析方法
CN110837585A (zh) * 2019-11-07 2020-02-25 中盈优创资讯科技有限公司 多源异构的数据关联查询方法及系统
CN113688288A (zh) * 2021-09-02 2021-11-23 广州广电运通金融电子股份有限公司 数据关联分析方法、装置、计算机设备和存储介质

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11487730B2 (en) * 2017-07-11 2022-11-01 International Business Machines Corporation Storage resource utilization analytics in a heterogeneous storage system environment using metadata tags
CN111177244A (zh) * 2019-12-24 2020-05-19 四川文轩教育科技有限公司 面向多个异构数据库的数据关联分析方法

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20150120747A1 (en) * 2013-10-30 2015-04-30 Netapp, Inc. Techniques for searching data associated with devices in a heterogeneous data center
CN104933095A (zh) * 2015-05-22 2015-09-23 中国电子科技集团公司第十研究所 异构信息通用性关联分析系统及其分析方法
CN110837585A (zh) * 2019-11-07 2020-02-25 中盈优创资讯科技有限公司 多源异构的数据关联查询方法及系统
CN113688288A (zh) * 2021-09-02 2021-11-23 广州广电运通金融电子股份有限公司 数据关联分析方法、装置、计算机设备和存储介质

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116166671A (zh) * 2023-04-21 2023-05-26 南方电网数字电网研究院有限公司 一种内存数据库表格预关联的处理方法、系统和介质
CN116166671B (zh) * 2023-04-21 2023-08-15 南方电网数字电网研究院有限公司 一种内存数据库表格预关联的处理方法、系统和介质

Also Published As

Publication number Publication date
CN113688288B (zh) 2023-09-29
CN113688288A (zh) 2021-11-23

Similar Documents

Publication Publication Date Title
WO2023029275A1 (zh) 数据关联分析方法、装置、计算机设备和存储介质
US11989707B1 (en) Assigning raw data size of source data to storage consumption of an account
US11308092B2 (en) Stream processing diagnostics
US10853382B2 (en) Interactive punchcard visualizations
US10122783B2 (en) Dynamic data-ingestion pipeline
US10810074B2 (en) Unified error monitoring, alerting, and debugging of distributed systems
US20150046512A1 (en) Dynamic collection analysis and reporting of telemetry data
US9356966B2 (en) System and method to provide management of test data at various lifecycle stages
US20130124957A1 (en) Structured modeling of data in a spreadsheet
US9946702B2 (en) Digital processing system for transferring data for remote access across a multicomputer data network and method thereof
US10175954B2 (en) Method of processing big data, including arranging icons in a workflow GUI by a user, checking process availability and syntax, converting the workflow into execution code, monitoring the workflow, and displaying associated information
US9009175B2 (en) System and method for database migration and validation
US9098497B1 (en) Methods and systems for building a search service application
US20210200939A1 (en) Document conversion, annotation, and data capturing system
CN107301214A (zh) 在hive中数据迁移方法、装置及终端设备
US20220245093A1 (en) Enhanced search performance using data model summaries stored in a remote data store
US20140279972A1 (en) Cleansing and standardizing data
CN110941629A (zh) 元数据处理方法、装置、设备及计算机可读存储介质
CN111221698A (zh) 任务数据采集方法与装置
CN115357590A (zh) 针对数据变更的记录方法、装置、电子设备及存储介质
US20200201853A1 (en) Collecting query metadata for application tracing
Asuncion Automated data provenance capture in spreadsheets, with case studies
CN113962597A (zh) 一种数据分析方法、装置、电子设备及存储介质
US10824803B2 (en) System and method for logical identification of differences between spreadsheets
US11841827B2 (en) Facilitating generation of data model summaries

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21955803

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE