CN116738356A - Multi-source heterogeneous data acquisition and fusion method, equipment and storage medium - Google Patents

Multi-source heterogeneous data acquisition and fusion method, equipment and storage medium

Info

Publication number
CN116738356A
Authority
CN
China
Prior art keywords
data
source
data source
target
query
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310514985.9A
Other languages
Chinese (zh)
Inventor
张希翔
蒙琦
董贇
艾徐华
黄汉华
周迪贵
古哲德
覃宁
陶思恒
孟椿智
谢菁
谭期文
韦宗慧
宁梓宏
孟春辰
陈燕雁
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangxi Power Grid Co Ltd
Original Assignee
Guangxi Power Grid Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangxi Power Grid Co Ltd
Priority to CN202310514985.9A
Publication of CN116738356A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/25 Fusion techniques
    • G06F18/251 Fusion techniques of input or preprocessed data
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/953 Querying, e.g. by the use of web search engines
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

An embodiment of the invention provides a multi-source heterogeneous data acquisition and fusion method, equipment and a storage medium. The method comprises: receiving a user request; determining an optimal query processing strategy according to the identity parameters of the data to be accessed and the access mode of the data to be accessed; acquiring an optimization processing strategy according to the access mode of the data to be accessed; acquiring the target virtual table according to the optimal query processing strategy, and finding the target data source among a plurality of initial data sources according to the target virtual table and the query scheme; and determining the query range in the target data source according to the optimization scheme, accessing the query range in the target data source in the search mode, and returning an access result. The embodiment provides a data access method that is independent of data form, related technical interfaces and data storage location, and that shields the user from differences in the underlying data.

Description

Multi-source heterogeneous data acquisition and fusion method, equipment and storage medium
Technical Field
The present invention relates to the field of computer technologies, and in particular, to a method, an apparatus, and a storage medium for multi-source heterogeneous data acquisition and fusion.
Background
In recent years, with the rapid development of advanced technologies such as artificial intelligence, big data and smart grids, the prospects of the big data industry have become both a severe test and a valuable opportunity for every power enterprise. The power industry can be regarded as a highly complex nonlinear system, and the degree of intelligence achievable in power system applications is directly determined by the ability to process heterogeneous grid operation data.
The multi-source heterogeneous big data in the current power grid system mainly originate from four aspects:
first, massive data collected on the operating state of the power grid, recording the voltage, phase angle, frequency, fault information and the like of each line;
second, massive operating-state data of power equipment, including condition-monitoring data, historical data and fault-record data for equipment such as generators, circuit breakers and transformers; because the equipment differs in model, voltage level, application and manufacturer, the data it reports is difficult to unify;
third, image data: as image recognition technology has matured, it has been widely promoted in power systems, replacing traditional manual inspection and generating massive amounts of data;
fourth, defect text logs: because inspection personnel differ in language ability and writing habits, the format and structure of the records vary, so that new records differ considerably from historical data.
This multi-source heterogeneous power data poses a significant challenge to the intelligent processing of power information. How to collect and fuse multi-source heterogeneous data in a complex information system such as a power system is attracting the attention of researchers at home and abroad.
Therefore, there is a need for a multi-source heterogeneous data collection and fusion method that is independent of data form, related technical interfaces and data storage locations.
Disclosure of Invention
In order to solve the above problems, embodiments of the present invention provide a multi-source heterogeneous data collection and fusion method, system, device and storage medium that overcome or at least partially solve the above problems.
According to a first aspect of an embodiment of the present invention, there is provided a multi-source heterogeneous data acquisition and fusion method, including:
receiving a user request, wherein the user request comprises data to be accessed and an access mode of the data to be accessed, and the access mode comprises query, write and read;
determining an optimal query processing strategy according to the identity parameters of the data to be accessed and the access mode of the data to be accessed, wherein the optimal query processing strategy comprises a target virtual table corresponding to the data to be accessed and a query scheme for the target virtual table, the query scheme comprises a mapping relationship between a target data source where the data to be accessed is located and the target virtual table, the target virtual table is one of a plurality of preset virtual tables, the preset virtual tables are obtained after different preset encapsulation tables are embedded, the preset encapsulation table is obtained by encapsulating an initial data source, and the target data source is one of a plurality of initial data sources;
according to the access mode of the data to be accessed, an optimization processing strategy is obtained, wherein the optimization processing strategy comprises an optimization scheme for optimizing the query scheme and improving the access efficiency, and the optimization scheme comprises a query range and a search mode;
acquiring the target virtual table according to the optimal query processing strategy, and finding out the target data source from a plurality of initial data sources according to the target virtual table and the query scheme;
and determining the query range in the target data source according to the optimization scheme, accessing the query range in the target data source in the search mode, and returning an access result.
Further, the query scheme includes a first mapping relationship and a second mapping relationship, and finding the target data source from the plurality of initial data sources according to the target virtual table and the query scheme includes:
searching in a plurality of preset encapsulation tables according to the target virtual table and the first mapping relation, and acquiring the target encapsulation table if the target encapsulation table can be found in the plurality of preset encapsulation tables;
if the target encapsulation table cannot be found in a plurality of preset encapsulation tables, directly generating a temporary encapsulation table, and taking the temporary encapsulation table as the target encapsulation table;
and searching in a plurality of initial data sources according to the target encapsulation table and the second mapping relation to acquire the target data source.
Further, the initial data source is obtained by:
configuring an adapter corresponding to each underlying data source type according to the type of each underlying data source;
providing a unified adapter interface between the adapter corresponding to each underlying data source type and all the underlying data sources;
and calling the unified adapter interface through the adapter corresponding to each underlying data source type, and performing data virtualization on the data information in each underlying data source to obtain a plurality of initial data sources.
Further, the types of the underlying data sources include a relational database, a NoSQL database, a Word format document, an Excel format document, a REST API service and a web page, the adapter corresponding to the relational database is a relational database adapter, the adapter corresponding to the NoSQL database is a NoSQL database adapter, the adapter corresponding to the Word format document is a Word format document adapter, the adapter corresponding to the Excel format document is an Excel format document adapter, the adapter corresponding to the REST API service is a REST API service adapter, and the adapter corresponding to the web page is a web crawler adapter.
Further, calling the unified adapter interface through the adapter corresponding to each underlying data source type and performing data virtualization on the data information in each underlying data source to obtain a plurality of initial data sources includes:
for the adapter corresponding to each underlying data source type, calling the unified adapter interface through that adapter, and importing the initial data in each underlying data source;
and converting the imported initial data of each underlying data source into initial data in a standard form, and acquiring a plurality of initial data sources.
Further, the step of converting the initial data in each imported underlying data source into initial data in a standard form, and obtaining a plurality of initial data sources further includes:
extracting initial data in a standard form, acquiring metadata information, and packaging the metadata information to obtain a plurality of preset packaging tables, wherein the preset packaging tables comprise the following information:
network location information of the server storing the underlying data source;
connection information for logging in to the database corresponding to the underlying data source, wherein the connection information comprises a database driver, a URL, a user name and a password;
the name, owner and creation date of the underlying data source;
the structure of the underlying data source, comprising each column name and annotation of the source table;
the definition of each column in the underlying data source, comprising the data type, whether the column is a primary key, and whether it may be null;
the available primary and foreign keys defined in the underlying data source;
distribution information of the number of columns in the underlying data source and the values of each column, the distribution information being extracted for query optimization;
and the number of rows recorded in the underlying data source and the storage space it occupies.
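The metadata items listed above could be held in a simple record type. The following is a minimal sketch only: the class names `ColumnDefinition` and `EncapsulationMetadata`, the field names and the sample values are all hypothetical and do not appear in the patent.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ColumnDefinition:
    """Definition of one column: data type, primary-key flag, nullability."""
    name: str
    data_type: str
    is_primary_key: bool = False
    nullable: bool = True
    comment: str = ""


@dataclass
class EncapsulationMetadata:
    """Metadata packaged into one preset encapsulation table (illustrative)."""
    server_location: str            # network location of the storing server
    connection: Dict[str, str]      # driver, URL, user name, password
    name: str
    owner: str
    creation_date: str
    columns: List[ColumnDefinition] = field(default_factory=list)
    foreign_keys: Dict[str, str] = field(default_factory=dict)
    value_distribution: Dict[str, Dict[str, int]] = field(default_factory=dict)
    row_count: int = 0
    storage_bytes: int = 0

    def primary_key(self) -> List[str]:
        """Names of the columns that make up the primary key."""
        return [c.name for c in self.columns if c.is_primary_key]


# Hypothetical sample values, for illustration only.
meta = EncapsulationMetadata(
    server_location="10.0.0.5:3306",
    connection={"driver": "mysql", "url": "jdbc:mysql://10.0.0.5/grid",
                "user": "reader", "password": "***"},
    name="breaker_status",
    owner="ops",
    creation_date="2023-05-09",
    columns=[
        ColumnDefinition("id", "INTEGER", is_primary_key=True, nullable=False),
        ColumnDefinition("voltage_kv", "REAL", comment="rated voltage"),
    ],
    value_distribution={"voltage_kv": {"110": 40, "220": 12}},
    row_count=52,
    storage_bytes=4096,
)
print(meta.primary_key())
```

Keeping the value distribution and row count next to the column definitions is what would let a query optimizer estimate the cost of a plan without touching the underlying source.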
Further, the extracting the initial data in the standard form, obtaining metadata information, and packaging the metadata information to obtain a plurality of preset packaging tables, including:
determining a mapping corresponding to each preset encapsulation table, wherein the mapping comprises row selection, column selection, column joining, column conversion, and changes to column and table names;
and converting each preset encapsulation table through the mapping corresponding to each preset encapsulation table to obtain each preset virtual table.
Further, the accessing the query range in the target data source by the searching mode, and returning an access result includes:
and optimizing and executing the query statement corresponding to the search mode in a data federation query or real-time mirroring mode, extracting data from the cache or the target data source end, merging and assembling, and returning the access result in a data format contained in the user request.
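The federated-query branch of this step can be sketched as follows. This is an illustration under stated assumptions, not the patented implementation: the function `federated_query`, the in-memory `cache` dictionary and the toy sources are hypothetical, and real-time mirroring is not modelled.

```python
import json
from typing import Callable, Dict, List

Row = Dict[str, object]


def federated_query(sources: Dict[str, Callable[[], List[Row]]],
                    predicate: Callable[[Row], bool],
                    cache: Dict[str, List[Row]],
                    cache_key: str,
                    result_format: str = "json"):
    """Answer from the cache when possible; otherwise query every source,
    merge the matching rows, and cache the merged result."""
    if cache_key in cache:
        merged = cache[cache_key]
    else:
        merged = []
        for source_name, fetch in sources.items():
            for row in fetch():
                if predicate(row):
                    merged.append({**row, "source": source_name})
        cache[cache_key] = merged
    # Return the result in the format named in the request.
    if result_format == "json":
        return json.dumps(merged, sort_keys=True)
    return merged


calls = {"scada": 0, "logs": 0}


def fetch_scada() -> List[Row]:
    calls["scada"] += 1
    return [{"line": "L1", "fault": True}, {"line": "L2", "fault": False}]


def fetch_logs() -> List[Row]:
    calls["logs"] += 1
    return [{"line": "L3", "fault": True}]


sources = {"scada": fetch_scada, "logs": fetch_logs}
cache: Dict[str, List[Row]] = {}
first = federated_query(sources, lambda r: r["fault"], cache, "faults", "rows")
second = federated_query(sources, lambda r: r["fault"], cache, "faults", "json")
```

The second call is served from the cache, so neither source is contacted again; only the output format differs.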
According to a second aspect of an embodiment of the present invention, there is provided an electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing any of the multi-source heterogeneous data collection and fusion methods as provided in the first aspect when executing the program.
According to a fourth aspect of embodiments of the present invention, there is provided a non-transitory computer readable storage medium storing a computer program which, when executed by a processor, implements any of the multi-source heterogeneous data collection and fusion methods as provided in the first aspect.
The embodiment of the invention provides a multi-source heterogeneous data acquisition and fusion method, equipment and a storage medium. To use them, a user request is simply sent, and an optimal query processing strategy and a target virtual table are determined according to the information in the user request; an optimization processing strategy is acquired according to the access mode of the data to be accessed, and by optimizing the query range and the search mode in the processing strategy, the search range is narrowed and a suitable search mode is adopted, which improves search efficiency; finally, the query range of the target data source is accessed in the search mode, and the access result is determined.
The target virtual table is obtained by embedding and encapsulating different initial data sources, so that the method can collect underlying data sources from different services, different systems and different architectures, integrate and abstract them to define new data objects, and finally realize data exchange and data fusion across the underlying data sources;
when a user accesses data, the user does not need to know the data interface used by the underlying data, does not need to be concerned with the form of the data, and does not need to go to the storage location of the data; the data to be accessed can be accessed simply by sending a corresponding user request.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below. It is apparent that the drawings in the following description are only some embodiments of the present invention, and that other drawings may be obtained from these drawings without inventive effort for a person of ordinary skill in the art.
FIG. 1 is a flowchart of a method for multi-source heterogeneous data collection and fusion according to an embodiment of the present invention;
FIG. 2 is a flowchart of a method for multi-source heterogeneous data collection and fusion according to another embodiment of the present invention;
FIG. 3 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the embodiments of the present invention more apparent, the technical solutions of the embodiments of the present invention will be clearly described below with reference to the accompanying drawings in the embodiments of the present invention, and it is apparent that the described embodiments are some embodiments of the present invention, but not all embodiments of the present invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
Furthermore, the terms "first," "second," and the like, are used for descriptive purposes only and are not to be construed as indicating or implying a relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defining "a first" or "a second" may explicitly or implicitly include at least one such feature. In the description of the present invention, the meaning of "plurality" means at least two, for example, two, three, etc., unless specifically defined otherwise.
FIG. 1 is a flowchart of a method for collecting and fusing multi-source heterogeneous data according to an embodiment of the present invention. As shown in FIG. 1, the method includes:
S110, receiving a user request, wherein the user request comprises data to be accessed and an access mode of the data to be accessed, and the access mode comprises query, write and read;
First, a user request is received. The user request may be a request to query, write or read the data to be accessed.
The user request comprises the data to be accessed and the access mode of the data to be accessed, the access mode being one or more of query, write and read. When the access mode is query or read, the data to be accessed may consist only of a query condition, a query ID or a read ID; when the access mode is write, the data to be accessed may be the data value to be written.
S120, determining an optimal query processing strategy according to the identity parameters of the data to be accessed and the access mode of the data to be accessed, wherein the optimal query processing strategy comprises a target virtual table corresponding to the data to be accessed and a query scheme for the target virtual table, the query scheme comprises a mapping relationship between a target data source where the data to be accessed is located and the target virtual table, the target virtual table is one of a plurality of preset virtual tables, the preset virtual tables are obtained after different preset encapsulation tables are embedded, the preset encapsulation table is obtained by encapsulating an initial data source, and the target data source is one of a plurality of initial data sources;
When a user initiates a user request, the query engine determines the optimal query processing strategy and the performance optimization measures, and then computes, optimizes and returns the query result. The optimal query processing strategy is the execution scheme and flow that the system produces according to the way in which the user request accesses the target data.
The optimal query processing strategy comprises a target virtual table corresponding to data to be accessed and a query scheme for the target virtual table, wherein the query scheme comprises a mapping relation between a target data source where the data to be accessed is located and the target virtual table, the target virtual table is one of a plurality of preset virtual tables, the preset virtual tables are obtained after different preset encapsulation tables are embedded, the preset encapsulation table is obtained by encapsulating an initial data source, and the target data source is one of the plurality of initial data sources.
S130, acquiring an optimization processing strategy according to the access mode of the data to be accessed, wherein the optimization processing strategy comprises an optimization scheme for optimizing the query scheme to improve the access efficiency, and the optimization scheme comprises a query range and a search mode;
The performance optimization measures are the optimizations applied to the query process, after the system has determined the data access mode, in order to improve query efficiency. The query optimization process determines the best processing strategy for a query, and various techniques can be deployed to optimize the query entered by the user.
When working with languages such as SQL, MDX, XSLT and XQuery, developers need only specify what data they need; they do not need to state how the data is to be retrieved from the data store. This is why these languages are sometimes called declarative languages. For example, the following SQL query retrieves the customers located in Tulsa:
SELECT *
FROM CUSTOMER
WHERE CITY_NAME = 'Tulsa'
This statement gives no indication of where the CUSTOMER table is stored or of whether an index should be used to find the rows. In a declarative language, it is only necessary to specify "what", not "how". Finding the best way to retrieve the data, which is called the processing strategy, is the responsibility of the database server. The module responsible for this task is called the optimizer. The better the optimizer is at determining the processing strategy, the better the performance of the query.
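The division of labour between the declarative query and the optimizer can be observed with Python's built-in `sqlite3` module. This is a small illustration only: the table contents and the index name `IDX_CUSTOMER_CITY` are made up, and SQLite stands in for whatever database server is actually used.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE CUSTOMER (ID INTEGER PRIMARY KEY, NAME TEXT, CITY_NAME TEXT)"
)
conn.execute("CREATE INDEX IDX_CUSTOMER_CITY ON CUSTOMER (CITY_NAME)")
conn.executemany(
    "INSERT INTO CUSTOMER (NAME, CITY_NAME) VALUES (?, ?)",
    [("Ann", "Tulsa"), ("Bob", "Austin"), ("Cy", "Tulsa")],
)

# The query states only WHAT is wanted.
query = "SELECT * FROM CUSTOMER WHERE CITY_NAME = 'Tulsa'"
rows = conn.execute(query).fetchall()

# Ask the optimizer which processing strategy it chose; the last column of
# each EXPLAIN QUERY PLAN row is a human-readable description.
plan = " ".join(str(r[-1]) for r in conn.execute("EXPLAIN QUERY PLAN " + query))

print(sorted(name for _, name, _ in rows))
print(plan)
```

The query text never mentions the index, yet the reported plan names it: the choice of access path was made by the optimizer, not by the author of the query.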
To determine the best processing strategy, every optimizer must consider the expected number of I/O operations and the processing time. The optimizer of the multi-source heterogeneous data intelligent fusion model system must consider further aspects, which makes it more complex. First, the required data may be stored in multiple data stores, and the data from these stores needs to be integrated. Second, these data stores may use a language and API different from the query language in which the query is stated, so (part of) the incoming query may need to be converted into another query language, for example from SQL to XQuery. Third, the total amount of data transmitted between the data stores and the multi-source heterogeneous data intelligent fusion model system should be kept as small as possible. The optimizer of a single database server does not need to address these three issues, but the optimizer within the multi-source heterogeneous data intelligent fusion model system does.
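The third concern, limiting the data transmitted between a data store and the fusion system, is commonly addressed by evaluating filters inside the store (predicate pushdown). The patent does not name this technique; the sketch below, with its hypothetical `CountingStore` class, only illustrates why it reduces transfer volume.

```python
from typing import Callable, Dict, List, Optional

Row = Dict[str, object]
Predicate = Callable[[Row], bool]


class CountingStore:
    """Toy data store that counts the rows it transmits to the caller."""

    def __init__(self, rows: List[Row]) -> None:
        self._rows = rows
        self.rows_sent = 0

    def scan(self, predicate: Optional[Predicate] = None) -> List[Row]:
        # A predicate passed here is evaluated inside the store (pushdown).
        result = [r for r in self._rows if predicate is None or predicate(r)]
        self.rows_sent += len(result)
        return result


def query_without_pushdown(store: CountingStore, predicate: Predicate) -> List[Row]:
    # Every row is transmitted, then filtered by the fusion system.
    return [r for r in store.scan() if predicate(r)]


def query_with_pushdown(store: CountingStore, predicate: Predicate) -> List[Row]:
    # Only matching rows are transmitted.
    return store.scan(predicate)


rows = [{"id": i, "city": "Tulsa" if i % 10 == 0 else "Other"} for i in range(100)]


def in_tulsa(row: Row) -> bool:
    return row["city"] == "Tulsa"


naive, pushed = CountingStore(rows), CountingStore(rows)
naive_result = query_without_pushdown(naive, in_tulsa)
pushed_result = query_with_pushdown(pushed, in_tulsa)
print(naive.rows_sent, pushed.rows_sent)
```

Both calls return the same rows, but the store that evaluates the filter itself transmits only the matching ones.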
S140, acquiring the target virtual table according to the optimal query processing strategy, and finding out the target data source from a plurality of initial data sources according to the target virtual table and the query scheme;
According to the optimal query processing strategy, the target virtual table is acquired, and according to the target virtual table and the query scheme, the target data source is found among the plurality of initial data sources.
And S150, determining the query range in the target data source according to the optimization scheme, accessing the query range in the target data source in the search mode, and returning an access result.
Finally, the query range in the target data source is determined by means of the optimization scheme. By optimizing the query range and the search mode in the processing strategy, the search range is narrowed and a suitable search mode is adopted, which improves search efficiency; the query range of the target data source is then accessed in the search mode, and the access result is determined.
With the multi-source heterogeneous data acquisition and fusion method provided by the embodiment of the invention, a user request is simply sent when the method is to be used, and an optimal query processing strategy and a target virtual table are determined according to the information in the user request; an optimization processing strategy is acquired according to the access mode of the data to be accessed, and by optimizing the query range and the search mode in the processing strategy, the search range is narrowed and a suitable search mode is adopted, which improves search efficiency; finally, the query range of the target data source is accessed in the search mode, and the access result is determined.
The target virtual table is obtained by embedding and packaging different initial data sources, so that the method can collect the bottom data sources from different services, different systems and different architectures, integrate and abstract the bottom data sources to define new data objects, and finally realize data exchange and data fusion of the bottom data sources;
when a user accesses data, the user does not need to know the data interface used by the underlying data, does not need to be concerned with the form of the data, and does not need to go to the storage location of the data; the data to be accessed can be accessed simply by sending a corresponding user request.
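Steps S110 to S150 can be read as a small pipeline. The sketch below is one possible reading under stated assumptions, not the patented implementation: the catalogue `STRATEGIES`, the toy `SOURCES`, the ID-based query range and the `handle` function are all hypothetical, and write requests are not modelled.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

Row = Dict[str, object]


@dataclass
class UserRequest:          # S110: what arrives from the user
    data_id: str            # identity parameter of the data to be accessed
    mode: str               # "query" or "read"
    low: int                # hypothetical lower bound of the requested IDs
    high: int               # hypothetical upper bound (ignored for "read")


# Hypothetical catalogue: identity parameter -> (target virtual table, source)
STRATEGIES: Dict[str, Tuple[str, str]] = {"customer": ("v_customer", "crm_db")}

# Hypothetical initial data sources
SOURCES: Dict[str, List[Row]] = {
    "crm_db": [{"id": i, "name": f"customer-{i}"} for i in range(1, 101)],
    "billing_db": [],
}


def handle(request: UserRequest) -> Dict[str, object]:
    # S120: optimal query processing strategy (virtual table + query scheme)
    virtual_table, source_name = STRATEGIES[request.data_id]
    # S130: optimization scheme derived from the access mode
    if request.mode == "read":
        query_range, search_mode = (request.low, request.low), "point lookup"
    else:
        query_range, search_mode = (request.low, request.high), "range scan"
    # S140: find the target data source among the initial data sources
    target_source = SOURCES[source_name]
    # S150: access only the query range and return the access result
    low, high = query_range
    rows = [row for row in target_source if low <= row["id"] <= high]
    return {"virtual_table": virtual_table, "search_mode": search_mode, "rows": rows}


queried = handle(UserRequest("customer", "query", 3, 5))
read = handle(UserRequest("customer", "read", 3, 0))
print(queried["search_mode"], [r["id"] for r in queried["rows"]])
```

The caller supplies only an identity parameter and an access mode; which table and which source are consulted is decided entirely inside `handle`.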
In some embodiments, the query scheme includes a first mapping relationship and a second mapping relationship, and in step S140, finding the target data source from the plurality of initial data sources according to the target virtual table and the query scheme includes:
S141, searching in a plurality of preset encapsulation tables according to the target virtual table and the first mapping relation, and acquiring the target encapsulation table if the target encapsulation table can be found in the plurality of preset encapsulation tables;
S142, if the target encapsulation table cannot be found in a plurality of preset encapsulation tables, directly generating a temporary encapsulation table, and taking the temporary encapsulation table as the target encapsulation table;
S143, searching in a plurality of initial data sources according to the target encapsulation table and the second mapping relation to obtain the target data source.
The user knows neither how the data is stored nor where it is stored. Whether the data is held in an SQL database or a spreadsheet, or is extracted from a website, is hidden from the user.
Various APIs are provided in this embodiment for accessing these tables. A user may access the virtual table through a conventional JDBC/SQL interface, through MDX, or through a SOAP-based interface. The JDBC/SQL user sees the data as ordinary tables, the MDX application sees multidimensional data, and through the SOAP-based interface the data is returned in the format of an HTML file. Regardless of the form, every user sees the same data, without knowing where or how the data in the table is stored. The mapping from the virtual table to the initial data source ensures that the correct multi-source heterogeneous data is delivered to the user.
Since the owners of the various original data sources may open all or part of their data, the open data may be raw data or, more often, processed data. The encapsulation tables correspond to the different initial data sources and encapsulate the interfaces to the open data. Virtual tables are defined on top of the encapsulation tables; virtual tables can be combined and nested, and once defined, a virtual table can be published as a data service. A data service is concerned mainly with how data resources are acquired and integrated, whereas the definition of a virtual table is concerned with the data itself, so the underlying data required by a data service can be presented in the form of a virtual table.
It is the mapping from the preset virtual table to the initial data source that ensures that the multi-source heterogeneous data intelligent fusion model system delivers correct data to data consumers. The relationship between the preset virtual table, the mapping and the preset encapsulation table must therefore be understood.
The preset virtual table is based on a preset encapsulation table, and the preset encapsulation table is based on an initial data source. The relationship between the preset encapsulation table and the initial data source is many-to-one, and one or more preset encapsulation tables can be defined according to one initial data source. The process of defining the preset virtual table is also a process of defining the mapping, and defines the preset virtual table on the basis of the preset encapsulation table. The mapping corresponds to a query definition for the preset virtual table, including the structure of the preset virtual table (row, column selection, column conversion, table name change, grouping, etc.), how the data is converted into the content of the preset virtual table, etc.
If there is no mapping, the preset virtual table is an empty table without contents. Therefore, to ensure a correct mapping, the relationships between the data in the preset encapsulation table must be correctly analyzed, ensuring that the definition from the initial data source to the preset encapsulation table to the preset virtual table is accurate.
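A mapping of this kind can be pictured as a small query definition applied to the rows of an encapsulation table. The following sketch is illustrative only: the function `build_virtual_table`, the dictionary layout of `mapping` and the sample column names are assumptions, and grouping is omitted.

```python
from typing import Dict, List, Optional

Row = Dict[str, object]


def build_virtual_table(encapsulated_rows: List[Row],
                        mapping: Optional[dict]) -> List[Row]:
    """Apply row selection, column selection/renaming and value conversion."""
    if mapping is None:
        # Without a mapping the virtual table has no content.
        return []
    keep = mapping.get("row_filter", lambda row: True)
    converters = mapping.get("converters", {})
    virtual_rows = []
    for row in encapsulated_rows:
        if not keep(row):                       # row selection
            continue
        virtual_row = {}
        # column selection and renaming: source column -> virtual column
        for source_column, virtual_column in mapping["columns"].items():
            value = row[source_column]
            convert = converters.get(source_column)
            virtual_row[virtual_column] = convert(value) if convert else value
        virtual_rows.append(virtual_row)
    return virtual_rows


wrapper_rows = [
    {"DEV_ID": 1, "VOLT_V": 110000, "STATUS": "ok"},
    {"DEV_ID": 2, "VOLT_V": 220000, "STATUS": "fault"},
]
mapping = {
    "row_filter": lambda row: row["STATUS"] == "fault",
    "columns": {"DEV_ID": "device_id", "VOLT_V": "voltage_kv"},
    "converters": {"VOLT_V": lambda volts: volts / 1000},
}
faulty_devices = build_virtual_table(wrapper_rows, mapping)
empty = build_virtual_table(wrapper_rows, None)
print(faulty_devices)
```

Because the virtual table is computed from the encapsulation table on demand, an error in the mapping shows up directly as wrong or missing content, which is why the definitions along the chain must be accurate.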
In this embodiment, the query scheme includes a first mapping relationship and a second mapping relationship, where the first mapping relationship refers to a mapping relationship between the target virtual table and the target encapsulation table, and the second mapping relationship refers to a mapping relationship between the target encapsulation table and the target data source.
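Steps S141 to S143 amount to two successive lookups with a fallback. A minimal sketch follows, in which the function `resolve_target_source`, the dictionary-based mappings and the `system_metadata` catalogue used to build the temporary encapsulation table are all hypothetical.

```python
from typing import Dict


def resolve_target_source(virtual_table: str,
                          first_mapping: Dict[str, str],
                          second_mapping: Dict[str, str],
                          preset_tables: Dict[str, dict],
                          system_metadata: Dict[str, str]) -> str:
    """Follow virtual table -> encapsulation table -> data source."""
    # S141: look for the target encapsulation table among the preset ones.
    table_name = first_mapping.get(virtual_table)
    if table_name is None or table_name not in preset_tables:
        # S142: not found, so generate a temporary encapsulation table.
        # Assumption: stored system metadata names the source it wraps.
        table_name = f"tmp_{virtual_table}"
        preset_tables[table_name] = {"temporary": True}
        second_mapping[table_name] = system_metadata[virtual_table]
    # S143: use the second mapping to reach the target data source.
    return second_mapping[table_name]


first_mapping = {"v_customer": "w_customer"}
second_mapping = {"w_customer": "crm_db"}
preset_tables = {"w_customer": {"temporary": False}}
system_metadata = {"v_customer": "crm_db", "v_orders": "erp_db"}

known = resolve_target_source("v_customer", first_mapping, second_mapping,
                              preset_tables, system_metadata)
fallback = resolve_target_source("v_orders", first_mapping, second_mapping,
                                 preset_tables, system_metadata)
print(known, fallback)
```

In this sketch the temporary table is registered alongside the preset ones, so a repeated request for the same virtual table would still resolve; whether temporary tables are retained is not stated in the source.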
In some embodiments, fig. 2 is a flowchart of a method for multi-source heterogeneous data collection and fusion according to another embodiment of the present invention, as shown in fig. 2, where the initial data source is obtained by:
configuring an adapter corresponding to each underlying data source type according to the type of each underlying data source;
providing a unified adapter interface between the adapter corresponding to each underlying data source type and all the underlying data sources;
and calling the unified adapter interface through the adapter corresponding to each underlying data source type, and performing data virtualization on the data information in each underlying data source to obtain a plurality of initial data sources.
If the target virtual table corresponding to the query is not predefined, organizing related metadata required by the query according to metadata information stored by the system, and generating a corresponding temporary virtual table. And mapping the target virtual table and the target encapsulation table is implemented, so that the underlying data source is accessed.
In some embodiments, the types of the underlying data sources include a relational database, a NoSQL database, a Word format document, an Excel format document, a REST API service, and a web page, where the adapter corresponding to the relational database is a relational database adapter, the adapter corresponding to the NoSQL database is a NoSQL database adapter, the adapter corresponding to the Word format document is a Word format document adapter, the adapter corresponding to the Excel format document is an Excel format document adapter, the adapter corresponding to the REST API service is a REST API service adapter, and the adapter corresponding to the web page is a web crawler adapter.
In some embodiments, the calling the unified adapter interface through the adapter corresponding to each bottom data source type performs data virtualization on the data information in each bottom data source to obtain a plurality of initial data sources, including:
for the adapter corresponding to each bottom data source type, calling the unified adapter interface through the adapter corresponding to each bottom data source type, and importing initial data in each bottom data source;
and converting the initial data in each imported underlying data source into initial data in a standard form, and acquiring a plurality of initial data sources.
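The patent does not define the standard form. As one possible reading, the sketch below normalizes imported records into dictionaries with lower-cased, trimmed column names and fills missing columns with `None`; `to_standard_form` and this choice of standard form are assumptions.

```python
def to_standard_form(records, columns=None):
    """Normalize imported records into uniform dictionaries.

    records: dictionaries, or sequences whose column names are given by `columns`.
    """
    rows = []
    for record in records:
        if isinstance(record, dict):
            row = {str(key).strip().lower(): value for key, value in record.items()}
        else:
            if columns is None:
                raise ValueError("column names are required for sequence records")
            row = {str(name).strip().lower(): value for name, value in zip(columns, record)}
        rows.append(row)

    # Collect every column seen, in first-seen order, so that all rows share one schema.
    all_columns = []
    for row in rows:
        for name in row:
            if name not in all_columns:
                all_columns.append(name)

    return [{name: row.get(name) for name in all_columns} for row in rows]
```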
Data virtualization techniques help to reduce reliance on physical storage systems and provide a uniform interface for all applications that use data, particularly business intelligence systems, analytics systems, and transactional systems. Data virtualization technology therefore has irreplaceable advantages and can fully meet the requirements of complex multi-source heterogeneous information acquisition and data fusion in the informationized environment of the target power grid.
The underlying data sources are the raw data provided by the individual systems, which may be structured or unstructured; they may come from relational or non-relational databases and vary widely in type and structure.
The multi-source heterogeneous data sources provided by the owners of the underlying data sources are managed through a unified interface, so that the access details of the different data sources are hidden from users. Source data is obtained and transmitted through interfaces such as ODBC/JDBC, JSON, and APIs, and the data resources required by the user are finally delivered.
Of particular note: the underlying data source only manages the access interfaces of the various physical data sources and does not need to know how the physical source data is organized, stored, or managed; the physical data sources are managed by their owners, who expose views of all or part of the source data according to their own policies.
For multi-source heterogeneous data, the connection between the data and the multi-source heterogeneous data intelligent fusion model system is realized through a unified multi-source heterogeneous acquisition adapter. External data resources connected through the unified adapter interface are collectively referred to as the underlying data sources. Depending on the data type of each underlying data source, the unified adapter interface calls the relational database adapter, the NoSQL database adapter, the Word format document adapter, the Excel format document adapter, the REST API service adapter, or the web crawler adapter to read and parse the relevant data.
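The dispatch from a unified adapter interface to type-specific adapters could look like the sketch below. It is only an illustration: all class and function names are hypothetical, the relational adapter uses an in-process SQLite connection, and the adapter standing in for the REST API service parses a JSON payload instead of calling a real service.

```python
import json
import sqlite3
from abc import ABC, abstractmethod


class SourceAdapter(ABC):
    """Unified interface that every type-specific adapter implements (hypothetical)."""

    @abstractmethod
    def read(self, locator: str) -> list[dict]:
        """Read the data identified by `locator` and return it as rows."""


class RelationalAdapter(SourceAdapter):
    def __init__(self, connection: sqlite3.Connection):
        self.connection = connection

    def read(self, locator: str) -> list[dict]:
        # `locator` is a table name; it is assumed to come from trusted configuration.
        cursor = self.connection.execute(f'SELECT * FROM "{locator}"')
        names = [description[0] for description in cursor.description]
        return [dict(zip(names, row)) for row in cursor.fetchall()]


class JsonPayloadAdapter(SourceAdapter):
    """Stand-in for a REST API service adapter: parses a JSON array of objects."""

    def read(self, locator: str) -> list[dict]:
        return json.loads(locator)


ADAPTERS: dict[str, SourceAdapter] = {}


def register(source_type: str, adapter: SourceAdapter) -> None:
    ADAPTERS[source_type] = adapter


def read_source(source_type: str, locator: str) -> list[dict]:
    """Single entry point: choose the adapter by the underlying data source type."""
    return ADAPTERS[source_type].read(locator)
```

Callers only see `read_source`, so adding an adapter for another source type does not change calling code.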
In some embodiments, after converting the initial data in each imported underlying data source into initial data in a standard form and obtaining a plurality of initial data sources, the method further includes:
extracting the initial data in the standard form, acquiring metadata information, and encapsulating the metadata information to obtain a plurality of preset encapsulation tables, wherein each preset encapsulation table includes the following information:
the network location of the server storing the underlying data source;
the connection information for logging in to the database corresponding to the underlying data source, including the database driver, URL, user name, and password;
the name, owner, and creation date of the underlying data source;
the structure of the underlying data source, including each column name and comment of the source table;
the definition of each column in the underlying data source, including its data type, whether it is a primary key, and whether it may be null;
the available primary and foreign keys defined in the underlying data source;
the number of columns in the underlying data source and the value distribution of each column, the distribution information being extracted for query optimization;
and the number of records (rows) and the storage space occupied in the underlying data source.
When a source table is imported, an encapsulation table (or encapsulation for short) must be defined. Some products use other names for it, such as base view or view. Only very little metadata is extracted during import and stored in its dictionary by the multi-source heterogeneous data intelligent fusion model system. All of this metadata information is assigned to the definition of the encapsulation table. The encapsulation table contains all metadata information corresponding to the underlying data source, possibly including the content listed above.
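A small part of this metadata extraction can be sketched against SQLite, whose `PRAGMA table_info` reports column name, declared type, the not-null flag, and primary-key position. The `EncapsulationTable` and `ColumnMeta` structures, the `encapsulate` function, and the connection URL string are hypothetical; owner, creation date, foreign keys, and value distributions are omitted for brevity.

```python
import sqlite3
from dataclasses import dataclass


@dataclass
class ColumnMeta:
    name: str
    data_type: str
    primary_key: bool
    nullable: bool


@dataclass
class EncapsulationTable:
    source_name: str
    connection_url: str        # hypothetical connection information
    columns: list[ColumnMeta]
    row_count: int


def encapsulate(connection, table, connection_url):
    """Extract column definitions and the row count of `table` into an encapsulation table."""
    # Each PRAGMA row is (cid, name, type, notnull, dflt_value, pk).
    info = connection.execute(f'PRAGMA table_info("{table}")').fetchall()
    columns = [
        ColumnMeta(name=row[1], data_type=row[2], primary_key=bool(row[5]), nullable=not row[3])
        for row in info
    ]
    row_count = connection.execute(f'SELECT COUNT(*) FROM "{table}"').fetchone()[0]
    return EncapsulationTable(table, connection_url, columns, row_count)
```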
In some embodiments, accessing the query range in the target data source in the search mode and returning the access result includes:
optimizing and executing the query statement corresponding to the search mode in a data federation query mode or a real-time mirroring mode, extracting data from the cache or from the target data source, merging and assembling the data, and returning the access result in the data format specified in the user request.
In the data federation query mode, data is accessed uniformly by connecting to the data sources, access is accelerated with a cache, and the data is not replicated in full. When an application issues a data request, the federated query optimizes and executes the query statement, extracts data from the cache or from the data source, merges and assembles the data, and returns the result in the format required by the application.
Advantages: no full copy is needed, and the hardware cost is low.
Disadvantages: queries are intrusive to the data source system and require coordination with the source system; the response latency is high, and no latency commitment for query responses can be made.
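The federated mode can be sketched as a cache placed in front of the sources plus a merge step. This is a simplified illustration with hypothetical names (`FederatedQuery`, `fetch`, `join`); query optimization and cache invalidation are not modeled.

```python
class FederatedQuery:
    """Fetch rows from several sources on demand, cache them, and merge the results."""

    def __init__(self, sources):
        self.sources = sources      # source name -> callable returning a list of row dicts
        self.cache = {}
        self.source_calls = 0       # how often a source system was actually queried

    def fetch(self, source_name):
        # Query the source system only on a cache miss.
        if source_name not in self.cache:
            self.source_calls += 1
            self.cache[source_name] = list(self.sources[source_name]())
        return self.cache[source_name]

    def join(self, left, right, key):
        """Merge rows of two sources that share the same value for `key`."""
        index = {}
        for row in self.fetch(right):
            index.setdefault(row[key], []).append(row)
        merged = []
        for row in self.fetch(left):
            for match in index.get(row[key], []):
                merged.append({**row, **match})
        return merged
```

Each cache miss is a query against a source system, which is where the intrusiveness and the latency noted above come from.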
In the real-time mirroring mode, data virtualization requires a centralized data repository: the data of all data sources is mirrored 1:1 to the central data store, and data modeling and unified management are then performed on the central data store.
Advantages: there is little or no impact on the source system, and sub-second query responses can be achieved.
Disadvantages: additional storage cost is required.
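The mirroring mode can be illustrated by copying a source table 1:1 into a central store and querying only the copy afterwards. The sketch uses two SQLite connections and a one-shot full copy as a stand-in for continuous synchronization, which is not modeled here; `mirror_table` is a hypothetical name.

```python
import sqlite3


def mirror_table(source, central, table):
    """Copy `table` from the source connection into the central store; return the row count."""
    columns = [row[1] for row in source.execute(f'PRAGMA table_info("{table}")')]
    column_list = ", ".join(f'"{name}"' for name in columns)

    # Recreate the mirror so that it matches the source exactly.
    central.execute(f'DROP TABLE IF EXISTS "{table}"')
    central.execute(f'CREATE TABLE "{table}" ({column_list})')

    rows = source.execute(f'SELECT {column_list} FROM "{table}"').fetchall()
    placeholders = ", ".join("?" for _ in columns)
    central.executemany(f'INSERT INTO "{table}" VALUES ({placeholders})', rows)
    central.commit()
    return len(rows)
```

Queries then run against `central` only, which is why the source system is barely affected while the storage requirement doubles.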
FIG. 3 is a schematic structural diagram of an electronic device according to an embodiment of the present invention. As shown in FIG. 3, the electronic device includes a processor 301, a communication interface 302, a memory 303, and a communication bus 304, wherein the processor 301, the communication interface 302, and the memory 303 communicate with one another through the communication bus 304. The processor 301 may call a computer program stored in the memory 303 and executable on the processor 301 to perform the multi-source heterogeneous data acquisition and fusion method provided by the above embodiments.
Further, the logic instructions in the memory 303 may be implemented in the form of software functional units and, when sold or used as a stand-alone product, stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention, in essence or in the part contributing to the prior art, or a part of the technical solution, may be embodied in the form of a software product stored in a storage medium, comprising several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to perform all or part of the steps of the methods of the embodiments of the present invention. The aforementioned storage medium includes a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disk, or any other medium capable of storing program code.
The embodiments of the present invention also provide a non-transitory computer-readable storage medium having stored thereon a computer program which, when executed by a processor, performs the multi-source heterogeneous data acquisition and fusion method provided by the above embodiments.
The above-described embodiments of the electronic device and the like are merely illustrative. Units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units; they may be located in one place or distributed over a plurality of network elements. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. Those of ordinary skill in the art can understand and implement the present invention without creative effort.
From the above description of the embodiments, it will be apparent to those skilled in the art that the embodiments may be implemented by means of software plus necessary general hardware platforms, or of course may be implemented by means of hardware. Based on such understanding, the foregoing technical solutions may be embodied essentially or in part in the form of a software product, which may be stored in a computer-readable storage medium, such as a ROM/RAM, a magnetic disk, an optical disk, etc., including several instructions to cause a computer device (which may be a personal computer, a server, or a network device, etc.) to perform the various embodiments or some part of the methods of the embodiments.
Finally, it should be noted that: the above embodiments are only for illustrating the technical solution of the present invention, and are not limiting; although the invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims (10)

1. A multi-source heterogeneous data acquisition and fusion method, comprising:
receiving a user request, wherein the user request comprises data to be accessed and an access mode of the data to be accessed, and the access mode comprises inquiry, writing and reading;
determining an optimal query processing strategy according to the identity parameters of the data to be accessed and the access mode of the data to be accessed, wherein the optimal query processing strategy comprises a target virtual table corresponding to the data to be accessed and a query scheme for the target virtual table, the query scheme comprises a mapping relationship between a target data source where the data to be accessed is located and the target virtual table, the target virtual table is one of a plurality of preset virtual tables, the preset virtual tables are obtained after different preset encapsulation tables are embedded, the preset encapsulation table is obtained by encapsulating an initial data source, and the target data source is one of a plurality of initial data sources;
according to the access mode of the data to be accessed, an optimization processing strategy is obtained, wherein the optimization processing strategy comprises an optimization scheme for optimizing the query scheme and improving the access efficiency, and the optimization scheme comprises a query range and a search mode;
acquiring the target virtual table according to the optimal query processing strategy, and finding out the target data source from a plurality of initial data sources according to the target virtual table and the query scheme;
and determining the query range in the target data source according to the optimization scheme, accessing the query range in the target data source in the search mode, and returning an access result.
2. The multi-source heterogeneous data collection and fusion method of claim 1, wherein the target data source is found from a plurality of initial data sources according to the target virtual table and the query scheme, the query scheme including a first mapping relationship and a second mapping relationship, comprising:
searching in a plurality of preset encapsulation tables according to the target virtual table and the first mapping relation, and acquiring the target encapsulation table if the target encapsulation table can be found in the plurality of preset encapsulation tables;
if the target encapsulation table cannot be found in a plurality of preset encapsulation tables, directly generating a temporary encapsulation table, and taking the temporary encapsulation table as the target encapsulation table;
and searching in a plurality of initial data sources according to the target encapsulation table and the second mapping relation to acquire the target data source.
3. The multi-source heterogeneous data collection and fusion method of claim 1, wherein the initial data source is obtained by:
configuring an adapter corresponding to each bottom data source type according to the type of each bottom data source;
a unified adapter interface is arranged between the adapter corresponding to each bottom data source type and all the bottom data sources;
and calling the unified adapter interface through the adapter corresponding to each bottom data source type, and carrying out data virtualization on the data information in each bottom data source to obtain a plurality of initial data sources.
4. The multi-source heterogeneous data collection and fusion method according to claim 3, wherein the types of the underlying data sources comprise a relational database, a NoSQL database, a Word format document, an Excel format document, a REST API service and a web page, the adapters corresponding to the relational database are relational database adapters, the adapters corresponding to the NoSQL database are NoSQL database adapters, the adapters corresponding to the Word format document are Word format document adapters, the adapters corresponding to the Excel format document are Excel format document adapters, the adapters corresponding to the REST API service are REST API service adapters, and the adapters corresponding to the web page are web crawler adapters.
5. The method for collecting and fusing heterogeneous data according to claim 3, wherein said calling the unified adapter interface through the adapter corresponding to each bottom data source type, performing data virtualization on the data information in each bottom data source, and obtaining a plurality of initial data sources includes:
for the adapter corresponding to each bottom data source type, calling the unified adapter interface through the adapter corresponding to each bottom data source type, and importing initial data in each bottom data source;
and converting the initial data in each imported underlying data source into initial data in a standard form, and acquiring a plurality of initial data sources.
6. The method for multi-source heterogeneous data collection and fusion according to claim 3, wherein the step of converting the initial data in each imported underlying data source into initial data in a standard form, and obtaining a plurality of initial data sources further comprises the steps of:
extracting initial data in a standard form, acquiring metadata information, and packaging the metadata information to obtain a plurality of preset packaging tables, wherein the preset packaging tables comprise the following information:
storing network position information of a server of the bottom data source;
logging in connection information of a database corresponding to the bottom data source, wherein the connection information comprises a database driver, a URL, a user name and a password;
the name, owner and creation date of the underlying data source;
the structure of the underlying data source comprises each column name and annotation of the source table;
the definition of each column in the bottom data source comprises a data type, a main key and a null;
the available primary and foreign keys defined in the underlying data source;
distribution information of the number of columns in the underlying data source and the value of each column, the distribution information being extracted for query optimization;
and recording line number information and occupied storage information in the bottom data source.
7. The method for collecting and fusing heterogeneous multi-source data according to claim 6, wherein the extracting initial data in standard form, obtaining metadata information, and packaging the metadata information to obtain a plurality of preset packaging tables, includes:
determining a mapping corresponding to each preset encapsulation table, wherein the mapping comprises row selection, column joining, conversion, and changes of column names and table names;
and converting each preset encapsulation table through the mapping corresponding to each preset encapsulation table to obtain each preset virtual table.
8. The method for collecting and fusing heterogeneous multi-source data according to any one of claims 1 to 7, wherein the accessing the query range in the target data source by the search method, returning the access result, includes:
and optimizing and executing the query statement corresponding to the search mode in a data federation query or real-time mirroring mode, extracting data from the cache or the target data source end, merging and assembling, and returning the access result in a data format contained in the user request.
9. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing a multi-source heterogeneous data collection and fusion method according to any one of claims 1 to 8 when the program is executed by the processor.
10. A non-transitory computer readable storage medium, having stored thereon a computer program which, when executed by a processor, implements a multi-source heterogeneous data collection and fusion method according to any of claims 1 to 8.
CN202310514985.9A 2023-05-09 2023-05-09 Multi-source heterogeneous data acquisition and fusion method, equipment and storage medium Pending CN116738356A (en)

Publications (1)

Publication Number Publication Date
CN116738356A true CN116738356A (en) 2023-09-12

Family

ID=87903465

Country Status (1)

Country Link
CN (1) CN116738356A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination