CN116881348A - Multi-source heterogeneous data storage method, device, computer equipment and storage medium - Google Patents

Info

Publication number
CN116881348A
CN116881348A CN202310847701.8A CN202310847701A CN116881348A CN 116881348 A CN116881348 A CN 116881348A CN 202310847701 A CN202310847701 A CN 202310847701A CN 116881348 A CN116881348 A CN 116881348A
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310847701.8A
Other languages
Chinese (zh)
Inventor
李鹏
黄文琦
梁凌宇
曹尚
张焕明
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Southern Power Grid Digital Grid Research Institute Co Ltd
Original Assignee
Southern Power Grid Digital Grid Research Institute Co Ltd
Application filed by Southern Power Grid Digital Grid Research Institute Co Ltd
Priority to CN202310847701.8A
Publication of CN116881348A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/25 Integrating or interfacing systems involving database management systems
    • G06F16/258 Data format conversion from or to a database
    • G06F16/21 Design, administration or maintenance of databases
    • G06F16/215 Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Quality & Reliability (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application relates to a multi-source heterogeneous data storage method, a multi-source heterogeneous data storage device, computer equipment and a storage medium. The method comprises the following steps: extracting data of interest corresponding to items of interest from at least two data sources; processing the extracted data of interest to obtain canonical data, and determining a source path of the canonical data; determining a value density parameter of each data source in the source path according to the items of interest and the data items involved in that data source; and storing the canonical data, the source path of the canonical data and the value density parameters of the data sources in the source path in a database. By adopting the method, the efficiency of extracting and storing multi-source heterogeneous data can be improved, and the multi-source heterogeneous data can be traced back to its sources.

Description

Multi-source heterogeneous data storage method, device, computer equipment and storage medium
Technical Field
The present application relates to the field of data processing technologies, and in particular, to a method and apparatus for storing multi-source heterogeneous data, a computer device, and a storage medium.
Background
Data in the real world often comes from different data sources and is stored in different forms. When integrating and processing such data, how to quickly extract valuable data from multiple data sources and integrate the multi-source heterogeneous data for efficient analysis and utilization is therefore an important problem.
With the development of society, the amount of data that can be collected keeps growing, while the proportion of valuable data keeps falling. In the prior art, multi-source heterogeneous data is extracted and stored by collecting data from different data sources with a collection tool and then integrating and storing the data using data warehouse technology. However, this approach suffers from low data extraction efficiency, an inability to trace data back to its source, and low accuracy.
Disclosure of Invention
In view of the foregoing, it is desirable to provide a multi-source heterogeneous data storage method, apparatus, computer device and storage medium capable of improving the efficiency of extracting and storing multi-source heterogeneous data and implementing tracing of the multi-source heterogeneous data.
In a first aspect, the present application provides a method for storing multi-source heterogeneous data, the method comprising:
extracting data of interest corresponding to items of interest from at least two data sources;
processing the extracted data of interest to obtain canonical data, and determining a source path of the canonical data;
determining a value density parameter of the data source in the source path according to the items of interest and the data items involved in the data source in the source path;
storing the canonical data, the source path of the canonical data, and the value density parameters of the data sources in the source path in a database.
In one embodiment, extracting data of interest corresponding to an item of interest from at least two data sources includes:
extracting, for each data source, initial data from the data source;
identifying whether the data items of the initial data contain an item of interest;
if yes, extracting the data of interest corresponding to the item of interest from the initial data.
In one embodiment, processing the extracted data of interest to obtain canonical data includes:
converting and de-duplicating the data of interest of the same item of interest extracted from different data sources to obtain the canonical data; and/or,
converting and fusing the extracted data of interest of different items of interest to obtain the canonical data.
In one embodiment, determining the source path of the canonical data includes:
constructing the source path of the canonical data according to the relationship between the data of interest and each data source and the relationship between the data of interest and the canonical data.
In one embodiment, determining a value density parameter of a data source in the source path according to the items of interest and the data items involved in the data source in the source path comprises:
determining the number of common items between the items of interest and the data items involved in the data source in the source path;
taking the ratio between the number of common items and the number of data items involved in the data source in the source path as the value density parameter of the data source in the source path.
In one embodiment, determining a value density parameter of a data source in the source path according to the items of interest and the data items involved in the data source in the source path comprises:
weighting the number of items of interest and the number of data items involved in the data source in the source path using set parameters to obtain an intermediate variable;
normalizing the intermediate variable using an activation function to obtain the value density parameter of the data source in the source path.
In a second aspect, the present application also provides a multi-source heterogeneous data storage device, the device comprising:
a data extraction module, configured to extract data of interest corresponding to items of interest from at least two data sources;
a path determining module, configured to process the extracted data of interest to obtain canonical data and to determine a source path of the canonical data;
a parameter determining module, configured to determine a value density parameter of the data source in the source path according to the items of interest and the data items involved in the data source in the source path;
a data storage module, configured to store the canonical data, the source path of the canonical data and the value density parameters of the data sources in the source path in a database.
In a third aspect, the present application also provides a computer device comprising a memory storing a computer program and a processor implementing the following steps when executing the computer program:
extracting data of interest corresponding to items of interest from at least two data sources;
processing the extracted data of interest to obtain canonical data, and determining a source path of the canonical data;
determining a value density parameter of the data source in the source path according to the items of interest and the data items involved in the data source in the source path;
the canonical data, the source path of the canonical data, and the value density parameters of the data sources in the source path are stored in a database.
In a fourth aspect, the present application also provides a computer readable storage medium having stored thereon a computer program which when executed by a processor performs the steps of:
extracting data of interest corresponding to items of interest from at least two data sources;
processing the extracted data of interest to obtain canonical data, and determining a source path of the canonical data;
determining a value density parameter of the data source in the source path according to the items of interest and the data items involved in the data source in the source path;
the canonical data, the source path of the canonical data, and the value density parameters of the data sources in the source path are stored in a database.
In a fifth aspect, the application also provides a computer program product comprising a computer program which, when executed by a processor, performs the steps of:
extracting data of interest corresponding to items of interest from at least two data sources;
processing the extracted data of interest to obtain canonical data, and determining a source path of the canonical data;
determining a value density parameter of the data source in the source path according to the items of interest and the data items involved in the data source in the source path;
the canonical data, the source path of the canonical data, and the value density parameters of the data sources in the source path are stored in a database.
According to the multi-source heterogeneous data storage method, device, computer equipment and storage medium described above, data of interest corresponding to the items of interest is extracted from a plurality of data sources, the data of interest is processed to obtain canonical data, the source path of the canonical data is determined, and the canonical data and its source path are stored in the database, so that the canonical data can be traced quickly. Further, the value density parameter of each data source in the source path is determined according to the items of interest and the data items involved in that data source, and the value density parameters are stored in the database. When data of interest corresponding to the items of interest is extracted in a future period, data sources with larger value density parameters can then be extracted first, which improves the efficiency and accuracy of extracting and storing multi-source heterogeneous data.
Drawings
FIG. 1 is a flow diagram of a method of multi-source heterogeneous data storage in one embodiment;
FIG. 2 is a block diagram of extracting data of interest from a data source in one embodiment;
FIG. 3 is a block diagram of extracting canonical data from data of interest in one embodiment;
FIG. 4 is a flow chart of determining a value density parameter of a data source in a source path according to one embodiment;
FIG. 5 is a flow chart of another embodiment for determining a value density parameter of a data source in a source path;
FIG. 6 is a block diagram of a multi-source heterogeneous data storage device in one embodiment;
FIG. 7 is a block diagram of a multi-source heterogeneous data storage device in accordance with another embodiment;
FIG. 8 is a block diagram of a multi-source heterogeneous data storage device in yet another embodiment;
FIG. 9 is a block diagram of a multi-source heterogeneous data storage device in accordance with yet another embodiment;
FIG. 10 is an internal structural view of a computer device in one embodiment.
Detailed Description
The present application will be described in further detail with reference to the drawings and examples, in order to make the objects, technical solutions and advantages of the present application more apparent. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the application.
The multi-source heterogeneous data storage method provided by the embodiments of the application can be applied to scenarios in which multi-source heterogeneous data is extracted and the extracted data is stored in a database. Optionally, the multi-source heterogeneous data storage method may be performed by a server. A data storage system may store the data that the server needs to process, such as the initial data in a data source. The data storage system may be integrated on the server or may be placed on a cloud or other network server. The server may be implemented as a stand-alone server or as a server cluster composed of a plurality of servers.
In one embodiment, as shown in FIG. 1, a multi-source heterogeneous data storage method is provided. The method is described here as applied to a server; it can be understood that the method can also be applied to a system comprising a terminal and a server and implemented through interaction between the terminal and the server. In this embodiment, the method includes the following steps:
S101, extracting data of interest corresponding to items of interest from at least two data sources.
In this embodiment, a data source is a source from which data originates, and there may be multiple data sources. An item of interest is a data index of interest, such as the power generation amount per unit time or the power generation duration in one day. The data of interest is the specific data corresponding to each item of interest, for example, a power generation duration of 12 hours.
Specifically, the items of interest can be preset and searched for in each data source; further, the data of interest corresponding to the items of interest is extracted from the data sources.
Optionally, in order to make the extracted data of interest more accurate, an embodiment is provided in which, for each data source, initial data is extracted from the data source; whether the data items of the initial data contain an item of interest is identified; and if yes, the data of interest corresponding to the item of interest is extracted from the initial data.
In this embodiment, the initial data is all the data collected directly from the data source without processing or screening. The data items are the data indexes corresponding to the initial data, such as population density, weather temperature and the like.
For each data source, all data in the data source can be extracted to obtain the initial data; the initial data is then analyzed to obtain its data items, and it is determined whether these data items contain a preset item of interest. If yes, the data of interest corresponding to the item of interest is extracted from the initial data; if not, the data source does not contain data of interest corresponding to the item of interest, and a label can be set for the data source to indicate that it does not contain data of interest.
It will be appreciated that, for each data source, extracting the initial data and checking whether its data items contain an item of interest has two benefits. If they do, the data of interest corresponding to the item of interest can be extracted from the data source accurately and quickly. If they do not, no data needs to be extracted from the data source, and the data source can be excluded when data of interest is extracted later, so that resources are not wasted.
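The extraction and labelling logic of this embodiment can be sketched as follows. This is a minimal illustration only: the in-memory `SOURCES` dictionary, the source names and the field names are hypothetical and do not appear in the application, and a real data source would be a database, file or interface rather than a Python dict.

```python
from typing import Any

# Hypothetical in-memory data sources: each maps a data item (index name) to a raw value.
SOURCES: dict[str, dict[str, Any]] = {
    "plant_scada": {
        "generation_per_hour_mwh": 35.0,
        "generation_hours_per_day": 12,
        "ambient_temp_c": 28.5,
    },
    "weather_feed": {"ambient_temp_c": 29.0, "population_density": 1200},
}


def extract_data_of_interest(sources, items_of_interest):
    """Return (extracted records, names of sources labelled as holding no data of interest)."""
    extracted = []
    no_interest_label = set()
    for name, initial_data in sources.items():
        # Data items of the initial data that are also items of interest.
        hits = [item for item in initial_data if item in items_of_interest]
        if not hits:
            # Label the source so that later extraction runs can skip it.
            no_interest_label.add(name)
            continue
        for item in hits:
            extracted.append({"source": name, "item": item, "value": initial_data[item]})
    return extracted, no_interest_label


records, skipped = extract_data_of_interest(
    SOURCES, {"generation_per_hour_mwh", "generation_hours_per_day"}
)
```

Here `skipped` plays the role of the label described above: a source placed in it contributed no item of interest and can be excluded from later runs.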
S102, processing the extracted data of interest to obtain canonical data, and determining a source path of the canonical data.
In this embodiment, canonical data is data with a uniform data format and standardized data content. The source path of the canonical data describes where the canonical data comes from, that is, the correspondence between the canonical data, the data of interest and the data sources.
Specifically, after the data of interest is extracted from the plurality of data sources, processing such as data cleaning can be performed on the data of interest to obtain the canonical data; further, the sources of the canonical data can be analyzed to determine the source path of each piece of canonical data.
S103, determining the value density parameter of the data source in the source path according to the items of interest and the data items involved in the data source in the source path.
In this embodiment, the value density parameter of a data source is a parameter describing the value of the data source; the greater the value density parameter of a data source, the higher the value of that data source.
Specifically, after the source paths of the canonical data are determined, the data items involved in the data sources in each source path can be determined; further, the preset items of interest and the data items involved in the data source in the source path can be input into a preset value determination model, and the value density parameter of the data source in the source path is determined by the value determination model.
S104, storing the canonical data, the source path of the canonical data and the value density parameters of the data sources in the source path in a database.
In this embodiment, the database is a repository for storing data, and may be HBase (a distributed, column-oriented open-source database), HDFS (Hadoop Distributed File System, a distributed file system designed to run on general-purpose hardware), or the like.
Specifically, after the canonical data, the source path of the canonical data and the value density parameters of the data sources in the source path are obtained, they can be stored in separate tables in the database, and each table is named so that the data can be analyzed and processed later.
Optionally, after the canonical data is obtained, each piece of canonical data may be stored in the database table used for canonical data, under the field for its corresponding data item. For example, if the data item corresponding to the canonical data is the daily electricity consumption of a certain unit, the canonical data may be stored under the daily electricity consumption field of that table, so that it can be retrieved quickly later.
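As an illustration of the three-table layout described above, the following sketch uses Python's built-in `sqlite3` module as a stand-in for the database. This is a substitution made purely so the example is self-contained; the application names HBase and HDFS, not SQLite. The table names, column names and sample values are all hypothetical.

```python
import sqlite3

# In-memory SQLite database used only as a stand-in for the real store.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE canonical_data (
        id    INTEGER PRIMARY KEY,
        item  TEXT NOT NULL,   -- data item the canonical value belongs to
        value REAL NOT NULL
    );
    CREATE TABLE source_path (
        canonical_id     INTEGER NOT NULL REFERENCES canonical_data(id),
        data_of_interest TEXT NOT NULL,
        data_source      TEXT NOT NULL
    );
    CREATE TABLE value_density (
        data_source TEXT PRIMARY KEY,
        density     REAL NOT NULL
    );
    """
)

# Hypothetical sample rows.
conn.execute(
    "INSERT INTO canonical_data (id, item, value) VALUES (?, ?, ?)",
    (1, "daily_electricity_consumption_kwh", 420.0),
)
conn.execute("INSERT INTO source_path VALUES (?, ?, ?)", (1, "doi_1", "source_1"))
conn.execute("INSERT INTO value_density VALUES (?, ?)", ("source_1", 0.5))
conn.commit()


def trace(connection, canonical_id):
    """Return the (data of interest, data source) pairs a canonical record came from."""
    cursor = connection.execute(
        "SELECT data_of_interest, data_source FROM source_path "
        "WHERE canonical_id = ? ORDER BY data_source",
        (canonical_id,),
    )
    return cursor.fetchall()
```

Keeping the source path in its own table means a tracing query such as `trace(conn, 1)` never has to touch the canonical values themselves.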
According to the multi-source heterogeneous data storage method described above, data of interest corresponding to the items of interest is extracted from a plurality of data sources, the data of interest is processed to obtain canonical data, the source path of the canonical data is determined, and the canonical data and its source path are stored in the database, so that the canonical data can be traced quickly. Further, the value density parameter of each data source in the source path is determined according to the items of interest and the data items involved in that data source, and the value density parameters are stored in the database. When data of interest corresponding to the items of interest is extracted in a future period, data sources with larger value density parameters can then be extracted first, which improves the efficiency and accuracy of extracting and storing multi-source heterogeneous data.
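The prioritisation mentioned above can be sketched as a simple sort over the stored value density parameters. The function name `extraction_order`, the source names and the densities are hypothetical, and breaking ties by source name is an assumption added only to make the order deterministic.

```python
def extraction_order(value_density: dict[str, float]) -> list[str]:
    """Order data sources so that those with larger value density come first."""
    # Ties are broken by source name (an assumption, for determinism only).
    return sorted(value_density, key=lambda source: (-value_density[source], source))


order = extraction_order({"source_1": 0.25, "source_2": 0.8, "source_3": 0.5})
```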
In order to make the obtained canonical data more concise and accurate, in one embodiment, the data of interest of the same item of interest extracted from different data sources may be converted and de-duplicated to obtain the canonical data.
Specifically, if the extracted data of interest includes data of the same item of interest extracted from different data sources, data conversion can be performed on the data of interest corresponding to that item of interest; de-duplication is then performed on the converted data of interest to obtain one piece of canonical data. For example, if two pieces of data of interest are extracted from two data sources, both corresponding to the item of interest "power generation amount per unit time", and one of them is in picture format, that piece may be converted into character format; the two pieces of data of interest, now in the same format, can then be de-duplicated, finally yielding one piece of canonical data corresponding to the item of interest "power generation amount per unit time".
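A minimal sketch of convert-then-de-duplicate follows. It assumes, for illustration only, that the format mismatch is a number versus a string with a unit suffix; conversion from picture format, as in the example above, would need an additional recognition step that is not shown. The names `to_float` and `convert_and_deduplicate` and the record layout are hypothetical.

```python
def to_float(raw) -> float:
    """Convert a raw value (a number, or a string such as '35.0 MWh') to a float."""
    if isinstance(raw, (int, float)):
        return float(raw)
    # Assumption: the numeric part is the first whitespace-separated token.
    return float(str(raw).split()[0])


def convert_and_deduplicate(records):
    """Map each item of interest to its distinct converted values."""
    canonical: dict[str, list[float]] = {}
    for record in records:
        value = to_float(record["value"])
        values = canonical.setdefault(record["item"], [])
        if value not in values:  # drop exact duplicates after conversion
            values.append(value)
    return canonical


canonical = convert_and_deduplicate(
    [
        {"source": "source_1", "item": "generation_per_hour_mwh", "value": 35.0},
        {"source": "source_2", "item": "generation_per_hour_mwh", "value": "35.0 MWh"},
    ]
)
```

If two sources disagree after conversion, this sketch keeps both values rather than choosing one; how such conflicts are resolved is not specified in the text.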
Furthermore, the extracted data of interest of different items of interest can be converted and fused to obtain the canonical data.
Specifically, the extracted data of interest can be analyzed to determine whether any of the corresponding items of interest can be fused. If so, the items of interest that can be fused are merged to obtain a new item of interest; data conversion is performed on the data of interest corresponding to those items of interest so that their formats are unified; the converted data of interest can then be fused to obtain one piece of canonical data. For example, if the extracted data of interest includes data whose item of interest is the daily power generation duration of a unit and data whose item of interest is the hourly power generation amount of the unit, the two items of interest may be fused to obtain a new item of interest, namely the daily power generation amount of the unit; further, the two pieces of data of interest may be fused to obtain canonical data whose item of interest is the daily power generation amount of the unit.
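The fusion example can be sketched as below. The text does not state how the two values are combined; multiplying the daily generation duration by the hourly generation amount is an assumption made here because it yields a daily amount. The function and field names are hypothetical.

```python
def fuse_daily_generation(hours_per_day: float, generation_per_hour: float) -> dict:
    """Fuse two pieces of data of interest into one canonical record.

    Assumption: daily generation = generation hours per day x generation per hour.
    """
    return {
        "item": "daily_generation_mwh",
        "value": hours_per_day * generation_per_hour,
    }


fused = fuse_daily_generation(hours_per_day=12, generation_per_hour=35.0)
```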
It can be understood that performing data conversion, de-duplication and fusion on the extracted data of interest makes the resulting canonical data more concise and accurate, which in turn makes subsequent storage of the canonical data more efficient.
In order to make the determined source path of the canonical data clearer and more accurate, in one embodiment, the source path of the canonical data may be constructed according to the relationship between the data of interest and each data source and the relationship between the data of interest and the canonical data.
Specifically, after the data of interest is extracted from the data sources, the relationship between each piece of data of interest and each data source may be established; after the canonical data is obtained from the data of interest, the relationship between each piece of data of interest and each piece of canonical data can be established; the two relationships are then joined through the data of interest they share, so that the source path of the canonical data can be constructed. For example, if four pieces of data of interest are extracted from three data sources, that is, data of interest 1 and data of interest 2 are extracted from data source 1, data of interest 3 is extracted from data source 2, and data of interest 4 is extracted from data source 3, the relationship between each data source and each piece of data of interest shown in FIG. 2 can be constructed. Further, if the four pieces of data of interest are processed to obtain three pieces of canonical data, that is, data of interest 1 yields canonical data 1, data of interest 2 yields canonical data 2, and data of interest 3 and data of interest 4 together yield canonical data 3, the relationship between each piece of data of interest and each piece of canonical data shown in FIG. 3 can be constructed. Joining the two relationships through the data of interest gives the source path of each piece of canonical data.
It can be understood that establishing the relationship between each data source and each piece of data of interest, and the relationship between each piece of data of interest and each piece of canonical data, before constructing the source path makes the constructed source path of the canonical data clearer and more accurate.
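The example with three data sources, four pieces of data of interest and three pieces of canonical data can be written out as two mappings joined on the data of interest. This is a sketch of one possible representation; the dictionary names, the identifiers such as `doi_1`, and the choice of tuples for path segments are hypothetical.

```python
# Relationship between data of interest and data sources (the FIG. 2 example).
SOURCE_OF = {
    "doi_1": "source_1",
    "doi_2": "source_1",
    "doi_3": "source_2",
    "doi_4": "source_3",
}

# Relationship between canonical data and data of interest (the FIG. 3 example).
DERIVED_FROM = {
    "canonical_1": ["doi_1"],
    "canonical_2": ["doi_2"],
    "canonical_3": ["doi_3", "doi_4"],
}


def build_source_paths(source_of, derived_from):
    """Join the two relationships on the data of interest.

    Each path segment is (data source, data of interest, canonical data).
    """
    return {
        canonical: [(source_of[doi], doi, canonical) for doi in dois]
        for canonical, dois in derived_from.items()
    }


paths = build_source_paths(SOURCE_OF, DERIVED_FROM)
```

In this representation, canonical data 3 has two segments, one through data source 2 and one through data source 3, matching the example.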
In one embodiment, as shown in fig. 4, an implementation manner is provided for determining a value density parameter of a data source in a source path, which specifically includes the following steps:
S401, determining the number of common items between the items of interest and the data items involved in the data source in the source path.
Specifically, the data sources involved in the source path may be determined first; the preset items of interest are then matched against the data items involved in each data source in the source path, so that the common items between them, and hence the number of common items, can be determined.
S402, taking the ratio between the number of common items and the number of data items involved in the data source in the source path as the value density parameter of the data source in the source path.
Specifically, after the number of common items is determined, the ratio between the number of common items and the number of data items involved in the data source in the source path can be taken as the value density parameter of the data source in the source path, according to the following formula (1).
$$p_i = \frac{N_{ei}}{N_{oi}} \tag{1}$$
where $p_i$ denotes the value density parameter of the $i$-th data source in the source path, $N_{ei}$ denotes the number of common items between the items of interest and the data items involved in the $i$-th data source in the source path, and $N_{oi}$ denotes the number of data items involved in the $i$-th data source in the source path.
It can be appreciated that determining the number of common items between the items of interest and the data items involved in the data source in the source path, and taking the ratio between this number and the number of data items involved in the data source as its value density parameter, makes the value density parameter more accurate. The value of each data source can thus be determined more accurately, so that when data of interest needs to be acquired in a future period, data can be extracted first from the data sources with higher value.
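Formula (1) can be sketched directly with set intersection. The function name and the sample item names are hypothetical, and returning 0.0 for a data source with no data items is an assumption added here to avoid division by zero; the text does not cover that case.

```python
def value_density_ratio(items_of_interest: set[str], source_items: set[str]) -> float:
    """Formula (1): p_i = N_ei / N_oi for one data source."""
    if not source_items:
        return 0.0  # assumption: an empty source has zero value density
    n_ei = len(items_of_interest & source_items)  # number of common items
    n_oi = len(source_items)                      # number of data items in the source
    return n_ei / n_oi


p = value_density_ratio(
    {"generation_per_hour", "generation_hours"},
    {"generation_per_hour", "generation_hours", "ambient_temp", "population_density"},
)
```

With two of the source's four data items being items of interest, `p` is 0.5.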
Alternatively, as shown in FIG. 5, another embodiment for determining the value density parameter of a data source in the source path may include the following steps:
S501, weighting the number of items of interest and the number of data items involved in the data source in the source path using set parameters to obtain an intermediate variable.
Here, the set parameters are preset parameters, and the intermediate variable is a variable obtained in the course of determining the value density parameter of the data source in the source path.
Specifically, after the number of items of interest and the number of data items involved in the data source in the source path are determined, they may be weighted by the following formula (2) to determine the intermediate variable.
$$q_i = a_0 + a_1 \times N_{oi} + a_2 \times N_{ei} \tag{2}$$
where $q_i$ denotes the intermediate variable, and $a_0$, $a_1$ and $a_2$ are the preset set parameters.
S502, normalizing the intermediate variable using an activation function to obtain the value density parameter of the data source in the source path.
Specifically, after the intermediate variable is obtained, it can be normalized using an activation function, such as the sigmoid function, to obtain the value density parameter of the data source in the source path according to the following formula (3).
$$p_i = \mathrm{sigmoid}(q_i) \tag{3}$$
It can be understood that, by weighting the number of attention items and the number of data items related to the data source in the source path to obtain an intermediate variable, and then normalizing the intermediate variable with an activation function, the value density parameter of the data source in the source path can be determined more accurately. The value of each data source can thus be determined more accurately, so that when attention data needs to be acquired in a future period, data can be extracted preferentially from the data sources with higher value.
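Formulas (2) and (3) can be sketched in Python as follows. This is an illustration under stated assumptions: the first count is taken to be the number of attention items and the second the number of data items related to the data source, following the order in which the text lists them, and the coefficient values are hypothetical since the description only says that they are preset.

```python
import math


def sigmoid(x):
    """Numerically stable logistic function mapping any real number into (0, 1)."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def value_density_weighted(n_interest, n_source_items, a0, a1, a2):
    """Formula (2) followed by formula (3)."""
    q = a0 + a1 * n_interest + a2 * n_source_items  # intermediate variable
    return sigmoid(q)                               # normalized value density


# Hypothetical preset parameters, for illustration only.
A0, A1, A2 = 0.0, 0.5, -0.1
p = value_density_weighted(n_interest=3, n_source_items=4, a0=A0, a1=A1, a2=A2)
```

Because the sigmoid function is monotonic, the ordering of data sources by this parameter depends only on the weighted sum; the normalization merely places every value in the interval (0, 1) so that values from different data sources are comparable.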
In one embodiment, as shown in FIG. 6, a preferred example of a multi-source heterogeneous data storage method is provided. The specific process is as follows:
S601, for each data source, extracting initial data from the data source.
S602, identifying whether a data item of initial data contains a concerned item or not; if not, executing S603; if yes, then S604 is performed.
S603, eliminating the data source corresponding to the initial data.
S604, extracting attention data corresponding to the attention item from the initial data.
S605, converting and de-duplicating the attention data of the same attention item extracted from different data sources to obtain canonical data.
S606, converting and fusing the extracted attention data of different attention items to obtain canonical data.
S607, constructing a source path of the specification data according to the relation between the data of interest and each data source and the relation between the data of interest and the specification data.
S608, determining the number of common items between the item of interest and the data item involved in the data source in the source path.
S609, the ratio between the number of the common items and the number of the data items related to the data sources in the source path is used as the value density parameter of the data sources in the source path.
S610, storing the specification data, the source path of the specification data and the value density parameters of the data sources in the source path in a database.
The specific process of S601 to S610 may refer to the description of the above method embodiment, and its implementation principle and technical effects are similar, and are not repeated here.
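Steps S601 to S604 can be sketched as follows. This is a minimal illustration that assumes each data source yields its initial data as a list of dictionaries whose keys are the data items; the function name, the source names and the field names are hypothetical.

```python
def extract_data_of_interest(sources, items_of_interest):
    """sources: mapping of data source name -> list of records (dicts)."""
    items_of_interest = set(items_of_interest)
    extracted = {}
    for name, records in sources.items():   # S601: initial data of each source
        fields = set()
        for record in records:
            fields.update(record)            # collect the data items of the source
        wanted = fields & items_of_interest  # S602: is any attention item present?
        if not wanted:
            continue                         # S603: eliminate this data source
        extracted[name] = [                  # S604: keep only the attention data
            {key: record[key] for key in wanted if key in record}
            for record in records
        ]
    return extracted


# Hypothetical data sources, for illustration only.
sources = {
    "scada": [{"voltage": 10.0, "operator_id": 7}],
    "billing": [{"tariff": "A"}],
}
result = extract_data_of_interest(sources, {"voltage", "current"})
```

Here the hypothetical source "billing" contains no attention item and is therefore dropped, while only the "voltage" field is retained from "scada".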
It should be understood that, although the steps in the flowcharts of the above embodiments are shown in sequence as indicated by the arrows, these steps are not necessarily performed in that order. Unless explicitly stated herein, the execution of these steps is not strictly limited to the order shown, and the steps may be performed in other orders. Moreover, at least some of the steps in the flowcharts of the above embodiments may include a plurality of sub-steps or stages, which are not necessarily performed at the same time but may be performed at different times; nor are these sub-steps or stages necessarily performed in sequence, but may be performed in turn or alternately with other steps or with at least some of the sub-steps or stages of other steps.
Based on the same inventive concept, an embodiment of the application also provides a multi-source heterogeneous data storage device for implementing the multi-source heterogeneous data storage method described above. The solution provided by the device is implemented in a manner similar to that described for the method, so for the specific limitations in the embodiments of the multi-source heterogeneous data storage device provided below, reference may be made to the limitations of the multi-source heterogeneous data storage method above, which are not repeated here.
In one embodiment, as shown in FIG. 7, there is provided a multi-source heterogeneous data storage device 1, comprising: a data extraction module 10, a path determination module 20, a parameter determination module 30, and a data storage module 40, wherein:
a data extraction module 10, configured to extract attention data corresponding to an attention item from at least two data sources;
a path determining module 20, configured to process the extracted data of interest to obtain canonical data, and determine a source path of the canonical data;
a parameter determining module 30, configured to determine a value density parameter of the data source in the source path according to the attention item and the data item related to the data source in the source path;
the data storage module 40 is configured to store the canonical data, the source path of the canonical data, and the value density parameter of the data source in the source path in the database.
In one embodiment, based on the above fig. 7, as shown in fig. 8, the data extraction module 10 may include:
a first extraction unit 11 for extracting, for each data source, initial data from the data source;
an item determination unit 12 for identifying whether or not a focused item is contained in the data items of the initial data;
a second extraction unit 13 for extracting attention data corresponding to the attention item from the initial data.
In one embodiment, based on the above fig. 7 or fig. 8, as shown in fig. 9, the path determining module 20 may include:
a first processing unit 21, configured to perform conversion and deduplication processing on attention data of the same attention item extracted from different data sources, to obtain specification data;
the second processing unit 22 is configured to perform conversion and fusion processing on the extracted attention data of different attention items, so as to obtain specification data.
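The two processing units can be illustrated with the following sketch. The unit conversion rule, the tuple layout of a reading and all names are assumptions made for illustration; the description does not prescribe a particular conversion, de-duplication or fusion rule.

```python
def to_canonical(value, unit):
    """Hypothetical conversion: express every voltage reading in volts."""
    return value * 1000.0 if unit == "kV" else float(value)


def dedupe_same_item(readings):
    """readings: (source, key, value, unit) tuples of ONE attention item.

    After conversion, identical (key, value) pairs coming from different
    data sources are kept only once, in first-seen order.
    """
    unique = dict.fromkeys(
        (key, to_canonical(value, unit)) for _source, key, value, unit in readings
    )
    return list(unique)


def fuse_different_items(per_item):
    """per_item: attention item -> {key: canonical value}; merge on the shared key."""
    fused = {}
    for item, values in per_item.items():
        for key, value in values.items():
            fused.setdefault(key, {})[item] = value
    return fused


# Hypothetical readings, for illustration only.
readings = [
    ("scada", "t1", 10, "kV"),
    ("metering", "t1", 10000, "V"),  # same reading, different unit and source
    ("scada", "t2", 11, "kV"),
]
unique = dedupe_same_item(readings)
fused = fuse_different_items({
    "voltage": {"t1": 10000.0},
    "current": {"t1": 5.0, "t2": 6.0},
})
```

Conversion is applied before de-duplication so that the same reading expressed in different units by different data sources is recognized as a duplicate.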
In one embodiment, the path determining module 20 may specifically be configured to:
constructing a source path of the canonical data according to the relation between the attention data and each data source and the relation between the attention data and the canonical data.
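One possible representation of a source path is sketched below, assuming that both relations are available as simple mappings: one from each piece of attention data to the data source it was extracted from, and one from each piece of canonical data to the attention data it was derived from. The identifiers and the list-of-pairs layout are hypothetical; the description does not fix a concrete data structure.

```python
def build_source_paths(interest_to_source, canonical_to_interest):
    """Return, for each canonical record, its (attention data, data source) pairs."""
    return {
        canonical_id: [
            (interest_id, interest_to_source[interest_id])
            for interest_id in interest_ids
        ]
        for canonical_id, interest_ids in canonical_to_interest.items()
    }


# Hypothetical identifiers, for illustration only.
interest_to_source = {"d1": "scada", "d2": "metering"}
canonical_to_interest = {"c1": ["d1", "d2"]}
paths = build_source_paths(interest_to_source, canonical_to_interest)
```

With such a structure, the data sources that contributed to a given piece of canonical data can be traced back directly from its identifier.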
In one embodiment, the parameter determining module 30 may specifically be configured to:
Determining a number of common items between the item of interest and data items related to the data source in the source path; the ratio between the number of common items and the number of data items involved by the data source in the source path is used as a value density parameter of the data source in the source path.
In one embodiment, the parameter determination module 30 described above may also be used to:
weighting the quantity of the concerned items and the quantity of the data items related to the data sources in the source path by adopting set parameters to obtain intermediate variables; and normalizing the intermediate variable by adopting an activation function to obtain the value density parameter of the data source in the source path.
Each module in the multi-source heterogeneous data storage device described above may be implemented in whole or in part by software, hardware, or a combination thereof. The above modules may be embedded in, or independent of, a processor of the computer device in hardware form, or may be stored in a memory of the computer device in software form, so that the processor can call and execute the operations corresponding to the above modules.
In one embodiment, a computer device is provided, which may be a server, and the internal structure of which may be as shown in fig. 10. The computer device includes a processor, a memory, and a network interface connected by a system bus. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, computer programs, and a database. The internal memory provides an environment for the operation of the operating system and computer programs in the non-volatile storage media. The database of the computer device is used to store initial data in the data source, etc. The network interface of the computer device is used for communicating with an external terminal through a network connection. The computer program is executed by a processor to implement a multi-source heterogeneous data storage method.
It will be appreciated by those skilled in the art that the structure shown in FIG. 10 is merely a block diagram of some of the structures associated with the present inventive arrangements and is not limiting of the computer device to which the present inventive arrangements may be applied, and that a particular computer device may include more or fewer components than shown, or may combine some of the components, or have a different arrangement of components.
In one embodiment, a computer device is provided comprising a memory and a processor, the memory having stored therein a computer program, the processor when executing the computer program performing the steps of:
extracting attention data corresponding to attention items from at least two data sources;
processing the extracted data of interest to obtain canonical data, and determining a source path of the canonical data;
determining a value density parameter of the data source in the source path according to the attention item and the data item related to the data source in the source path;
the canonical data, the source path of the canonical data, and the value density parameters of the data sources in the source path are stored in a database.
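The storing step can be sketched with the `sqlite3` module of the Python standard library. The use of SQLite, the two-table layout, the JSON encoding and all table, column and function names are assumptions made for illustration; the description only requires that the canonical data, its source path and the value density parameters be stored in a database.

```python
import json
import sqlite3


def store(conn, canonical_id, canonical_data, source_path, densities):
    """Persist one canonical record, its source path and the per-source densities."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS canonical ("
        "id TEXT PRIMARY KEY, data TEXT, source_path TEXT)"
    )
    conn.execute(
        "CREATE TABLE IF NOT EXISTS source_density ("
        "canonical_id TEXT, source TEXT, density REAL, "
        "PRIMARY KEY (canonical_id, source))"
    )
    conn.execute(
        "INSERT OR REPLACE INTO canonical VALUES (?, ?, ?)",
        (canonical_id, json.dumps(canonical_data), json.dumps(source_path)),
    )
    conn.executemany(
        "INSERT OR REPLACE INTO source_density VALUES (?, ?, ?)",
        [(canonical_id, source, density) for source, density in densities.items()],
    )
    conn.commit()


def best_source(conn, canonical_id):
    """Return the data source with the highest value density, or None."""
    row = conn.execute(
        "SELECT source FROM source_density WHERE canonical_id = ? "
        "ORDER BY density DESC LIMIT 1",
        (canonical_id,),
    ).fetchone()
    return row[0] if row else None


# Hypothetical record, for illustration only.
conn = sqlite3.connect(":memory:")
store(conn, "c1", {"voltage": 10000.0}, ["scada", "metering"],
      {"scada": 0.75, "metering": 0.4})
preferred = best_source(conn, "c1")
```

Keeping the value density parameters in a table of their own allows a later extraction job to select the preferred data source with a single ordered query.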
In one embodiment, when the processor executes logic for extracting data of interest corresponding to an item of interest from at least two data sources, the processor further performs the steps of:
Extracting, for each data source, initial data from the data source; identifying whether a data item of the initial data contains a concerned item; if yes, extracting the concerned data corresponding to the concerned item from the initial data.
In one embodiment, when executing the logic of the computer program for processing the extracted attention data to obtain canonical data, the processor further performs the following steps:
converting and de-duplicating the attention data of the same attention item extracted from different data sources to obtain canonical data; and/or converting and fusing the extracted attention data of different attention items to obtain canonical data.
In one embodiment, when executing the logic of the computer program for determining the source path of the canonical data, the processor further performs the following step:
constructing a source path of the canonical data according to the relation between the attention data and each data source and the relation between the attention data and the canonical data.
In one embodiment, when executing the logic of the computer program for determining the value density parameter of the data source in the source path according to the attention item and the data items related to the data source in the source path, the processor further performs the following steps:
determining the number of common items between the attention item and the data items related to the data source in the source path; and taking the ratio between the number of common items and the number of data items related to the data source in the source path as the value density parameter of the data source in the source path.
In one embodiment, when executing the logic of the computer program for determining the value density parameter of the data source in the source path according to the attention item and the data items related to the data source in the source path, the processor further performs the following steps:
weighting the number of attention items and the number of data items related to the data source in the source path by using set parameters to obtain an intermediate variable; and normalizing the intermediate variable by using an activation function to obtain the value density parameter of the data source in the source path.
In one embodiment, a computer readable storage medium is provided having a computer program stored thereon, which when executed by a processor, performs the steps of:
extracting attention data corresponding to attention items from at least two data sources;
processing the extracted data of interest to obtain canonical data, and determining a source path of the canonical data;
Determining a value density parameter of the data source in the source path according to the attention item and the data item related to the data source in the source path;
the canonical data, the source path of the canonical data, and the value density parameters of the data sources in the source path are stored in a database.
In one embodiment, the logic of the computer program for extracting the data of interest corresponding to the item of interest from at least two data sources, when executed by the processor, further performs the steps of:
extracting, for each data source, initial data from the data source;
identifying whether a data item of the initial data contains a concerned item;
if yes, extracting the concerned data corresponding to the concerned item from the initial data.
In one embodiment, the logic of the computer program for processing the extracted data of interest to obtain canonical data, when executed by the processor, further performs the steps of:
converting and de-duplicating the concerned data of the same concerned item extracted from different data sources to obtain canonical data; and/or converting and fusing the extracted attention data of different attention items to obtain the specification data.
In one embodiment, the logic of the computer program for determining the source path of the canonical data, when executed by the processor, further performs the following step:
constructing a source path of the canonical data according to the relation between the attention data and each data source and the relation between the attention data and the canonical data.
In one embodiment, the logic of the computer program for determining the value density parameter of the data source in the source path based on the item of interest and the data item related to the data source in the source path further performs the following steps when executed by the processor:
determining a number of common items between the item of interest and data items related to the data source in the source path; the ratio between the number of common items and the number of data items involved by the data source in the source path is used as a value density parameter of the data source in the source path.
In one embodiment, the logic of the computer program for determining the value density parameter of the data source in the source path based on the item of interest and the data item related to the data source in the source path further performs the following steps when executed by the processor:
weighting the quantity of the concerned items and the quantity of the data items related to the data sources in the source path by adopting set parameters to obtain intermediate variables; and normalizing the intermediate variable by adopting an activation function to obtain the value density parameter of the data source in the source path.
In one embodiment, a computer program product is provided comprising a computer program which, when executed by a processor, performs the steps of:
extracting attention data corresponding to attention items from at least two data sources;
processing the extracted data of interest to obtain canonical data, and determining a source path of the canonical data;
determining a value density parameter of the data source in the source path according to the attention item and the data item related to the data source in the source path;
the canonical data, the source path of the canonical data, and the value density parameters of the data sources in the source path are stored in a database.
In one embodiment, the logic of the computer program for extracting the data of interest corresponding to the item of interest from at least two data sources, when executed by the processor, further performs the steps of:
extracting, for each data source, initial data from the data source;
identifying whether a data item of the initial data contains a concerned item;
if yes, extracting the concerned data corresponding to the concerned item from the initial data.
In one embodiment, the logic of the computer program for processing the extracted data of interest to obtain canonical data, when executed by the processor, further performs the steps of:
Converting and de-duplicating the concerned data of the same concerned item extracted from different data sources to obtain canonical data; and/or converting and fusing the extracted attention data of different attention items to obtain the specification data.
In one embodiment, the logic of the computer program for determining the source path of the canonical data, when executed by the processor, further performs the following step:
constructing a source path of the canonical data according to the relation between the attention data and each data source and the relation between the attention data and the canonical data.
In one embodiment, the logic of the computer program for determining the value density parameter of the data source in the source path based on the item of interest and the data item related to the data source in the source path further performs the following steps when executed by the processor:
determining a number of common items between the item of interest and data items related to the data source in the source path; the ratio between the number of common items and the number of data items involved by the data source in the source path is used as a value density parameter of the data source in the source path.
In one embodiment, the logic of the computer program for determining the value density parameter of the data source in the source path based on the item of interest and the data item related to the data source in the source path further performs the following steps when executed by the processor:
Weighting the quantity of the concerned items and the quantity of the data items related to the data sources in the source path by adopting set parameters to obtain intermediate variables; and normalizing the intermediate variable by adopting an activation function to obtain the value density parameter of the data source in the source path.
The data (including but not limited to the initial data in the data source) related to the present application is information and data fully authorized by each party.
Those skilled in the art will appreciate that all or part of the methods described above may be implemented by a computer program stored on a non-transitory computer readable storage medium, and that the computer program, when executed, may include the steps of the method embodiments described above. Any reference to memory, database, or other medium used in the embodiments provided herein may include at least one of non-volatile and volatile memory. The non-volatile memory may include read-only memory (ROM), magnetic tape, floppy disk, flash memory, optical memory, high-density embedded non-volatile memory, resistive random access memory (ReRAM), magnetoresistive random access memory (MRAM), ferroelectric random access memory (FRAM), phase change memory (PCM), graphene memory, and the like. The volatile memory may include random access memory (RAM), external cache memory, and the like. By way of illustration and not limitation, RAM may take a variety of forms, such as static random access memory (SRAM) or dynamic random access memory (DRAM). The databases referred to in the embodiments provided herein may include at least one of a relational database and a non-relational database. The non-relational database may include, but is not limited to, a blockchain-based distributed database. The processor referred to in the embodiments provided herein may be, but is not limited to, a general-purpose processor, a central processing unit, a graphics processor, a digital signal processor, a programmable logic unit, or a data processing logic unit based on quantum computing.
The technical features of the above embodiments may be combined arbitrarily. For brevity, not all possible combinations of the technical features in the above embodiments are described; however, as long as a combination of these technical features involves no contradiction, it should be considered to fall within the scope of this description.
The foregoing examples illustrate only a few embodiments of the application and are described in specific detail, but they are not thereby to be construed as limiting the scope of the application. It should be noted that those skilled in the art can make several variations and modifications without departing from the spirit of the application, all of which fall within the scope of the application. Accordingly, the scope of the application shall be determined by the appended claims.

Claims (10)

1. A method of multi-source heterogeneous data storage, the method comprising:
extracting attention data corresponding to attention items from at least two data sources;
processing the extracted data of interest to obtain canonical data, and determining a source path of the canonical data;
determining a value density parameter of a data source in the source path according to the attention item and the data item related to the data source in the source path;
and storing the canonical data, the source path of the canonical data and the value density parameters of the data sources in the source path in a database.
2. The method of claim 1, wherein extracting the data of interest corresponding to the item of interest from at least two data sources comprises:
extracting, for each data source, initial data from the data source;
identifying whether the attention item is contained in the data item of the initial data;
if yes, extracting the concerned data corresponding to the concerned item from the initial data.
3. The method of claim 1, wherein processing the extracted data of interest to obtain canonical data comprises:
converting and de-duplicating the attention data of the same attention item extracted from different data sources to obtain canonical data; and/or
and converting and fusing the extracted attention data of different attention items to obtain the specification data.
4. The method of claim 1, wherein said determining the source path of the specification data comprises:
and constructing a source path of the specification data according to the relation between the concerned data and each data source and the relation between the concerned data and the specification data.
5. The method of claim 1, wherein determining the value density parameter of the data source in the source path from the item of interest and the data item involved with the data source in the source path comprises:
determining a number of common items between the item of interest and data items involved in the data source in the source path;
and taking the ratio between the number of the common items and the number of the data items related to the data sources in the source path as a value density parameter of the data sources in the source path.
6. The method of claim 1, wherein determining the value density parameter of the data source in the source path from the item of interest and the data item involved with the data source in the source path comprises:
weighting the quantity of the concerned items and the quantity of the data items related to the data sources in the source path by adopting set parameters to obtain intermediate variables;
and normalizing the intermediate variable by adopting an activation function to obtain the value density parameter of the data source in the source path.
7. A multi-source heterogeneous data storage device, the device comprising:
The data extraction module is used for extracting the concerned data corresponding to the concerned item from at least two data sources;
the path determining module is used for processing the extracted data of interest to obtain specification data and determining a source path of the specification data;
the parameter determining module is used for determining a value density parameter of the data source in the source path according to the attention item and the data item related to the data source in the source path;
and the data storage module is used for storing the specification data, the source path of the specification data and the value density parameters of the data sources in the source path in a database.
8. A computer device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor implements the steps of the method of any of claims 1 to 6 when the computer program is executed.
9. A computer readable storage medium, on which a computer program is stored, characterized in that the computer program, when being executed by a processor, implements the steps of the method of any of claims 1 to 6.
10. A computer program product comprising a computer program, characterized in that the computer program, when being executed by a processor, implements the steps of the method of any of claims 1 to 6.
CN202310847701.8A 2023-07-11 2023-07-11 Multi-source heterogeneous data storage method, device, computer equipment and storage medium Pending CN116881348A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310847701.8A CN116881348A (en) 2023-07-11 2023-07-11 Multi-source heterogeneous data storage method, device, computer equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310847701.8A CN116881348A (en) 2023-07-11 2023-07-11 Multi-source heterogeneous data storage method, device, computer equipment and storage medium

Publications (1)

Publication Number Publication Date
CN116881348A true CN116881348A (en) 2023-10-13

Family

ID=88261591

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310847701.8A Pending CN116881348A (en) 2023-07-11 2023-07-11 Multi-source heterogeneous data storage method, device, computer equipment and storage medium

Country Status (1)

Country Link
CN (1) CN116881348A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination