CN114385705A - Data importance identification method, device, equipment and medium - Google Patents

Data importance identification method, device, equipment and medium

Info

Publication number
CN114385705A
Authority
CN
China
Prior art keywords
node
task
nodes
score
determining
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111551081.0A
Other languages
Chinese (zh)
Inventor
傅文易
甘红伟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Lianlian Hangzhou Information Technology Co ltd
Original Assignee
Lianlian Hangzhou Information Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Lianlian Hangzhou Information Technology Co ltd filed Critical Lianlian Hangzhou Information Technology Co ltd
Priority to CN202111551081.0A priority Critical patent/CN114385705A/en
Publication of CN114385705A publication Critical patent/CN114385705A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24Querying
    • G06F16/245Query processing
    • G06F16/2458Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
    • G06F16/2462Approximate or statistical queries
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/22Indexing; Data structures therefor; Storage structures
    • G06F16/2228Indexing structures
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/28Databases characterised by their database models, e.g. relational or object models
    • G06F16/283Multi-dimensional databases or data warehouses, e.g. MOLAP or ROLAP

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Probability & Statistics with Applications (AREA)
  • Fuzzy Systems (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to the technical field of data identification, and in particular to a method, a device, equipment and a medium for identifying data importance. The method comprises: acquiring node information of each computing task node in a data warehouse and the association relationships among the computing task nodes; generating at least one task link graph based on the association relationships among the computing task nodes; determining a terminal node of each task link based on the task link graph; determining a score for each terminal node based on the node information of that terminal node; determining the score of each computing task node in the task link graph based on the scores of all the terminal nodes; and determining important task nodes based on the scores of the computing task nodes. The method enables maintenance personnel to strengthen management of the important task nodes and prevents important task nodes from being frozen or taken offline because they have gone unmaintained for a long time.

Description

Data importance identification method, device, equipment and medium
Technical Field
The present invention relates to the field of data identification technologies, and in particular, to a method, an apparatus, a device, and a medium for identifying importance of data.
Background
In the prior art, enterprises generate more and more data, and most of them store it in a data warehouse so that the data is kept in one place. In practice, however, the different research and development teams of an enterprise only pay attention to and maintain the computing tasks they are responsible for, and computing tasks are continuously updated and brought online. Over time, a large number of abandoned, unowned and unimportant computing tasks accumulate in the data warehouse and occupy a large amount of computing and storage resources, while important, core computing tasks are often not identified and given priority protection. A data warehouse administrator generally finds it difficult to identify the importance of a specific computing task from a global perspective.
Disclosure of Invention
In order to solve the technical problems in the prior art, the invention provides a data importance identification method, which comprises the following steps:
acquiring node information of each computing task node in a data warehouse and the association relationships among the computing task nodes;
generating at least one task link graph based on the association relationships among the computing task nodes, wherein the task link graph comprises a plurality of task links, and each task link comprises a plurality of computing task nodes connected by directed links;
determining a terminal node of each task link based on the task link graph;
determining a score corresponding to each terminal node based on the node information of the terminal node;
determining the score of each computing task node in the task link graph based on the scores corresponding to all the terminal nodes;
and determining important task nodes based on the scores corresponding to the computing task nodes.
Further, the determining the score of each computing task node in the task link graph based on the scores corresponding to all the terminal nodes includes:
selecting a node to be evaluated in the task link graph, wherein the node to be evaluated is a computing task node that has not yet been scored and whose directly connected downstream computing task nodes have been scored;
determining the score of the node to be evaluated based on the scores of the downstream computing task nodes directly connected with the node to be evaluated;
reselecting a node to be evaluated and repeating the step of determining the score of the node to be evaluated based on the scores of the downstream computing task nodes directly connected with it, until each computing task node in the task link graph has a corresponding score.
Further, the node information includes: the number of accesses within a preset time, the number of exports within the preset time, the access time corresponding to each access, and the export time corresponding to each export;
the determining the score corresponding to the terminal node based on the node information of the terminal node includes:
determining the score corresponding to the terminal node based on the number of accesses within the preset time, the number of exports within the preset time, the access time corresponding to each access, and the export time corresponding to each export.
Further, different task links share the same computing task node;
the determining the score of the node to be evaluated based on the score of the downstream computing task node directly connected with the node to be evaluated comprises:
judging whether the node to be evaluated exists in different task links;
when the node to be evaluated exists in different task links, obtaining the scores of all the computing task nodes directly connected downstream of the node to be evaluated;
assigning the largest of the scores of all the computing task nodes directly connected downstream of the node to be evaluated to the node to be evaluated.
Further, the node information further comprises a node label, and the node label is used for representing the importance degree of the computing task node;
before determining the important task node based on the corresponding score of each computing task node, the method further comprises:
setting the score of each computing task node that carries the node label to the maximum value among the scores of the computing task nodes.
Further, the determining important task nodes based on the corresponding scores of the computing task nodes comprises:
taking the computing task node with the score larger than a preset first score threshold value as the important task node;
or, sorting the computing task nodes by score from largest to smallest, and taking the nodes ranked within a specified number of top positions as the important task nodes.
Further, the method further comprises: freezing or taking offline the computing task nodes whose scores are lower than a preset second score threshold, according to the scores of the computing task nodes.
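The two optional steps above, the node-label override and the second score threshold, can be sketched as follows. This is a minimal illustration rather than the applicant's implementation: the function names `apply_label_override` and `freeze_candidates`, the node names, and the threshold value are hypothetical, and applying the label override before the freeze check is an assumption that the text does not state.

```python
def apply_label_override(scores, labelled_nodes):
    """Give every labelled node the maximum score found among all nodes."""
    top = max(scores.values())
    return {node: (top if node in labelled_nodes else s)
            for node, s in scores.items()}


def freeze_candidates(scores, second_threshold):
    """Return nodes whose score is below the second score threshold."""
    return sorted(node for node, s in scores.items() if s < second_threshold)


# Hypothetical scores on a 0-100 scale.
scores = {"a": 90, "b": 40, "c": 75, "d": 10}

# Node "d" carries an importance label, so it is raised to the top score.
adjusted = apply_label_override(scores, labelled_nodes={"d"})

# Nodes below the (hypothetical) second threshold of 50 are candidates
# for freezing or being taken offline.
to_freeze = freeze_candidates(adjusted, second_threshold=50)
```

In this sketch the labelled node "d" is no longer a freeze candidate once its score has been raised, which is one plausible reading of why the label step is placed before the scores are used.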
In another aspect, the present invention provides a data importance identification apparatus, including:
a data acquisition module, configured to acquire node information of each computing task node in the data warehouse and the association relationships among the computing task nodes;
a link graph generation module, configured to generate at least one task link graph based on the association relationships among the computing task nodes, where the task link graph includes a plurality of task links, and each task link includes a plurality of computing task nodes connected by directed links;
a terminal node determining module, configured to determine the terminal node of each task link based on the task link graph;
a first scoring module, configured to determine a score corresponding to the terminal node based on the node information of the terminal node;
a second scoring module, configured to determine the score of each computing task node in the task link graph based on the scores corresponding to all the terminal nodes;
and an important task node determining module, configured to determine the important task nodes based on the scores corresponding to the computing task nodes.
In another aspect, the present invention provides an electronic device comprising:
a processor;
a memory for storing the processor-executable instructions;
wherein the processor is configured to execute the instructions to implement the data importance identification method as described above.
In another aspect, the present invention provides a computer-readable storage medium, wherein instructions stored in the storage medium, when executed by a processor of a data importance identification apparatus/electronic device, enable the data importance identification apparatus/electronic device to perform the data importance identification method as described above.
The implementation of the present application has the following beneficial effects:
According to the present application, the computing task nodes in the data warehouse are organized into corresponding task link graphs according to the association relationships among them. Because a task link graph connects associated computing task nodes through directed links, it displays the relationships among different computing task nodes visually and shows as a whole how different task links interleave, which makes it easier for the responsible maintenance personnel to manage their task links. The terminal nodes of each task link in the task link graph can then be scored according to their node information, and the scores of all the computing task nodes are obtained from them. The important task nodes in the data warehouse can be determined from these scores, so that maintenance personnel can strengthen management of the important task nodes and prevent them from being frozen or taken offline because they have gone unmaintained for a long time.
Drawings
In order to more clearly illustrate the technical solution of the present invention, the drawings used in the description of the embodiment or the prior art will be briefly described below. It is obvious that the drawings in the following description are only some embodiments of the invention, and that for a person skilled in the art, other drawings can be derived from them without inventive effort.
FIG. 1 is an architecture diagram of an implementation environment of a data importance identification method according to an embodiment of the present application;
FIG. 2 is a schematic flowchart of a data importance identification method according to an embodiment of the present application;
FIG. 3 is a schematic flow chart diagram illustrating another data importance identification method according to an embodiment of the present application;
FIG. 4 is a schematic flow chart diagram illustrating a further data importance identification method according to an embodiment of the present application;
FIG. 5 is a schematic structural diagram of a data importance identification apparatus according to an embodiment of the present invention;
FIG. 6 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be obtained by a person skilled in the art without any inventive step based on the embodiments of the present invention, are within the scope of the present invention.
It should be noted that the terms "first," "second," and the like in the description and claims of the present invention and in the drawings described above are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the invention described herein are capable of operation in sequences other than those illustrated or described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, apparatus, article, or device that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or device.
For a better understanding of the present application, reference will now be made to the following terms:
A data warehouse (Data Warehouse in English) may be abbreviated as DW or DWH. A data warehouse is a strategic collection that provides all types of data support for decision-making processes at all levels of an enterprise. It is a single data store created for analytical reporting and decision support. For enterprises that need business intelligence, it provides guidance for improving business processes and for monitoring time, cost, quality and control. The input side of the data warehouse is the various data sources, and the final output is used for the enterprise's data analysis, data mining, data reporting and the like.
The data warehouse may include a temporary storage layer, a data warehouse layer, and an application layer, wherein:
ODS layer (temporary storage layer): this layer stays close to the source, and its data is isomorphic with the data of the source system. Loading is generally divided into full updates and incremental updates, and some simple cleaning is usually performed while the source data is loaded.
DW layer (data warehouse layer): the dates associated with some data are split so that the data is classified more specifically, generally into years, months and days. The ETL scripts from the ODS layer to the DW layer clean and design the data according to business requirements; if there are no business requirements, the data is processed according to the data structure of the source system and future planning. The data in this layer is required to be consistent and accurate, and its integrity should be established as far as possible.
APP layer (application layer): provides the data required for reports and data sand-table displays.
It is understood that the data in the data warehouse may correlate different data in corresponding relationships, such as the relationship between orders (order number, payer …) and payments (amount, time of payment …), through a data model.
In order to implement the technical solution of the present application, so that more engineering workers can easily understand and apply the present application, the working principle of the present application will be further described with reference to specific embodiments.
Referring to fig. 1, fig. 1 is a diagram illustrating an implementation environment architecture of a data importance identification method according to an embodiment of the present application, and as shown in fig. 1, the application environment may include a server 01 and a terminal 02.
In an alternative embodiment, the server 01 may be an independent physical server, or may be a server cluster, a distributed system, or a cloud computing service center formed by a plurality of physical servers, at least one virtual machine server may be built in one physical server, and at least one application may be built on each virtual machine server. The server 01 may be used to control application service resources, and may also provide basic services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communications, middleware services, domain name services, and security services. It is to be understood that the data warehouse in the present specification may be stored in the above-described server.
In an alternative embodiment, the terminal 02 may include at least one terminal through which a user may access an application in a virtual server to develop a service. Specifically, the terminal 02 may include, but is not limited to, a smart phone, a desktop computer, a notebook computer, a digital assistant, a smart wearable device, and other types of electronic devices, which are not specifically limited in this embodiment of the present application. Optionally, the operating system running on the terminal 02 may include, but is not limited to, Windows, Linux, Android, iOS, and the like.
In addition, the terminal 02 may display an application interface. The application interface may be used to receive an operation instruction from a user to complete data importance identification, and may also be used to display an importance identification result, where the importance identification result may be information about a plurality of important task nodes, such as the number (ID) of a computing task node.
It should be noted that the terminal 02 in the present application may also store a data warehouse.
The terminal 02 establishes a communication connection with the server 01 through a wired or wireless network.
An embodiment of a data importance identification method according to the present application is described below. FIG. 2 is a schematic flowchart of a data importance identification method according to an embodiment of the present application. The present specification provides the method operation steps as described in the embodiment or the flowchart, but more or fewer operation steps may be included based on conventional or non-inventive labor. The order of steps recited in the embodiments is merely one of many possible execution orders and does not represent the only order of execution. Specifically, as shown in FIG. 2, the method may be executed by the terminal 02 or the server 01, or by both together, and the method may include:
S102: acquiring node information of each computing task node in the data warehouse and the association relationships among the computing task nodes.
Specifically, a computing task node may be used to count or compute data in the data warehouse. There are multiple computing task nodes in the data warehouse.
Specifically, the association relationships among the computing task nodes may be obtained through a data model. An association relationship represents the correlation between different computing task nodes: if the data in computing task node 2 is obtained from the data in computing task node 1, computing task node 2 is associated with computing task node 1. It is understood that the data model contains the association relationships of all the computing task nodes in the data warehouse.
Specifically, the node information may be understood as the usage of the corresponding computing task node.
In practical application, the node information and the association relation of each computing task node can be acquired from a data warehouse task scheduling system.
S104: generating at least one task link graph based on the association relationships among the computing task nodes, wherein the task link graph comprises a plurality of task links, and each task link comprises a plurality of computing task nodes connected by directed links.
Specifically, the task link graph may be a directed acyclic graph, and a task link graph may be one complete task link or a plurality of task links that are connected to one another. In practical applications, different research and development teams of an enterprise each maintain the computing tasks contained in different task links.
Specifically, computing task nodes that have an association relationship may be connected by a directed link.
Each node in the task link graph may correspond to the above-described computation task node one to one. In practical application, the nodes in the task link graph can be shown in the form of numbers, and different numbers represent different computing task nodes. In an alternative embodiment, the number may be a node identifier of a computing task node, and the node identifier may be used to distinguish different computing task nodes.
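Step S104 can be illustrated with a short sketch that turns association relationships into a directed graph. The representation is an assumption made for illustration: each association is a hypothetical `(upstream, downstream)` pair of node numbers, and the graph is a dictionary mapping each node to its directly connected downstream nodes. The function name `build_task_link_graph` does not come from the text.

```python
from collections import defaultdict


def build_task_link_graph(associations):
    """Build a map from each node to its direct downstream nodes.

    `associations` is an iterable of (upstream, downstream) pairs, meaning
    that the downstream node's data is obtained from the upstream node.
    """
    downstream = defaultdict(set)
    nodes = set()
    for up, down in associations:
        downstream[up].add(down)
        nodes.update((up, down))
    # Every node gets an entry, including nodes with no downstream node.
    return {node: sorted(downstream[node]) for node in sorted(nodes)}


# Hypothetical associations: 1 -> 2 -> 3, and 3 feeds both 4 and 5.
associations = [(1, 2), (2, 3), (3, 4), (3, 5)]
graph = build_task_link_graph(associations)
```

Here the two task links 1-2-3-4 and 1-2-3-5 share nodes 1, 2 and 3, which is the shared-node situation the description discusses later.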
S106: determining the terminal node of each task link based on the task link graph.
Specifically, a terminal node may be understood as the node at the end of a task link in the task link graph; different task links may share the same terminal node or have different terminal nodes. In practical applications, a terminal node may output a final output table of a data warehouse computing task, and the final output table may be a fact table or a dimension table. A fact table records the full information of the analyzed content, including the specific elements of each event and what specifically occurred; it stores numeric IDs and metric information. A dimension table holds descriptive information about the elements of the events in the fact table, and may be, for example, a product dimension, a time dimension, a region dimension, a user dimension, or a payment dimension.
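As an illustration of step S106, a terminal node can be found as a node with no downstream node. The sketch below assumes the graph is stored as a dictionary from each node to its direct downstream nodes; that representation and the name `find_terminal_nodes` are illustrative choices, not part of the described method.

```python
def find_terminal_nodes(graph):
    """Return the nodes that have no downstream node (ends of task links)."""
    return sorted(node for node, downs in graph.items() if not downs)


# Hypothetical graph: 1 -> 2 -> 3, and 3 feeds both 4 and 5.
graph = {1: [2], 2: [3], 3: [4, 5], 4: [], 5: []}
terminals = find_terminal_nodes(graph)
```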
S108: determining a score corresponding to the terminal node based on the node information of the terminal node.
Specifically, the node information may include: the number of accesses within a preset time, the number of exports within the preset time, the access time corresponding to each access, and the export time corresponding to each export.
In an optional embodiment, the determining, based on the node information of the terminal node, a score corresponding to the terminal node includes:
determining the score corresponding to the terminal node based on the number of accesses within the preset time, the number of exports within the preset time, the access time corresponding to each access, and the export time corresponding to each export.
Specifically, when the final output table output by the terminal node is accessed or exported, the terminal node may record the corresponding operation accessed or exported, and may also record the time when the corresponding operation occurs.
It will be appreciated that being accessed or exported characterizes the importance of the corresponding terminal node: the more times a terminal node is accessed or exported, the more important it is relative to other terminal nodes.
In the embodiments of the present specification, the score corresponding to the terminal node may be determined according to the number of accesses within the preset time, the number of exports within the preset time, the access time corresponding to each access, and the export time corresponding to each export.
Specifically, the preset time is not specifically limited in the embodiments of the present specification and may be set according to actual needs, for example, one month.
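The node information described above can be derived from a log of access and export operations. The following sketch assumes a hypothetical event format, a list of `(kind, day)` tuples with `kind` being `"access"` or `"export"`, and a 30-day window standing in for "one month". The helper name `summarize_usage` and the returned `days_since_last_use` value are illustrative assumptions.

```python
from datetime import date, timedelta


def summarize_usage(events, today, window_days=30):
    """Count accesses and exports in the window and find the last use.

    Only historical events (strictly before `today`) are considered.
    Returns (access_count, export_count, days_since_last_use); the last
    value is None when the node has no historical event at all.
    """
    start = today - timedelta(days=window_days)
    in_window = [(kind, day) for kind, day in events if start <= day < today]
    access_count = sum(1 for kind, _ in in_window if kind == "access")
    export_count = sum(1 for kind, _ in in_window if kind == "export")
    past_days = [day for _, day in events if day < today]
    days_since_last_use = (today - max(past_days)).days if past_days else None
    return access_count, export_count, days_since_last_use


# Hypothetical operation log for one terminal node.
events = [
    ("access", date(2021, 12, 16)),
    ("access", date(2021, 12, 1)),
    ("export", date(2021, 12, 10)),
    ("access", date(2021, 10, 1)),  # outside the 30-day window
]
summary = summarize_usage(events, today=date(2021, 12, 17))
```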
In an alternative embodiment, when scoring the terminal node, weights may be assigned to the number of accesses and the number of exports respectively, and the score may be obtained by multiplying each count by its weight and summing the products, for example score = v*a + e*b, where score is the score of the terminal node, v is the number of accesses within the preset time, a is the weight corresponding to accesses, e is the number of exports within the preset time, and b is the weight corresponding to exports.
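The weighted sum of v and e with weights a and b can be written directly in code. The default weights below (1.0 per access, 5.0 per export) are hypothetical values chosen for illustration; the text does not specify them.

```python
def weighted_score(v, e, a=1.0, b=5.0):
    """Score a terminal node as v*a + e*b.

    v: number of accesses within the preset time
    e: number of exports within the preset time
    a, b: weights for accesses and exports (hypothetical defaults)
    """
    return v * a + e * b


# A node accessed 10 times and exported twice within the preset time.
example = weighted_score(v=10, e=2)
```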
In an alternative embodiment, when the terminal node is scored, the terminal node may also be scored according to the following formula:
score=min((v+e*5),30)/((x/15)^3.5+2)
where x is the difference, in days, between the date on which the data was last accessed or exported and the current date. It is understood that, in the embodiments of the present specification, the data whose importance is identified is historical data and does not include data of the current day; therefore x is greater than or equal to 1.
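The formula above can be evaluated as in the following sketch. The function name `terminal_score` is hypothetical, and treating x as a whole number of days is an assumption based on the remark that x is at least 1. The numerator caps the usage count at 30, and the denominator grows with x, so a node that has not been used recently scores lower.

```python
def terminal_score(v, e, x):
    """Evaluate score = min(v + e*5, 30) / ((x/15)^3.5 + 2).

    v: number of accesses within the preset time
    e: number of exports within the preset time
    x: days since the data was last accessed or exported (x >= 1)
    """
    if x < 1:
        raise ValueError("x must be at least 1 (current-day data is excluded)")
    return min(v + e * 5, 30) / ((x / 15) ** 3.5 + 2)


recent = terminal_score(v=10, e=2, x=1)    # used yesterday
stale = terminal_score(v=10, e=2, x=30)    # last used a month ago
capped = terminal_score(v=100, e=0, x=15)  # numerator capped at 30
```

With x = 15 the denominator is exactly 3, so the capped example evaluates to 30 / 3 = 10.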
In an alternative embodiment, in order to better identify the importance of different terminal nodes, the score (score) may be normalized to be within the interval of 0-100.
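The text does not say how the scores are normalized. One simple possibility, shown here purely as an assumption, is to scale all raw scores so that the highest one maps to 100. The function name `normalize_scores` and the node names are hypothetical.

```python
def normalize_scores(raw_scores, upper=100.0):
    """Scale scores so that the highest raw score maps to `upper`."""
    highest = max(raw_scores.values())
    if highest <= 0:
        # No node has any usage; every normalized score is zero.
        return {node: 0.0 for node in raw_scores}
    return {node: s * upper / highest for node, s in raw_scores.items()}


# Hypothetical raw scores of three terminal nodes.
raw = {"t4": 4.0, "t5": 6.0, "t6": 10.0}
normalized = normalize_scores(raw)
```

Scaling by the maximum keeps a raw score of zero at zero; a min-max rescaling would be another reasonable choice but would map the lowest-scoring node to zero regardless of its usage.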
It can be seen that the score of a terminal node is primarily associated with the number of times it is accessed or exported.
S110: determining the score of each computing task node in the task link graph based on the scores corresponding to all the terminal nodes.
In an optional embodiment, FIG. 3 is a schematic flowchart of another data importance identification method provided in an embodiment of the present application. As shown in FIG. 3, the determining the score of each computing task node in the task link graph based on the scores corresponding to all the terminal nodes includes:
S202: selecting a node to be evaluated in the task link graph, wherein the node to be evaluated is a computing task node that has not yet been scored and whose directly connected downstream computing task nodes have been scored.
Specifically, after scoring each computation task node in the task link graph, a scoring tag may be added to the scored computation task node to distinguish the scored computation task node. The corresponding score may also be added directly to the corresponding computing task node.
S204: determining the score of the node to be evaluated based on the scores of the downstream computing task nodes directly connected with the node to be evaluated.
S206: reselecting a node to be evaluated and repeating the step of determining the score of the node to be evaluated based on the scores of the downstream computing task nodes directly connected with it, until each computing task node in the task link graph has a corresponding score.
Specifically, different task links may share the same computing task node.
In an optional embodiment, FIG. 4 is a schematic flowchart of yet another data importance identification method provided in an embodiment of the present application. As shown in FIG. 4, the determining the score of the node to be evaluated based on the score of a downstream computing task node directly connected with the node to be evaluated includes:
S402: judging whether the node to be evaluated exists in different task links;
S404: when the node to be evaluated exists in different task links, obtaining the scores of all the computing task nodes directly connected downstream of the node to be evaluated;
S406: assigning the largest of the scores of all the computing task nodes directly connected downstream of the node to be evaluated to the node to be evaluated.
Specifically, different task links in the task link graph may share the same computing task node. To determine the score of the node to be evaluated accurately, it may first be judged whether the node to be evaluated exists in different task links; when it does, the largest of the scores of all the computing task nodes directly connected downstream of the node to be evaluated may be assigned to it. For example, computing task node 4 and computing task node 5 are directly connected downstream of computing task node 3, the score of computing task node 4 is 20, and the score of computing task node 5 is 30; the score of computing task node 3 may then be determined to be 30. After the score of the node to be evaluated is determined, a new node to be evaluated is selected and its score is determined based on the scores of its directly connected downstream computing task nodes; this is repeated until each computing task node in the task link graph has a corresponding score.
It can be understood that the score of the selected node to be evaluated is determined by the scores of the computing task nodes directly connected downstream of it.
It is understood that when there is only one computing task node downstream of the node to be evaluated, the score of the node to be evaluated may be the score of that downstream computing task node.
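The upstream scoring loop of steps S202 to S206, together with the "largest downstream score" rule of S402 to S406, can be sketched as follows. The graph representation (a dictionary from each node to its direct downstream nodes) and the function name `propagate_scores` are assumptions for illustration. The sketch also assumes that a node is selected only once all of its direct downstream nodes have been scored, which is one reading of the selection condition.

```python
def propagate_scores(graph, terminal_scores):
    """Assign each non-terminal node the largest score of its direct
    downstream nodes, working upstream from the terminal nodes."""
    scores = dict(terminal_scores)
    pending = [node for node in graph if node not in scores]
    while pending:
        # Nodes to be evaluated: unscored, with every downstream node scored.
        ready = [node for node in pending
                 if graph[node] and all(d in scores for d in graph[node])]
        if not ready:
            raise ValueError("cycle or unscored terminal node in the graph")
        for node in ready:
            scores[node] = max(scores[d] for d in graph[node])
        pending = [node for node in pending if node not in scores]
    return scores


# The example from the description: nodes 4 and 5 are directly
# downstream of node 3, with scores 20 and 30.
graph = {1: [2], 2: [3], 3: [4, 5], 4: [], 5: []}
scores = propagate_scores(graph, terminal_scores={4: 20, 5: 30})
```

Node 3 receives 30, the larger of its two downstream scores, and nodes 2 and 1 each inherit 30 from their single downstream node, matching the single-downstream case described above.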
And S112, determining important task nodes based on the corresponding scores of the calculation task nodes.
Specifically, the important task node may be determined according to the score of each computing task node.
In an optional embodiment, the determining an important task node based on the corresponding score of each computing task node includes:
taking a computing task node whose score is greater than a preset first score threshold as the important task node.
Specifically, the preset first score threshold may be set in advance according to actual needs and may be adjusted.
In practical application, a computing task node whose score is greater than the preset first score threshold may be taken as an important task node.
Alternatively, the computing task nodes are sorted by score from largest to smallest, and the nodes ranked within a pre-specified number of top positions are taken as important task nodes.
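The two selection options can be illustrated with a short Python sketch. The function names, the dictionary of scores, and the parameter `top_n` are hypothetical; the sketch also assumes that the ranking option orders nodes from the highest score to the lowest.

```python
def important_by_threshold(scores, first_threshold):
    """Return the nodes whose score is greater than the first score threshold."""
    return {node for node, score in scores.items() if score > first_threshold}


def important_by_rank(scores, top_n):
    """Return the top_n nodes after sorting by score from largest to smallest."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:top_n]


# Hypothetical scores for three computing task nodes.
scores = {"node_a": 10, "node_b": 30, "node_c": 20}
print(important_by_threshold(scores, 15))
print(important_by_rank(scores, 2))
```

The threshold option returns a variable number of nodes, while the ranking option always returns a fixed number, which may suit teams that can only maintain a bounded set of nodes.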
Specifically, the determined important task nodes may be the important nodes in the current data warehouse. To make the important task nodes easier to look up, they may be assigned to the corresponding maintenance departments, which strengthens their timely maintenance and avoids situations in which taking an important task node offline or freezing it affects the stability of enterprise data assets.
In the present application, the computing task nodes in the data warehouse are associated into corresponding task link graphs according to the association relationships among them. A task link graph connects associated computing task nodes through directional links, so that the relationships among different computing task nodes are displayed visually, the way different task links interleave can be viewed as a whole, and maintenance personnel can conveniently manage the corresponding task links. The terminal nodes of each task link in the task link graph are then scored according to the node information of the computing task nodes, and the scores of all computing task nodes are obtained from them. Important task nodes in the data warehouse can be determined from these scores, so that maintenance personnel can strengthen management of the important task nodes and prevent them from being frozen or taken offline because they have gone unmaintained for a long time.
On the basis of the above embodiment, in an embodiment of the present specification, the node information further includes a node tag, where the node tag is used to represent the importance degree of the computing task node;
before determining the important task node based on the corresponding score of each computing task node, the method further comprises:
setting the score of each computing task node that has the node label to the maximum value of the scores of the computing task nodes.
Specifically, the node label may be added directly to the corresponding computing task node by the user, and the node label may indicate to the user that the corresponding computing task node is an important data node of the data warehouse.
It can be appreciated that some computing task nodes in a data warehouse are used infrequently but are nevertheless relatively important to the enterprise. Computing task nodes with node labels should be maintained over the long term and should not be taken offline or frozen.
In practical application, the importance of a computing task node can be highlighted by adding a node label to it.
It will be appreciated that only some of the computing task nodes may have node labels.
In practical application, after the scores of the respective calculation task nodes are determined by the scores of the downstream calculation task nodes, the scores of the calculation task nodes having the node labels may be set to the maximum value of the scores of the respective calculation task nodes.
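A minimal Python sketch of this label override is given below. The function name `apply_node_labels` and the representation of labels as a set of node identifiers are assumptions made for illustration only.

```python
def apply_node_labels(scores, labeled_nodes):
    """Raise every labeled node to the maximum score found among all nodes.

    scores: dict mapping node -> score obtained from downstream propagation.
    labeled_nodes: set of nodes that carry a user-added importance label.
    """
    if not scores:
        return {}
    top_score = max(scores.values())
    return {
        node: top_score if node in labeled_nodes else score
        for node, score in scores.items()
    }


# Hypothetical example: "rarely_used" has a low score but carries a label.
scores = {"daily_report": 30, "rarely_used": 2, "scratch": 1}
print(apply_node_labels(scores, {"rarely_used"}))
```

Because the override runs after propagation and before important task nodes are selected, a labeled node always ties with the highest-scoring node and therefore survives any threshold that the top node survives.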
By supporting explicitly designated important task nodes, the data importance identification method provided by the embodiments of this specification broadens its application scenarios, provides closed-loop management of important task node identification for the data warehouse, and prevents important computing task nodes from being taken offline or frozen merely because they are used infrequently.
On the basis of the above embodiment, in an embodiment of the present specification, the method further includes: freezing or taking offline the computing task nodes whose scores are lower than a preset second score threshold, according to the scores of the computing task nodes.
Specifically, the preset second score threshold is not specifically limited in the embodiments of the present specification, and may be set according to actual needs.
In practical application, the computing task nodes whose scores are lower than the preset second score threshold may be frozen or taken offline. A frozen or offline computing task node can be understood as one that no longer needs maintenance management.
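As a hedged illustration, the selection of nodes to freeze or take offline might look like the following Python sketch. The function name and the choice to return a list of candidates, rather than to act on the data warehouse directly, are assumptions of this example.

```python
def nodes_to_freeze_or_offline(scores, second_threshold):
    """List the nodes whose score is lower than the second score threshold."""
    return [node for node, score in scores.items() if score < second_threshold]


# Hypothetical scores; the second threshold is set to 5 for illustration.
scores = {"daily_report": 30, "legacy_export": 3, "scratch": 1}
print(nodes_to_freeze_or_offline(scores, 5))
```

Returning candidates instead of freezing them immediately leaves room for a manual review step, which is a design choice of this sketch and not something the embodiment requires.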
According to the data importance identification method provided by the embodiments of this specification, freezing or taking offline the computing task nodes whose scores are lower than the preset second score threshold reduces the amount of data maintenance required for the data warehouse, reduces system memory usage, and improves the data processing efficiency of the data warehouse.
In another aspect, the present invention provides a data importance identification apparatus. Fig. 5 is a schematic structural diagram of the data importance identification apparatus provided in an embodiment of the present invention. Referring to Fig. 5, the apparatus may include:
a data obtaining module 801, configured to obtain node information of each computing task node in the data warehouse and the association relationships between the computing task nodes;
a link graph generating module 802, configured to generate at least one task link graph based on the association relationships between the computing task nodes, where the task link graph includes a plurality of task links, and each task link includes a plurality of computing task nodes connected by directional links;
a terminal node determining module 803, configured to determine a terminal node of each task link based on the task link graph;
a first scoring module 804, configured to determine a score corresponding to the terminal node based on the node information of the terminal node;
a second scoring module 805, configured to determine the scores of the computing task nodes in the task link graph based on the scores corresponding to all the terminal nodes;
and an important task node determining module 806, configured to determine an important task node based on the score corresponding to each computing task node.
It should be noted that, when the apparatus provided in the foregoing embodiment implements its functions, the division into the functional modules described above is only an example. In practical applications, the functions may be allocated to different functional modules as needed; that is, the internal structure of the apparatus may be divided into different functional modules to implement all or part of the functions described above. In addition, the apparatus embodiment and the method embodiments provided above belong to the same concept; the specific implementation process is described in detail in the method embodiments and is not repeated here.
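To show how the modules might fit together, the following Python sketch chains one function per module. It is a rough illustration under several assumptions: terminal nodes are taken to be nodes with no downstream node, the graph is assumed to be acyclic, and `scorer` is a hypothetical placeholder for the terminal scoring rule, which is not reproduced here.

```python
def build_link_graph(nodes, relations):
    """Module 802: adjacency map built from (upstream, downstream) pairs."""
    graph = {node: [] for node in nodes}
    for upstream, downstream in relations:
        graph[upstream].append(downstream)
    return graph


def find_terminal_nodes(graph):
    """Module 803: nodes with no downstream node (an assumption of this sketch)."""
    return [node for node, children in graph.items() if not children]


def score_terminal_nodes(terminals, node_info, scorer):
    """Module 804: apply the placeholder scorer to each terminal node's info."""
    return {node: scorer(node_info[node]) for node in terminals}


def score_all_nodes(graph, terminal_scores):
    """Module 805: each node takes the largest score of its direct downstream nodes."""
    scores = dict(terminal_scores)

    def visit(node):
        if node not in scores:
            scores[node] = max(visit(child) for child in graph[node])
        return scores[node]

    for node in graph:
        visit(node)
    return scores


def identify_important_nodes(nodes, relations, node_info, scorer, first_threshold):
    """Module 806, fed by the preceding modules."""
    graph = build_link_graph(nodes, relations)
    terminal_scores = score_terminal_nodes(find_terminal_nodes(graph), node_info, scorer)
    scores = score_all_nodes(graph, terminal_scores)
    return {node for node, score in scores.items() if score > first_threshold}


# Hypothetical input standing in for what module 801 would obtain.
nodes = [1, 2, 3, 4, 5]
relations = [(1, 3), (2, 3), (3, 4), (3, 5)]
node_info = {4: {"access_count": 20}, 5: {"access_count": 30}}
important = identify_important_nodes(
    nodes, relations, node_info, lambda info: info["access_count"], 25
)
print(sorted(important))
```

Keeping each module as a separate function mirrors the note above that the functional division is only an example and may be rearranged as needed.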
In another aspect, the present invention provides an electronic device for the data importance identification method. Fig. 6 is a schematic structural diagram of an electronic device according to an embodiment of the present invention. As shown in Fig. 6, the electronic device includes a processor and a memory, where the memory stores at least one instruction or at least one program, and the at least one instruction or the at least one program is loaded and executed by the processor to implement the data importance identification method described above.
The above functions, if implemented in the form of software functional units and sold or used as a separate product, may be stored in a computer-readable storage medium. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention.
An embodiment of the present invention further provides a computer-readable storage medium, where at least one instruction, at least one program, a code set, or a set of instructions is stored in the computer-readable storage medium, and the at least one instruction, the at least one program, the code set, or the set of instructions is executable by a processor of an electronic device to perform the data importance identification method described above.
Optionally, in an embodiment of the present invention, the storage medium may include, but is not limited to: various media capable of storing program codes, such as a usb disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a removable hard disk, a magnetic disk, or an optical disk.
It should be noted that the order of the above embodiments of the present invention is only for description and does not represent the merits of the embodiments. Specific embodiments of this specification have been described above. Other embodiments are within the scope of the following claims. In some cases, the actions or steps recited in the claims may be performed in a different order than in the embodiments and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing may also be possible or may be advantageous.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program instructing relevant hardware. The program can be stored in a non-volatile computer-readable storage medium and, when executed, can include the processes of the embodiments of the methods described above. Any reference to memory, storage, database, or other medium used in the embodiments provided herein may include non-volatile and/or volatile memory. Non-volatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchronous Link DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).
The embodiments in the present specification are described in a progressive manner, and the same and similar parts among the embodiments are referred to each other, and each embodiment focuses on the differences from the other embodiments. In particular, as for the apparatus, the electronic device and the storage medium embodiment, since they are substantially similar to the method embodiment, the description is relatively simple, and the relevant points can be referred to the partial description of the method embodiment.
It will be understood by those skilled in the art that all or part of the steps for implementing the above embodiments may be implemented by hardware, or may be implemented by a program instructing relevant hardware, where the program may be stored in a computer-readable storage medium, and the above-mentioned storage medium may be a read-only memory, a magnetic disk or an optical disk, etc.
It should be noted that, in the present specification, the embodiments are all described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments may be referred to each other. The implementation principle and the generated technical effect of the testing method provided by the embodiment of the invention are the same as those of the system embodiment, and for the sake of brief description, the corresponding contents in the system embodiment can be referred to where the method embodiment is not mentioned.
In the several embodiments provided in the present application, it should be understood that the disclosed system and method may be implemented in other ways. The apparatus embodiments described above are merely illustrative, and for example, the flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of apparatus, methods and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
Finally, it should be noted that the above embodiments are only specific embodiments of the present invention, used to illustrate its technical solutions rather than to limit them, and the protection scope of the present invention is not limited thereto. Although the present invention has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that anyone skilled in the art may, within the technical scope disclosed herein, still modify the technical solutions described in the foregoing embodiments, readily conceive of changes to them, or make equivalent substitutions for some of their technical features. Such modifications, changes, or substitutions do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the embodiments of the present invention, and they shall be covered by the protection scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (10)

1. A data importance identification method, the method comprising:
acquiring node information of each computing task node in a data warehouse and an incidence relation between the computing task nodes;
generating at least one task link graph based on the incidence relation among the computing task nodes, wherein the task link graph comprises a plurality of task links, and the task links comprise a plurality of computing task nodes connected through directional links;
determining a terminal node of each task link based on the task link graph;
determining a score corresponding to the terminal node based on the node information of the terminal node;
determining the score of each calculation task node in the task link graph based on the scores corresponding to all the terminal nodes;
and determining important task nodes based on the corresponding scores of the computing task nodes.
2. The data importance identification method according to claim 1, wherein the determining the score of each computing task node in the task link graph based on the scores corresponding to all the terminal nodes comprises:
selecting a node to be evaluated in the task link graph, wherein the node to be evaluated is a computing task node that has not yet been scored and whose directly connected downstream computing task nodes have been scored;
determining the score of the node to be evaluated based on the score of a downstream computing task node directly connected with the node to be evaluated;
and reselecting a node to be evaluated and repeating the step of determining the score of the node to be evaluated based on the score of a downstream computing task node directly connected with the node to be evaluated, until each computing task node in the task link graph has a corresponding score.
3. The data importance identification method according to claim 1, wherein the node information includes: the number of accesses within a preset time, the number of derivations within the preset time, the access time corresponding to each access, and the derivation time corresponding to each derivation;
the determining the score corresponding to the terminal node based on the node information of the terminal node includes:
determining a score corresponding to the terminal node based on the number of accesses within the preset time, the number of derivations within the preset time, the access time corresponding to each access, and the derivation time corresponding to each derivation.
4. The data importance identification method according to claim 2, wherein different task links have the same computing task node;
the determining the score of the node to be evaluated based on the score of the downstream computing task node directly connected with the node to be evaluated comprises:
determining whether the node to be evaluated exists in different task links;
when the node to be evaluated exists in different task links, obtaining the scores of all the computing task nodes directly connected downstream of the node to be evaluated;
and assigning the largest of the scores of all the computing task nodes directly connected downstream of the node to be evaluated to the node to be evaluated.
5. The data importance identification method according to claim 3, wherein the node information further comprises a node label, and the node label is used for representing the importance degree of the computing task node;
before determining the important task node based on the corresponding score of each computing task node, the method further comprises:
setting the score of the computing task node with the node label as the maximum value of the scores of the computing task nodes.
6. The data importance identification method according to any one of claims 1 to 5, wherein the determining of the important task node based on the corresponding score of each of the computing task nodes comprises:
taking the computing task node with the score larger than a preset first score threshold value as the important task node;
or, sorting the computing task nodes by score from largest to smallest, and taking the nodes ranked within a pre-specified number of top positions as important task nodes.
7. The data importance identification method of claim 6, wherein the method further comprises: freezing or taking offline the computing task nodes whose scores are lower than a preset second score threshold, according to the scores of the computing task nodes.
8. An apparatus for identifying importance of data, the apparatus comprising:
the data acquisition module is used for acquiring node information of each computing task node in the data warehouse and an incidence relation between the computing task nodes;
a link map generation module, configured to generate at least one task link map based on an association relationship between the computing task nodes, where the task link map includes a plurality of task links, and the task links include a plurality of computing task nodes connected by links with directivity;
the terminal node determining module is used for determining the terminal node of each task link based on the task link graph;
the first scoring module is used for determining a score corresponding to the terminal node based on the node information of the terminal node;
the second scoring module is used for determining the score of each calculation task node in the task link diagram based on the scores corresponding to all the terminal nodes;
and the important task node determining module is used for determining the important task node based on the corresponding scores of the computing task nodes.
9. An electronic device, comprising:
a processor;
a memory for storing the processor-executable instructions;
wherein the processor is configured to execute the instructions to implement the data importance identification method of any of claims 1 to 7.
10. A computer-readable storage medium, wherein instructions in the computer-readable storage medium, when executed by a processor of a data importance identification apparatus/electronic device, enable the data importance identification apparatus/electronic device to perform the data importance identification method of any one of claims 1 to 7.
CN202111551081.0A 2021-12-17 2021-12-17 Data importance identification method, device, equipment and medium Pending CN114385705A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111551081.0A CN114385705A (en) 2021-12-17 2021-12-17 Data importance identification method, device, equipment and medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111551081.0A CN114385705A (en) 2021-12-17 2021-12-17 Data importance identification method, device, equipment and medium

Publications (1)

Publication Number Publication Date
CN114385705A true CN114385705A (en) 2022-04-22

Family

ID=81198650

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111551081.0A Pending CN114385705A (en) 2021-12-17 2021-12-17 Data importance identification method, device, equipment and medium

Country Status (1)

Country Link
CN (1) CN114385705A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117213725A (en) * 2023-09-12 2023-12-12 国能龙源环保有限公司 Thermal power plant desulfurization equipment sealing detection method, system, terminal and storage medium
CN117213725B (en) * 2023-09-12 2024-05-14 国能龙源环保有限公司 Thermal power plant desulfurization equipment sealing detection method, system, terminal and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination