CN115630053A - Data complementing method, device, equipment, storage medium and product - Google Patents


Info

Publication number
CN115630053A
Authority
CN
China
Prior art keywords
attribute value
missing object
missing
determining
effective path
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211193733.2A
Other languages
Chinese (zh)
Inventor
解皖栋
杨萍萍
蔡科
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Construction Bank Corp
CCB Finetech Co Ltd
Original Assignee
China Construction Bank Corp
CCB Finetech Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China Construction Bank Corp, CCB Finetech Co Ltd filed Critical China Construction Bank Corp
Priority to CN202211193733.2A priority Critical patent/CN115630053A/en
Publication of CN115630053A publication Critical patent/CN115630053A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/21Design, administration or maintenance of databases
    • G06F16/215Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Quality & Reliability (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a data completion method, device, equipment, storage medium and product, and relates to the technical field of big data intelligent analysis. The method comprises: creating a bipartite graph according to the object identifiers, the attribute identifiers and the non-missing object attribute values in a current data table, wherein the object identifiers and the attribute identifiers correspond to nodes in the bipartite graph, and each non-missing object attribute value corresponds to the weight of a connecting edge of the bipartite graph; determining a missing object attribute value of the current data table, and determining at least one effective path corresponding to the missing object attribute value in the bipartite graph, wherein the effective path is a path that meets a set weight condition and comprises at least three connecting edges connecting the object identifier and the attribute identifier corresponding to the missing object attribute value; and determining a completion attribute value of the missing object according to the set weight of the at least one effective path. This solves the problem that existing data completion methods are poor in flexibility, and improves the flexibility of data completion.

Description

Data complementing method, device, equipment, storage medium and product
Technical Field
The invention relates to the technical field of big data intelligent analysis, in particular to a data completion method, a device, equipment, a storage medium and a product.
Background
For large-scale data loss, data completion is currently performed mainly through a trained data completion model. Such models are mainly statistical models and deep learning models based on feature estimation. The former tend to make strong assumptions about the data distribution and lack flexibility in handling mixed data types, such as data containing both continuous and discrete variables. The latter are inflexible in determining missing values from other observations, and are typically initialized with special default values during model training, which introduces biased assumptions about the missing values. Moreover, both kinds of model need to be retrained when new data samples are encountered, so their flexibility is poor.
In summary, the existing data completion method has the problem of poor flexibility.
Disclosure of Invention
The invention provides a data completion method, a device, equipment, a storage medium and a product, which aim to solve the problem of poor flexibility of the existing data completion method.
According to an aspect of the present invention, there is provided a data completion method, including:
creating a bipartite graph according to the object identification, the attribute identification and the non-missing object attribute value in the current data table; the object identification and the attribute identification correspond to nodes in the bipartite graph, and the non-missing object attribute value corresponds to a weight of a connecting edge of the bipartite graph;
determining a missing object attribute value of the current data table, and determining at least one effective path corresponding to the missing object attribute value in the bipartite graph, wherein the effective path is a path meeting a set weight condition and comprises at least three connecting edges connecting an object identifier and an attribute identifier corresponding to the missing object attribute value;
and determining a completion attribute value of the missing object according to the set weight of the at least one effective path.
According to another aspect of the present invention, there is provided a data complementing apparatus, including:
the bipartite graph module is used for creating a bipartite graph according to the object identification, the attribute identification and the non-missing object attribute value in the current data table; the object identification and the attribute identification correspond to nodes in the bipartite graph, and the non-missing object attribute value corresponds to a weight of a connecting edge of the bipartite graph;
an effective path module, configured to determine a missing object attribute value of the current data table, and determine at least one effective path corresponding to the missing object attribute value in the bipartite graph, where the effective path is a path that meets a set weight condition, and the path includes at least three connecting edges that connect an object identifier and an attribute identifier corresponding to the missing object attribute value;
and the completion module is used for determining the completion attribute value of the missing object according to the set weight of the at least one effective path.
According to another aspect of the present invention, there is provided an electronic apparatus including:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores a computer program executable by the at least one processor, the computer program being executable by the at least one processor to enable the at least one processor to perform the data completion method of any of the embodiments of the present invention.
According to another aspect of the present invention, there is provided a computer-readable storage medium storing computer instructions for causing a processor to implement a data completion method according to any one of the embodiments of the present invention when the computer instructions are executed.
According to another aspect of the present invention, a computer program product is provided, the computer program product comprising a computer program which, when executed by a processor, implements the data completion method of any of the embodiments.
Compared with the prior art, the technical solution of the embodiment of the present invention determines the effective paths of a missing object attribute value based on the bipartite graph, which improves the flexibility of determining effective paths. Because the weights of the connecting edges of an effective path correspond to non-missing object attribute values, the completion attribute value is determined from the set weights of all effective paths corresponding to the missing object attribute value. The completion attribute value is thus determined based on the association between the missing object attribute value and the non-missing object attribute values, which improves the flexibility, accuracy and universality of data completion.
It should be understood that the statements in this section do not necessarily identify key or critical features of the embodiments of the present invention, nor do they necessarily limit the scope of the invention. Other features of the present invention will become apparent from the following description.
Drawings
In order to illustrate the technical solutions in the embodiments of the present invention more clearly, the drawings used in the description of the embodiments are briefly introduced below. The drawings described below show only some embodiments of the present invention, and those skilled in the art can obtain other drawings from them without creative effort.
FIG. 1 is a flowchart of a data completion method provided according to an embodiment of the present invention;
FIG. 2 is a bipartite graph corresponding to Table 1 provided according to an embodiment of the present invention;
FIG. 3 is a flowchart of another data completion method provided according to an embodiment of the present invention;
FIG. 4 is a flowchart of yet another data completion method provided according to an embodiment of the present invention;
FIG. 5 is a flowchart of yet another data completion method provided according to an embodiment of the present invention;
FIG. 6 is a block diagram of a data completion apparatus according to an embodiment of the present invention;
FIG. 7 is a block diagram of another data completion apparatus according to an embodiment of the present invention;
FIG. 8 is a schematic structural diagram of an electronic device implementing an embodiment of the present invention.
Detailed Description
The present invention will be described in further detail with reference to the accompanying drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the invention and are not limiting of the invention. It should be further noted that, for the convenience of description, only some structures related to the present invention are shown in the drawings, not all of them.
It should be noted that: like reference numbers and letters refer to like items in the following figures, and thus, once an item is defined in one figure, it need not be further defined or explained in subsequent figures. Meanwhile, in the description of the present invention, the terms "first", "second", and the like are used only for distinguishing the description, and are not to be construed as indicating or implying relative importance.
In the technical solution of the present invention, the acquisition, storage, use and processing of data all comply with the relevant national laws and regulations.
Examples
FIG. 1 is a flowchart of a data completion method according to an embodiment of the present invention. This embodiment is applicable to the case where a missing object attribute value in a data table is completed according to an effective path of that value, determined from the bipartite graph corresponding to the data table. As shown in FIG. 1, the method includes:
S110, creating a bipartite graph according to the object identification, the attribute identification and the non-missing object attribute value in the current data table; the object identification and the attribute identification correspond to nodes in the bipartite graph, and the non-missing object attribute value corresponds to the weight of the connecting edge of the bipartite graph.
The object identifier is an analysis object identifier, such as a customer identifier, an employee identifier, an engineering project identifier, and the like;
the attribute identifier is a statistical item identifier of the object, such as a telephone charge identifier, an electric charge identifier, a water charge identifier, a diet consumption identifier, and the like.
The non-missing object attribute value is an object attribute value explicitly recorded in the current data table.
A bipartite graph (also called a bigraph) is a graph whose vertex set V can be divided into two mutually disjoint subsets such that the two endpoints of every edge belong to different subsets, and no two vertices within the same subset are adjacent. In this embodiment, the two mutually disjoint subsets are the object node subset corresponding to all object identifiers and the attribute node subset corresponding to all attribute identifiers in the current data table.
In this embodiment, each connecting edge of the bipartite graph has a weight, and the weight is equal to the corresponding non-missing object attribute value.
A missing object attribute value is an object attribute value that is not recorded in the current data table. It can be understood that, when a missing object attribute value exists in the current data table, the bipartite graph has no connecting edge between the object identifier and the attribute identifier corresponding to that value. Take Table 1 as the current data table, where S1, S2 and S3 are object identifiers, F1, F2, F3 and F4 are attribute identifiers, and the value at the intersection of an object identifier and an attribute identifier is the object attribute value (NA denotes a missing value). The bipartite graph of Table 1 is shown in FIG. 2.
TABLE 1
      F1    F2    F3    F4
S1    0.4   0.5   NA    0.2
S2    NA    NA    0.7   0.2
S3    0.4   NA    NA    0.5
Comparing Table 1 with the bipartite graph in FIG. 2 shows that the connecting edges in FIG. 2 correspond one to one to the non-missing object attribute values in Table 1, and that the weight of each connecting edge in FIG. 2 is equal to the corresponding non-missing object attribute value. The missing object attribute values in Table 1 have no corresponding connecting edges in the bipartite graph in FIG. 2.
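The mapping from Table 1 to the bipartite graph can be sketched as follows. This is only an illustrative sketch, not the patented implementation: the function name build_bipartite_graph and the dictionary layout are assumptions, and None is used here to stand for NA.

```python
# Table 1 from the text; None stands for NA (a missing object attribute value).
ATTRS = ["F1", "F2", "F3", "F4"]
TABLE = {
    "S1": [0.4, 0.5, None, 0.2],
    "S2": [None, None, 0.7, 0.2],
    "S3": [0.4, None, None, 0.5],
}


def build_bipartite_graph(table, attrs):
    """Return (edges, missing) for a data table.

    edges maps (object_id, attribute_id) to the edge weight, one edge per
    recorded value; missing lists the cells that get no connecting edge.
    """
    edges = {}
    missing = []
    for obj, row in table.items():
        for attr, value in zip(attrs, row):
            if value is None:
                missing.append((obj, attr))
            else:
                edges[(obj, attr)] = value
    return edges, missing


edges, missing = build_bipartite_graph(TABLE, ATTRS)
print(len(edges), "connecting edges;", len(missing), "missing cells")
```

Each recorded cell produces exactly one weighted edge, so the missing cells are precisely the object/attribute pairs without an edge.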
S120, determining the attribute value of the missing object of the current data table, and determining at least one effective path corresponding to the attribute value of the missing object in the bipartite graph, wherein the effective path is a path meeting the condition of a set weight value and comprises at least three connecting edges connecting the object identifier corresponding to the attribute value of the missing object and the attribute identifier.
The set weight condition is as follows: for the node corresponding to each attribute identifier on the path, the difference between the weights of its two adjacent connecting edges is within a set threshold range. In one embodiment, the set threshold range may be 15% or 20%.
The paths corresponding to the missing object attribute value are determined by the following steps:
Step a1, determining, in the bipartite graph, the object node and the attribute node corresponding to the object identifier and the attribute identifier of the missing object attribute value, and taking the object node as the starting object node and the attribute node as the ending attribute node.
Step a2, searching the bipartite graph for all paths connecting the starting object node and the ending attribute node, wherein any such path comprises at least three connecting edges.
Since there is no direct connecting edge between the starting object node and the ending attribute node, any path between them is an indirect path that must pass through at least one other object node, and thus includes at least three connecting edges.
No path determined by the above steps includes a closed loop, so each path leads uniquely from the starting object node to the ending attribute node.
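Steps a1 and a2 can be sketched as a depth-first search over the bipartite graph of Table 1. This is a minimal sketch under assumptions: the text does not name a search algorithm, and the names ADJ and find_paths are hypothetical.

```python
# Adjacency of the bipartite graph for Table 1 (only recorded values give edges).
ADJ = {
    "S1": ["F1", "F2", "F4"],
    "S2": ["F3", "F4"],
    "S3": ["F1", "F4"],
    "F1": ["S1", "S3"],
    "F2": ["S1"],
    "F3": ["S2"],
    "F4": ["S1", "S2", "S3"],
}


def find_paths(adj, start_obj, end_attr):
    """Enumerate loop-free paths from start_obj to end_attr (steps a1 and a2)."""
    paths = []

    def dfs(node, path):
        if node == end_attr:
            paths.append(path)
            return
        for nxt in adj.get(node, []):
            if nxt not in path:  # never revisit a node, so no closed loop
                dfs(nxt, path + [nxt])

    dfs(start_obj, [start_obj])
    return paths


# Paths for the missing object attribute value (S3, F3).
for p in find_paths(ADJ, "S3", "F3"):
    print("-".join(p))
```

Because the cell (S3, F3) is missing, there is no direct edge S3-F3, and the alternation between object and attribute nodes forces every path found to have an odd number of edges, hence at least three.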
For example, let the set threshold range be less than or equal to 20%. As shown in FIG. 2, for the missing object attribute value (S3, F3) there are two paths: path A (S3-F4-S2-F3) and path B (S3-F1-S1-F4-S2-F3). Path A and path B both include the attribute node F4' corresponding to the attribute identifier F4. In path A, the weights of the first connecting edge (S3-F4) and the second connecting edge (F4-S2) of this attribute node are 0.5 and 0.2 respectively, and the error between them is
$$\frac{0.5 - 0.2}{0.2} \times 100\% = 150\%$$
This error is not within the set threshold range, so path A is not an effective path. Path B includes the attribute node F1' corresponding to the attribute identifier F1 and the attribute node F4' corresponding to the attribute identifier F4. The weights of the first connecting edge (S3-F1) and the second connecting edge (F1-S1) of the attribute node F1' are 0.4 and 0.4 respectively, and the error between them is
$$\frac{0.4 - 0.4}{0.4} \times 100\% = 0\%$$
This error is within the set threshold range. The weights of the first connecting edge (S1-F4) and the second connecting edge (F4-S2) of the attribute node F4' are 0.2 and 0.2 respectively, and the error between them is
$$\frac{0.2 - 0.2}{0.2} \times 100\% = 0\%$$
This error is also within the set threshold range, so path B is an effective path. In this embodiment, the error is the ratio of the difference between the larger weight and the smaller weight to the smaller weight, expressed as a percentage. It can be understood that the error may also be defined in other forms, such as the difference between the weights of the first and second connecting edges as a percentage of the weight of the first connecting edge, or as a percentage of the weight of the second connecting edge; the definition can be chosen according to the actual situation. Here the first connecting edge and the second connecting edge are located on the two sides of the attribute node: the end node of the first connecting edge is the attribute node, and the start node of the second connecting edge is the attribute node.
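The weight condition in this example can be checked with a short sketch. The names relative_error and is_effective_path are hypothetical; the error follows the definition used in this embodiment (larger weight minus smaller weight, divided by the smaller weight), and the sketch assumes all weights are positive so the division is defined.

```python
# Edge weights of the bipartite graph for Table 1.
WEIGHTS = {
    ("S1", "F1"): 0.4, ("S1", "F2"): 0.5, ("S1", "F4"): 0.2,
    ("S2", "F3"): 0.7, ("S2", "F4"): 0.2,
    ("S3", "F1"): 0.4, ("S3", "F4"): 0.5,
}


def relative_error(w1, w2):
    """(larger - smaller) / smaller as a fraction; 1.5 means 150%."""
    small, large = sorted((w1, w2))
    return (large - small) / small  # assumes positive weights


def is_effective_path(path, weights, threshold=0.20):
    """Check the two edges around every intermediate attribute node.

    path alternates object, attribute, object, ..., attribute, so the
    intermediate attribute nodes sit at indices 1, 3, ..., len(path) - 3.
    """
    for i in range(1, len(path) - 1, 2):
        attr = path[i]
        w_first = weights[(path[i - 1], attr)]   # edge arriving at the node
        w_second = weights[(path[i + 1], attr)]  # edge leaving the node
        if relative_error(w_first, w_second) > threshold:
            return False
    return True


path_a = ["S3", "F4", "S2", "F3"]
path_b = ["S3", "F1", "S1", "F4", "S2", "F3"]
print("path A effective:", is_effective_path(path_a, WEIGHTS))
print("path B effective:", is_effective_path(path_b, WEIGHTS))
```

The ending attribute node is not checked because only one edge of the path touches it.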
S130, determining a completion attribute value of the missing object according to the set weight value of the at least one effective path.
Because the weight of the connection edge of the effective path is the corresponding non-missing object attribute value, the complement value of the attribute value of the missing object is determined based on the set weight of the effective path corresponding to the attribute value of the missing object, that is, the complement value of the attribute value of the missing object is determined based on the set non-missing object attribute value associated with the effective path, that is, the complement attribute value of the missing object is determined by adopting the association relationship between the attribute value of the missing object and the attribute value of the non-missing object, thereby improving the flexibility and universality of the complement of the attribute value of the missing object.
Compared with the prior art, the technical solution of the embodiment of the present invention determines the effective paths of a missing object attribute value based on the bipartite graph, which improves the flexibility of determining effective paths. Because the weights of the connecting edges of an effective path correspond to non-missing object attribute values, the completion attribute value is determined from the set weights of all effective paths corresponding to the missing object attribute value. The completion attribute value is thus determined based on the association between the missing object attribute value and the non-missing object attribute values, which improves the flexibility, accuracy and universality of data completion.
FIG. 3 is a flowchart of a data completion method according to another embodiment of the present invention. As shown in FIG. 3, compared with the foregoing embodiment, the data completion method in this embodiment adds the step of "updating the current data table, and updating the bipartite graph according to the updated current data table; returning to the step of determining the missing object attribute value of the current data table and determining at least one effective path corresponding to the missing object attribute value in the bipartite graph, until no missing object attribute value exists in the current data table or no missing object attribute value in the current data table has an effective path". The method comprises the following steps:
S210, creating a bipartite graph according to the object identification, the attribute identification and the non-missing object attribute value in the current data table; the object identification and the attribute identification correspond to nodes in the bipartite graph, and the non-missing object attribute value corresponds to the weight of the connecting edge of the bipartite graph.
S220, determining the attribute value of the missing object of the current data table, and determining at least one effective path corresponding to the attribute value of the missing object in the bipartite graph, wherein the effective path is a path meeting the condition of a set weight value and comprises at least three connecting edges connecting the object identifier corresponding to the attribute value of the missing object and the attribute identifier.
And S230, determining a completion attribute value of the missing object according to the set weight of the at least one effective path.
S240, updating the current data table, and updating the bipartite graph according to the updated current data table; and returning to the step of determining the missing object attribute value of the current data table and determining at least one effective path corresponding to the missing object attribute value in the bipartite graph, until no missing object attribute value exists in the current data table or no missing object attribute value in the current data table has an effective path.
This step is intended to achieve progressive data completion of the data table, which can be carried out through the following steps:
Step b1, updating the current data table, and updating the bipartite graph according to the updated current data table.
Step b2, determining whether the updated current data table still has a missing object attribute value; if not, executing step b3, and if so, executing step b4.
Step b3, taking the updated current data table as the current data table after data completion.
Step b4, determining whether each missing object attribute value has a corresponding effective path in the updated bipartite graph.
Step b5, if not, ending; if so, determining the completion attribute value of each such missing object attribute value based on the set weights of all effective paths corresponding to it, and returning to step b1.
It can be understood that one or more missing object attribute values may have no effective path in the bipartite graph corresponding to the current data table. After one or more rounds of data completion, the current data table is updated, and some or all of these missing object attribute values may then have effective paths in the bipartite graph corresponding to the updated current data table. Their completion attribute values can then be determined according to their respective effective paths.
Compared with the prior art, the data completion method provided by the embodiment of the invention improves the possibility of completing the attribute value of the missing object in the current data table and the accuracy of the corresponding completed attribute value in a progressive data completion mode.
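The progressive loop of steps b1 to b5 can be sketched as follows. This is a skeleton under stated assumptions rather than the patented procedure: find_effective_paths and estimate are hypothetical callables supplied by the caller (the rule that turns set weights into a completion value is not fixed here), and the loop ends when no missing value has any effective path, which is one reading of step b5.

```python
def progressive_completion(table, find_effective_paths, estimate):
    """Complete a table round by round.

    table: dict mapping (object_id, attribute_id) to a value, or None if missing.
    find_effective_paths(table, cell): hypothetical; returns a list of paths.
    estimate(table, paths): hypothetical; returns the completion value.
    Returns a completed copy; cells that never get an effective path stay None.
    """
    table = dict(table)  # work on a copy, leave the caller's table untouched
    while True:
        missing = [cell for cell, value in table.items() if value is None]
        if not missing:
            break  # step b3: no missing value left
        updates = {}
        for cell in missing:
            paths = find_effective_paths(table, cell)
            if paths:
                updates[cell] = estimate(table, paths)
        if not updates:
            break  # step b5: no missing value has an effective path
        # Step b1: update the table; the bipartite graph is derived from it,
        # so values filled in this round become edges in the next round.
        table.update(updates)
    return table
```

Every round either fills at least one cell or stops, so the loop terminates after at most as many rounds as there are missing cells.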
FIG. 4 is a flowchart of a data completion method according to another embodiment of the present invention. This embodiment further optimizes the step of "determining the missing object attribute value of the current data table" in the foregoing embodiment. Accordingly, this embodiment includes:
S310, creating a bipartite graph according to the object identification, the attribute identification and the non-missing object attribute value in the current data table; the object identification and the attribute identification correspond to nodes in the bipartite graph, and the non-missing object attribute value corresponds to the weight of the connecting edge of the bipartite graph.
S3201, determining all attribute values of the missing objects of the current data table, and determining the effective path number of each attribute value of the missing objects and the effective path confidence corresponding to the effective path number based on the bipartite graph.
Determining all effective paths corresponding to the attribute values of the missing objects in the bipartite graph, counting the number of the effective paths corresponding to the attribute values of the missing objects respectively, and taking the number of the effective paths as the confidence degrees of the effective paths, thereby obtaining the confidence degrees of the effective paths corresponding to the attribute values of the missing objects.
In the process of determining the completion attribute value of the missing object, the more the effective paths are, the more the setting weights of the effective paths can be referred to, the more the attribute values of the non-missing object can be referred to, and the higher the accuracy of the determined completion attribute value is. Therefore, the effective path confidence corresponding to the attribute value of the missing object in the step can reflect the number of the attribute values of the non-missing object which can be referred to by the attribute value of the missing object in the completion process and the accuracy of the corresponding completion attribute value.
S3202, sorting the effective path confidences of all missing object attribute values in descending order, and taking the missing object attribute values corresponding to the first preset number of confidences in the sorted result as the at least one missing object attribute value of the current data table.
Taking only the missing object attribute values with the first preset number of confidences in the sorted result has two benefits. On the one hand, it ensures the accuracy of the completion attribute values determined in the current round. On the other hand, the remaining missing object attribute values are deferred to subsequent rounds of data completion; after the current round is completed, the number of effective paths corresponding to these remaining values may increase, and more effective paths improve the accuracy of their completion attribute values, thereby improving the accuracy of the overall data completion of the current data table.
S3203, determining the attribute value of the missing object of the current data table, and determining at least one effective path corresponding to the attribute value of the missing object in the bipartite graph, wherein the effective path is a path meeting the condition of a set weight, and the path comprises at least three connecting edges connecting the object identifier corresponding to the attribute value of the missing object and the attribute identifier.
S330, determining a completion attribute value of the missing object according to the set weight of the at least one effective path.
Compared with the prior art, according to the technical scheme of the embodiment, the attribute value of the missing object entering the current round of data completion is selected based on the confidence coefficient of the effective path, so that the accuracy of the current round of data completion is improved, the number of effective paths of the attribute values of part or all of the missing objects which do not enter the current round of data completion is increased, and the accuracy of the data completion of the attribute values of the part or all of the missing objects and the accuracy of the whole data completion of the current data table are improved.
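The selection in steps S3201 and S3202 can be sketched as a sort on effective path counts. The function name, the example counts and the rule of skipping cells with zero effective paths are illustrative assumptions, not taken from the patent text.

```python
def select_for_current_round(effective_path_counts, top_n):
    """Pick the missing cells that enter the current completion round.

    effective_path_counts maps a missing cell (object_id, attribute_id) to its
    number of effective paths, which this embodiment uses as its confidence.
    """
    ranked = sorted(
        effective_path_counts.items(),
        key=lambda item: item[1],
        reverse=True,  # descending confidence; ties keep their input order
    )
    # Assumption: a cell with no effective path cannot be completed this round.
    return [cell for cell, count in ranked[:top_n] if count > 0]


# Hypothetical counts, for illustration only.
counts = {("S3", "F3"): 1, ("S1", "F3"): 2, ("S2", "F1"): 0}
print(select_for_current_round(counts, top_n=2))
```

Cells that are not selected remain missing and are reconsidered in later rounds, when newly completed values may have added effective paths for them.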
FIG. 5 is a flowchart of a data completion method according to another embodiment of the present invention. This embodiment further optimizes the step of "determining the completion attribute value of the missing object according to the weight and the correlation degree of the associated connecting edge of the at least one effective path" in the foregoing embodiment. Accordingly, this embodiment includes:
S410, creating a bipartite graph according to the object identification, the attribute identification and the non-missing object attribute value in the current data table; the object identification and the attribute identification correspond to nodes in the bipartite graph, and the non-missing object attribute value corresponds to the weight of the connecting edge of the bipartite graph.
S420, determining the attribute value of the missing object of the current data table, and determining at least one effective path corresponding to the attribute value of the missing object in the bipartite graph, wherein the effective path is a path meeting the condition of a set weight value and comprises at least three connecting edges connecting the object identifier corresponding to the attribute value of the missing object and the attribute identifier.
S4301, taking the tail connecting edge of the effective path as the associated connecting edge of the effective path, wherein one node of the tail connecting edge corresponds to the attribute identification corresponding to the attribute value of the missing object.
The associated connecting edge is a connecting edge whose weight contributes to the determination of the missing object attribute value.
Since every path comprises at least three connecting edges, every effective path selected from these paths also comprises at least three connecting edges. In this embodiment, the connecting edge incident to the ending attribute node is referred to as the end connecting edge, and the end connecting edge is used as the associated connecting edge of the effective path.
S4302, respectively determining the correlation of the associated connecting edge of each of the at least one effective path corresponding to the missing object attribute value, wherein the correlation represents the strength of association between the associated connecting edge and the missing object attribute value.
In this embodiment, if the effective path contains exactly 3 connecting edges (that is, 2 attribute nodes), the associated connecting edge is a directly associated connecting edge of the missing object attribute value, and its correlation with the missing object attribute value is the largest. If the effective path contains more than 3 connecting edges (that is, more than 2 attribute nodes), the associated connecting edge is an indirectly associated connecting edge of the missing object attribute value, and the more connecting edges or attribute nodes the effective path contains, the smaller the correlation between the associated connecting edge and the missing object attribute value.
In one embodiment, the depth of each of the at least one effective path corresponding to the missing object attribute value is determined, and the correlation of the corresponding associated connecting edge is determined according to that depth. Specifically, the correlation of the associated connecting edges may be determined by the following steps:
c1, taking the reciprocal of the depth of each effective path as the initial correlation of the corresponding associated connecting edge.
c2, normalizing the initial correlations of the at least one associated connecting edge to obtain the correlation of the at least one associated connecting edge.
Using the reciprocal of the depth as the initial correlation gives a larger initial correlation to the associated connecting edge of an effective path with fewer connecting edges, and a smaller initial correlation to the associated connecting edge of an effective path with more connecting edges. Normalizing the initial correlations makes the correlations of the at least one associated connecting edge sum to 1.
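Steps c1 and c2 can be illustrated with a minimal Python sketch. It assumes that the depth of an effective path is its number of connecting edges, which this passage suggests but does not define; the function name `correlations_from_depths` is hypothetical.

```python
def correlations_from_depths(depths):
    """Map effective-path depths to normalized correlations (steps c1 and c2)."""
    if not depths:
        raise ValueError("at least one effective path is required")
    if any(d <= 0 for d in depths):
        raise ValueError("depths must be positive")
    # Step c1: the reciprocal of each depth is the initial correlation.
    initial = [1.0 / d for d in depths]
    # Step c2: normalize so that the correlations sum to 1.
    total = sum(initial)
    return [value / total for value in initial]


# Assumed example: three effective paths with 3, 3 and 5 connecting edges.
correlations = correlations_from_depths([3, 3, 5])
```

For this input the correlations are 5/13, 5/13 and 3/13, so the two shorter paths contribute more than the longer one and the three values sum to 1.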
S440, determining a completion attribute value of the missing object according to the weight and the correlation of the associated connecting edge corresponding to the attribute value of the missing object.
The correlation of each associated connecting edge is taken as the weighting coefficient of that edge, and the weights of the corresponding at least one associated connecting edge are summed, each multiplied by its weighting coefficient, to obtain the corresponding completion attribute value.
It can be understood that, since the weight of a connecting edge of the bipartite graph is the corresponding non-missing object attribute value, this weighted summation of the edge weights is in effect a weighted summation of the corresponding at least one non-missing object attribute value. The completion attribute value of the missing object is thus determined based on the at least one non-missing object attribute value associated with the missing object attribute value.
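The weighted summation of step S440 can be sketched as follows. This is a minimal illustration under the assumption that the correlations have already been normalized; the function name `completion_value` and the sample numbers are hypothetical.

```python
def completion_value(edge_weights, correlations):
    """Weighted sum of associated-edge weights, using the correlations as coefficients."""
    if len(edge_weights) != len(correlations):
        raise ValueError("one correlation is required per associated connecting edge")
    return sum(w * c for w, c in zip(edge_weights, correlations))


# Assumed example: the end connecting edges of three effective paths carry the
# non-missing attribute values 4.0, 5.0 and 2.0.
edge_weights = [4.0, 5.0, 2.0]
correlations = [5 / 13, 5 / 13, 3 / 13]
value = completion_value(edge_weights, correlations)
```

Here the result is 51/13, roughly 3.92, which lies between the smallest and largest contributing attribute values because the coefficients are non-negative and sum to 1.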
Compared with the prior art, the technical scheme of this embodiment obtains the completion attribute value by a weighted summation of the weights of the corresponding at least one associated connecting edge, using the correlations as weighting coefficients. Because these edge weights are non-missing object attribute values, the completion attribute value of the missing object is determined from the at least one non-missing object attribute value associated with the missing object attribute value.
Fig. 6 is a block diagram of a data completion apparatus according to an embodiment of the present invention. As shown in fig. 6, the apparatus includes:
a bipartite graph module 11, configured to create a bipartite graph according to the object identifier, the attribute identifier, and the non-missing object attribute value in the current data table; the object identification and the attribute identification correspond to nodes in the bipartite graph, and the non-missing object attribute value corresponds to a weight of a connecting edge of the bipartite graph;
an effective path module 12, configured to determine a missing object attribute value of the current data table, and determine at least one effective path corresponding to the missing object attribute value in the bipartite graph, where the effective path is a path that meets a set weight condition, and the path includes at least three connecting edges that connect an object identifier and an attribute identifier corresponding to the missing object attribute value;
and a completion module 13, configured to determine a completion attribute value of the missing object according to the set weight of the at least one effective path.
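As an illustration of the structure the bipartite graph module works on, the following minimal Python sketch builds an adjacency map from a small table. The table layout (a dict of rows with `None` marking a missing value), the tagged node tuples and the function names `build_bipartite_graph` and `missing_cells` are assumptions made for this example and are not taken from the embodiment.

```python
def build_bipartite_graph(table):
    """Build an adjacency map from {object_id: {attribute_id: value or None}}.

    Nodes are tagged tuples ("obj", id) and ("attr", id). An edge is created only
    for a non-missing value, and that value is the edge weight.
    """
    graph = {}
    for obj_id, row in table.items():
        obj_node = ("obj", obj_id)
        graph.setdefault(obj_node, {})
        for attr_id, value in row.items():
            attr_node = ("attr", attr_id)
            graph.setdefault(attr_node, {})
            if value is None:
                continue  # a missing value produces no connecting edge
            graph[obj_node][attr_node] = value
            graph[attr_node][obj_node] = value
    return graph


def missing_cells(table):
    """List the (object_id, attribute_id) pairs whose value is missing."""
    return [(o, a) for o, row in table.items() for a, v in row.items() if v is None]


table = {
    "u1": {"a": 4.0, "b": None},
    "u2": {"a": 4.5, "b": 3.0},
}
graph = build_bipartite_graph(table)
```

In this example the missing cell (`u1`, `b`) has no direct edge, so any path from object `u1` to attribute `b` has to pass through other objects and attributes.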
In one embodiment, as shown in fig. 7, the apparatus further comprises:
a returning module 14, configured to update the current data table, and update the bipartite graph according to the updated current data table; and to return to the step of determining the missing object attribute value of the current data table and determining at least one effective path corresponding to the missing object attribute value in the bipartite graph, until no missing object attribute value exists in the current data table or no effective path exists for any missing object attribute value in the current data table.
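The loop that the returning module describes can be sketched in Python as below. The callback `estimate` is hypothetical and stands in for the path search and weighted summation; it is assumed to return `None` when a cell has no effective path. The `column_mean` estimator is only a placeholder to make the sketch runnable and is not the method of the embodiment.

```python
def complete_table(table, estimate):
    """Run completion rounds until nothing is missing or nothing more can be estimated.

    `estimate(table, object_id, attribute_id)` is a hypothetical callback returning a
    completion value, or None when the cell has no effective path.
    """
    while True:
        missing = [(o, a) for o, row in table.items()
                   for a, v in row.items() if v is None]
        if not missing:
            break  # stop: no missing object attribute value is left
        # Estimate every missing cell against the table as it stood at the start of the round.
        updates = {}
        for obj_id, attr_id in missing:
            value = estimate(table, obj_id, attr_id)
            if value is not None:
                updates[(obj_id, attr_id)] = value
        if not updates:
            break  # stop: no remaining missing value can be estimated
        for (obj_id, attr_id), value in updates.items():
            table[obj_id][attr_id] = value  # update the current data table
    return table


def column_mean(table, obj_id, attr_id):
    """Placeholder estimator (not the patented method): mean of the known column values."""
    known = [row[attr_id] for row in table.values() if row.get(attr_id) is not None]
    return sum(known) / len(known) if known else None


table = {
    "u1": {"a": 4.0, "b": None, "c": None},
    "u2": {"a": 2.0, "b": 3.0, "c": None},
}
result = complete_table(table, column_mean)
```

Because the callback receives the updated table in every round, values filled in one round are available as evidence in the next, which corresponds to updating the bipartite graph from the updated data table. Column `c` stays missing in this example, which exercises the second stop condition.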
In one embodiment, the completion module includes:
an associated connecting edge determining unit, configured to use an end connecting edge of the effective path as an associated connecting edge of the effective path, where a node of the end connecting edge corresponds to the attribute identifier corresponding to the missing object attribute value;
a correlation determining unit, configured to determine correlations of associated connection edges of the at least one effective path corresponding to the missing object attribute value, where the correlations are used to represent the correlation strengths of the corresponding associated connection edges and the missing object attribute value;
and the completion unit is used for determining the completion attribute value of the missing object according to the weight and the correlation of the associated connecting edge of the at least one effective path.
In one embodiment, the correlation determining unit is configured to determine the depth of each of the at least one effective path corresponding to the missing object attribute value, and to determine the correlation of the corresponding at least one associated connecting edge according to the depth of the at least one effective path.
In an embodiment, the correlation determining unit is specifically configured to use the reciprocal of the depth of each effective path as the initial correlation of the corresponding associated connecting edge, and to normalize the initial correlations of the at least one associated connecting edge to obtain the correlation of the at least one associated connecting edge.
In one embodiment, the completion module is configured to use the correlation of each associated connecting edge as the weighting coefficient of that edge, and to perform a weighted summation of the weights of the associated connecting edges of the at least one effective path according to these weighting coefficients, so as to obtain the completion attribute value of the missing object.
In one embodiment, the effective path module includes:
a missing value determining unit, configured to determine all missing object attribute values of the current data table, and determine, based on the bipartite graph, the number of effective paths of each missing object attribute value and the effective path confidence corresponding to that number; and to sort the confidences of all the missing object attribute values in descending order, and take the missing object attribute values corresponding to the top preset number of confidences in the sorted result as the at least one missing object attribute value of the current data table.
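The selection performed by the missing value determining unit can be sketched as a sort followed by a cut-off. This passage does not say how the confidence is derived from the number of effective paths, so the sketch takes that mapping as a parameter; using the path count itself as the confidence is a placeholder assumption, as are the function name and the choice to skip cells without any effective path.

```python
def select_cells_for_round(path_counts, preset_number, confidence=float):
    """Pick the missing cells with the highest effective-path confidence.

    `path_counts` maps (object_id, attribute_id) to the number of effective paths.
    `confidence` maps that number to a score; the default (the count itself) is a
    placeholder, not the definition used by the embodiment.
    """
    scored = [(confidence(n), cell) for cell, n in path_counts.items() if n > 0]
    # Descending sort by confidence; Python's sort is stable, so ties keep input order.
    scored.sort(key=lambda item: item[0], reverse=True)
    return [cell for _, cell in scored[:preset_number]]


path_counts = {
    ("u1", "b"): 3,
    ("u3", "a"): 1,
    ("u4", "c"): 0,  # no effective path: cannot be completed in this round
    ("u2", "c"): 5,
}
selected = select_cells_for_round(path_counts, preset_number=2)
```

With a preset number of 2, the cells with 5 and 3 effective paths are selected and the others wait for a later round.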
In one embodiment, the set weight condition is: the difference between the weights of the two adjacent connecting edges of the node corresponding to each attribute identifier is within a set threshold range.
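One way to read this condition is sketched below in Python: at every attribute node in the interior of a path, the two connecting edges that meet there must have weights whose absolute difference does not exceed a threshold. Treating "within a set threshold range" as an absolute difference bound is an assumption, as are the node tuples, the adjacency-map layout and the function name `is_effective`.

```python
def is_effective(path, graph, threshold):
    """Check the weight condition at every interior attribute node of a path.

    `path` is a list of tagged nodes such as ("obj", "u1") or ("attr", "a");
    `graph[node][neighbor]` is the weight of the connecting edge between them.
    """
    for i in range(1, len(path) - 1):
        kind, _ = path[i]
        if kind != "attr":
            continue  # only attribute nodes are checked
        weight_in = graph[path[i - 1]][path[i]]
        weight_out = graph[path[i]][path[i + 1]]
        if abs(weight_in - weight_out) > threshold:
            return False
    return True


graph = {
    ("obj", "u1"): {("attr", "a"): 4.0},
    ("obj", "u2"): {("attr", "a"): 4.5, ("attr", "b"): 3.0},
    ("attr", "a"): {("obj", "u1"): 4.0, ("obj", "u2"): 4.5},
    ("attr", "b"): {("obj", "u2"): 3.0},
}
path = [("obj", "u1"), ("attr", "a"), ("obj", "u2"), ("attr", "b")]
loose = is_effective(path, graph, threshold=1.0)
strict = is_effective(path, graph, threshold=0.2)
```

In this example objects `u1` and `u2` differ by 0.5 on attribute `a`, so the path passes with a threshold of 1.0 and fails with a threshold of 0.2. The intuition is that a path is only trusted when the objects it links have similar values on the shared attribute.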
In one embodiment, the effective path module includes:
a path determining unit, configured to determine, in the bipartite graph, an object node and an attribute node corresponding to the object identifier and the attribute identifier corresponding to the missing object attribute value, respectively, and use the object node as a starting object node and the attribute node as an ending attribute node; searching all paths connecting the starting object node and the ending attribute node in the bipartite graph, wherein any path comprises at least three connecting edges.
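The search carried out by the path determining unit can be sketched as a depth-first enumeration of simple paths. The adjacency-map layout, the tagged node tuples, the function name `find_paths` and the `max_edges` bound (added here only to keep the search finite on larger graphs) are assumptions of this example.

```python
def find_paths(graph, start, end, max_edges=5):
    """Enumerate simple paths from start to end with between 3 and max_edges edges."""
    paths = []

    def dfs(node, path):
        edges = len(path) - 1
        if node == end:
            if edges >= 3:  # a path must contain at least three connecting edges
                paths.append(list(path))
            return
        if edges >= max_edges:
            return  # do not extend the path beyond the assumed bound
        for neighbor in graph.get(node, {}):
            if neighbor not in path:  # keep the path simple (no repeated nodes)
                path.append(neighbor)
                dfs(neighbor, path)
                path.pop()

    dfs(start, [start])
    return paths


graph = {
    ("obj", "u1"): {("attr", "a"): 4.0},
    ("obj", "u2"): {("attr", "a"): 4.5, ("attr", "b"): 3.0},
    ("attr", "a"): {("obj", "u1"): 4.0, ("obj", "u2"): 4.5},
    ("attr", "b"): {("obj", "u2"): 3.0},
}
paths = find_paths(graph, start=("obj", "u1"), end=("attr", "b"))
```

For the missing cell (`u1`, `b`) the only path found is `u1` to `a` to `u2` to `b`, which has exactly three connecting edges. Its last edge carries the value of `b` for `u2`, which is the end connecting edge used for completion.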
Compared with the prior art, the technical scheme of the data completion apparatus provided by the embodiment of the invention determines the effective paths of a missing object attribute value based on the bipartite graph, which improves the flexibility of determining effective paths. Because the weights of the connecting edges of the effective paths correspond to non-missing object attribute values, determining the completion attribute value from the set weights of all the effective paths corresponding to the missing object attribute value means that the completion attribute value is derived from the association between the missing object attribute value and the non-missing object attribute values, which improves the flexibility, accuracy and universality of data completion.
According to the technical scheme of the embodiment of the invention, the data completion device provided by the embodiment of the invention can execute the data completion method provided by any embodiment of the invention, and has the corresponding functional modules and beneficial effects of the execution method.
FIG. 8 shows a schematic block diagram of an electronic device 20 that may be used to implement an embodiment of the invention. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital assistants, cellular phones, smart phones, wearable devices (e.g., helmets, glasses, watches, etc.), and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be exemplary only, and are not meant to limit implementations of the inventions described and/or claimed herein.
As shown in fig. 8, the electronic device 20 includes at least one processor 21 and a memory communicatively connected to the at least one processor 21, such as a Read Only Memory (ROM) 22 and a Random Access Memory (RAM) 23. The memory stores a computer program executable by the at least one processor, and the processor 21 may perform various suitable actions and processes according to the computer program stored in the ROM 22 or the computer program loaded from the storage unit 28 into the RAM 23. The RAM 23 may also store various programs and data necessary for the operation of the electronic device 20. The processor 21, the ROM 22, and the RAM 23 are connected to each other via a bus 24. An input/output (I/O) interface 25 is also connected to the bus 24.
A number of components in the electronic device 20 are connected to the I/O interface 25, including: an input unit 26 such as a keyboard, a mouse, etc.; an output unit 27 such as various types of displays, speakers, and the like; a storage unit 28, such as a magnetic disk, optical disk, or the like; and a communication unit 29 such as a network card, modem, wireless communication transceiver, etc. The communication unit 29 allows the electronic device 20 to exchange information/data with other devices via a computer network such as the internet and/or various telecommunication networks.
The processor 21 may be any of various general purpose and/or special purpose processing components having processing and computing capabilities. Some examples of the processor 21 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various dedicated Artificial Intelligence (AI) computing chips, various processors running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, or the like. The processor 21 performs the various methods and processes described above, such as the data completion method.
In some embodiments, the data completion method may be implemented as a computer program tangibly embodied on a computer-readable storage medium, such as storage unit 28. In some embodiments, part or all of the computer program may be loaded and/or installed onto the electronic device 20 via the ROM 22 and/or the communication unit 29. When the computer program is loaded into the RAM 23 and executed by the processor 21, one or more steps of the data completion method described above may be performed. Alternatively, in other embodiments, the processor 21 may be configured to perform the data completion method by any other suitable means (e.g., by means of firmware).
Various implementations of the systems and techniques described above may be realized in digital electronic circuitry, integrated circuitry, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), Systems on a Chip (SOCs), Complex Programmable Logic Devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various implementations may include implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, receiving data and instructions from, and transmitting data and instructions to, a storage system, at least one input device, and at least one output device.
A computer program for implementing the methods of the present invention may be written in any combination of at least one programming language. These computer programs may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the computer programs, when executed by the processor, cause the functions/acts specified in the flowchart and/or block diagram block or blocks to be performed. A computer program can execute entirely on a machine, partly on the machine, as a stand-alone software package, partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of the present invention, a computer-readable storage medium may be a tangible medium that can contain, or store a computer program for use by or in connection with an instruction execution system, apparatus, or device. A computer readable storage medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. Alternatively, the computer readable storage medium may be a machine readable signal medium. More specific examples of a machine-readable storage medium would include an electrical connection based on at least one wire, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on an electronic device having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user can provide input to the electronic device. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: Local Area Networks (LANs), Wide Area Networks (WANs), blockchain networks, and the Internet.
The computing system may include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server can be a cloud server, also called a cloud computing server or a cloud host, and is a host product in a cloud computing service system, so that the defects of high management difficulty and weak service expansibility in the traditional physical host and VPS service are overcome.
It should be understood that various forms of the flows shown above may be used, with steps reordered, added, or deleted. For example, the steps described in the present invention may be executed in parallel, sequentially, or in different orders, and are not limited herein as long as the desired results of the technical solution of the present invention can be achieved.
An embodiment of the present invention further provides a computer program product, including a computer program, where the computer program, when executed by a processor, implements the data completion method provided in any embodiment of the present application.
The computer program code of the computer program product for carrying out operations of the present invention may be written in any combination of one or more programming languages, including object-oriented programming languages such as Java, Smalltalk and C++, and conventional procedural programming languages such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any type of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider).
It is to be noted that the foregoing is only illustrative of the preferred embodiments of the present invention and the technical principles employed. It will be understood by those skilled in the art that the present invention is not limited to the particular embodiments illustrated herein, but is capable of various obvious changes, rearrangements and substitutions as will now become apparent to those skilled in the art without departing from the scope of the invention. Therefore, although the present invention has been described in greater detail by the above embodiments, the present invention is not limited to the above embodiments, and may include other equivalent embodiments without departing from the spirit of the present invention, and the scope of the present invention is determined by the scope of the appended claims.

Claims (17)

1. A data completion method, comprising:
creating a bipartite graph according to the object identification, the attribute identification and the non-missing object attribute value in the current data table; the object identification and the attribute identification correspond to nodes in the bipartite graph, and the non-missing object attribute value corresponds to a weight of a connecting edge of the bipartite graph;
determining a missing object attribute value of the current data table, and determining at least one effective path corresponding to the missing object attribute value in the bipartite graph, wherein the effective path is a path meeting a set weight condition and comprises at least three connecting edges for connecting an object identifier corresponding to the missing object attribute value and an attribute identifier;
and determining a completion attribute value of the missing object according to the set weight of the at least one effective path.
2. The method according to claim 1, wherein after determining the completion attribute value of the missing object according to the set weight of the at least one effective path, the method further comprises:
updating the current data table, and updating the bipartite graph according to the updated current data table; and returning to the step of determining the missing object attribute value of the current data table and determining at least one effective path corresponding to the missing object attribute value in the bipartite graph, until no missing object attribute value exists in the current data table or no effective path exists for any missing object attribute value in the current data table.
3. The method according to claim 1, wherein the determining the completion attribute value of the missing object according to the set weight of the at least one effective path comprises:
taking the end connecting edge of the effective path as the associated connecting edge of the effective path, wherein one node of the end connecting edge corresponds to the attribute identifier corresponding to the missing object attribute value;
respectively determining the correlation degree of the associated connecting edge of the at least one effective path corresponding to the missing object attribute value, wherein the correlation degree is used for representing the association strength of the corresponding associated connecting edge and the missing object attribute value;
and determining a completion attribute value of the missing object according to the weight and the correlation of the associated connecting edge of the at least one effective path.
4. The method according to claim 3, wherein the determining the correlation of the associated connecting edge of the at least one effective path corresponding to the missing object attribute value respectively comprises:
determining the depth of the at least one effective path corresponding to the attribute value of the missing object respectively;
and determining the correlation degree of the corresponding at least one associated connecting edge according to the depth of the at least one effective path.
5. The method according to claim 4, wherein determining the correlation of the corresponding at least one associated connecting edge according to the depth of the at least one effective path comprises:
taking the reciprocal of the depth of each effective path as the initial correlation degree of the corresponding associated connecting edge;
and normalizing the initial correlation degree of the at least one associated connecting edge to obtain the correlation degree of the at least one associated connecting edge.
6. The method according to claim 3, wherein the determining the completion attribute value of the missing object according to the weight and the correlation of the associated connecting edge of the at least one effective path comprises:
taking the correlation of each associated connecting edge as the weighting coefficient of that associated connecting edge;
and performing, according to the weighting coefficients, a weighted summation of the weights of the associated connecting edges of the at least one effective path, so as to obtain the completion attribute value of the missing object.
7. The method of claim 1, wherein determining the missing object attribute value for the current data table comprises:
determining all missing object attribute values of the current data table, and determining the effective path number of each missing object attribute value and the effective path confidence corresponding to the effective path number based on the bipartite graph;
and sorting the confidences of all the missing object attribute values in descending order, and taking the missing object attribute values corresponding to the top preset number of confidences in the sorted result as the at least one missing object attribute value of the current data table.
8. The method of claim 1, wherein the set weight condition is:
the difference between the weights of the two adjacent connecting edges of the node corresponding to each attribute identifier is within a set threshold range.
9. The method according to any one of claims 1-8, wherein determining a path corresponding to the missing object attribute value comprises:
determining an object node and an attribute node which respectively correspond to the object identifier and the attribute identifier corresponding to the attribute value of the missing object in the bipartite graph, and taking the object node as a starting object node and the attribute node as an ending attribute node;
searching all paths connecting the starting object node and the ending attribute node in the bipartite graph, wherein any path comprises at least three connecting edges.
10. A data completion apparatus, comprising:
the bipartite graph module is used for creating a bipartite graph according to the object identification, the attribute identification and the non-missing object attribute value in the current data table; the object identification and the attribute identification correspond to nodes in the bipartite graph, and the non-missing object attribute value corresponds to a weight of a connecting edge of the bipartite graph;
an effective path module, configured to determine a missing object attribute value of the current data table, and determine at least one effective path corresponding to the missing object attribute value in the bipartite graph, where the effective path is a path that meets a set weight condition, and the path includes at least three connecting edges that connect an object identifier and an attribute identifier corresponding to the missing object attribute value;
and the completion module is used for determining the completion attribute value of the missing object according to the set weight of the at least one effective path.
11. The apparatus of claim 10, further comprising:
the return module is used for updating the current data table and updating the bipartite graph according to the updated current data table; and returning to the step of determining the missing object attribute value of the current data table and determining at least one effective path corresponding to the missing object attribute value in the bipartite graph, until no missing object attribute value exists in the current data table or no effective path exists for any missing object attribute value in the current data table.
12. The apparatus of claim 10, wherein the completion module comprises:
an associated connecting edge determining unit, configured to use an end connecting edge of the effective path as an associated connecting edge of the effective path, where a node of the end connecting edge corresponds to the attribute identifier corresponding to the missing object attribute value;
a correlation determining unit, configured to determine the correlation of the associated connection edge of the at least one effective path corresponding to the missing object attribute value, where the correlation is used to indicate the strength of association between the corresponding associated connection edge and the missing object attribute value;
and the completion unit is used for determining the completion attribute value of the missing object according to the weight and the correlation of the associated connecting edge of the at least one effective path.
13. The apparatus according to claim 12, wherein the correlation determining unit is configured to:
determine the depth of each of the at least one effective path corresponding to the missing object attribute value;
and determine the correlation of the corresponding at least one associated connecting edge according to the depth of the at least one effective path.
14. The apparatus of claim 10, wherein the effective path module comprises:
a missing attribute value determination unit, configured to determine all missing object attribute values of the current data table, and determine, based on the bipartite graph, an effective path number of each missing object attribute value and an effective path confidence corresponding to the effective path number;
and to sort the confidences of all the missing object attribute values in descending order, and take the missing object attribute values corresponding to the top preset number of confidences in the sorted result as the at least one missing object attribute value of the current data table.
15. An electronic device, characterized in that the electronic device comprises:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores a computer program executable by the at least one processor to enable the at least one processor to perform the data completion method of any one of claims 1-9.
16. A computer-readable storage medium having stored thereon computer instructions for causing a processor to execute the data completion method of any one of claims 1-9.
17. A computer program product, characterized in that the computer program product comprises a computer program which, when executed by a processor, implements the data completion method according to any one of claims 1-9.
CN202211193733.2A 2022-09-28 2022-09-28 Data complementing method, device, equipment, storage medium and product Pending CN115630053A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211193733.2A CN115630053A (en) 2022-09-28 2022-09-28 Data complementing method, device, equipment, storage medium and product

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211193733.2A CN115630053A (en) 2022-09-28 2022-09-28 Data complementing method, device, equipment, storage medium and product

Publications (1)

Publication Number Publication Date
CN115630053A (en) 2023-01-20

Family

ID=84903972

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211193733.2A Pending CN115630053A (en) 2022-09-28 2022-09-28 Data complementing method, device, equipment, storage medium and product

Country Status (1)

Country Link
CN (1) CN115630053A (en)

Similar Documents

Publication Publication Date Title
CN113904943B (en) Account detection method and device, electronic equipment and storage medium
CN114037059A (en) Pre-training model, model generation method, data processing method and data processing device
CN114860411B (en) Multi-task learning method, device, electronic equipment and storage medium
CN115186738B (en) Model training method, device and storage medium
CN116228301A (en) Method, device, equipment and medium for determining target user
CN115630053A (en) Data complementing method, device, equipment, storage medium and product
CN112860626A (en) Document sorting method and device and electronic equipment
CN114037058B (en) Pre-training model generation method and device, electronic equipment and storage medium
CN114037057B (en) Pre-training model generation method and device, electronic equipment and storage medium
CN115511014B (en) Information matching method, device, equipment and storage medium
CN114866437B (en) Node detection method, device, equipment and medium
CN116127948B (en) Recommendation method and device for text data to be annotated and electronic equipment
CN116244413B (en) New intention determining method, apparatus and storage medium
CN115859300A (en) Vulnerability detection method and device, electronic equipment and storage medium
CN117993478A (en) Model training method and device based on bidirectional knowledge distillation and federal learning
CN118227580A (en) Log analysis method and device, electronic equipment and storage medium
CN114942996A (en) Triple construction method and device of vertical industry data, electronic equipment and medium
CN118170924A (en) Knowledge base evaluation method and device and electronic equipment
CN116304075A (en) Target person matching method, device, equipment and medium based on knowledge graph
CN114418123A (en) Model noise reduction method and device, electronic equipment and storage medium
CN118051670A (en) Service recommendation method, device, equipment and medium
CN114936802A (en) Index system construction method and device, storage medium and electronic equipment
CN117591576A (en) Overlapping community dividing method, device, equipment and medium based on node similarity
CN117933353A (en) Reinforced learning model training method and device, electronic equipment and storage medium
CN117033235A (en) Method, device, equipment and storage medium for testing relevance of software program

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination