CN116975543A

CN116975543A - Data processing method, device, electronic equipment and computer readable storage medium

Info

Publication number: CN116975543A
Application number: CN202310588569.3A
Authority: CN
Inventors: 石志林
Original assignee: Shenzhen Tencent Computer Systems Co Ltd
Current assignee: Shenzhen Tencent Computer Systems Co Ltd
Priority date: 2023-05-23
Filing date: 2023-05-23
Publication date: 2023-10-31

Abstract

The embodiment of the invention discloses a data processing method, a data processing device and a computer readable storage medium. After a data set to be processed is obtained, an attribute network is constructed by taking the attribute subsets of the attribute set corresponding to the data set as nodes. Based on the attribute network, statistical values of the attribute value combinations corresponding to the nodes are counted in the data set to be processed, to obtain a data label of at least one candidate attribute subset among the attribute subsets. The statistical value of a preset attribute value combination in the data set to be processed is then predicted according to the data label, a target attribute subset is screened out from the candidate attribute subsets based on the predicted statistical value, and the attribute data corresponding to each attribute in the data set to be processed is adjusted according to the data label of the target attribute subset to obtain a target data set. The scheme can improve the accuracy of data processing. The embodiment of the invention can be applied to various scenarios such as cloud technology, artificial intelligence, intelligent traffic and assisted driving.

Description

Data processing method, device, electronic equipment and computer readable storage medium

Technical Field

The present application relates to the field of data processing, and in particular, to a data processing method, apparatus, electronic device, and computer readable storage medium.

Background

In recent years, with the rapid development of internet technology, data-driven algorithms and decision systems have been used more and more widely. Because of differences in acquisition methods and user distribution, some data have an unreasonable feature distribution, so that a model or decision system trained on these features shows serious deviation or unfairness in its decisions. To mitigate data misuse and reduce algorithmic bias, the data needs to be processed. Statistics of feature selection patterns (attribute value combinations) are at the core of describing the features in data, particularly when determining data suitability and eliminating data bias. Therefore, existing data processing methods often construct different attribute value combinations in a data set, count statistical information for those combinations, and finally process the data set based on the statistical information, so as to obtain a data set with higher accuracy and reliability.

In the course of research and practice of the prior art, the inventors of the present application found that although statistics of individual attribute values may be stored in some data set descriptions, the number of attribute value combinations is usually very large, so storing statistics for every attribute value combination tends to consume a large amount of computing resources. Under limited computing power resources this is often impractical, which results in lower accuracy of data processing.

Disclosure of Invention

The embodiment of the invention provides a data processing method, a data processing device, electronic equipment and a computer readable storage medium, which can improve the accuracy of data processing.

A data processing method, comprising:

acquiring a data set to be processed, wherein the data set to be processed comprises attribute data corresponding to each attribute in an attribute set;

constructing an attribute network by taking an attribute subset in the attribute set as a node, wherein the attribute network indicates the inclusion relation among the nodes;

based on the attribute network, statistics values of attribute value combinations corresponding to the nodes are counted in the data set to be processed, so that data tags of at least one candidate attribute subset in the attribute subsets are obtained;

predicting the statistical value of a preset attribute value combination in the data set to be processed according to the data tag, wherein an attribute subset corresponding to the preset attribute value combination comprises the candidate attribute subset;

and screening a target attribute subset from the candidate attribute subsets based on the prediction statistic value, and adjusting the attribute data according to the data label of the target attribute subset to obtain a target data set.

Accordingly, an embodiment of the present invention provides a data processing apparatus, including:

an acquisition unit, configured to acquire a data set to be processed, where the data set to be processed includes attribute data corresponding to each attribute in an attribute set;

a building unit, configured to build an attribute network with an attribute subset in the attribute set as a node, where the attribute network indicates a containment relationship between the nodes;

the statistical unit is used for counting the statistical value of the attribute value combination corresponding to the node in the data set to be processed based on the attribute network so as to obtain the data tag of at least one candidate attribute subset in the attribute subsets;

the predicting unit is used for predicting the statistical value of a preset attribute value combination in the data set to be processed according to the data tag, and an attribute subset corresponding to the preset attribute value combination comprises the candidate attribute subset;

and the screening unit is used for screening a target attribute subset from the candidate attribute subsets based on the prediction statistic value, and adjusting the attribute data according to the data label of the target attribute subset to obtain a target data set.

Optionally, in some embodiments, the statistics unit may be specifically configured to sort each attribute in the attribute set, and add, based on a result of the sorting, each attribute as an element to an initial query queue to obtain a query queue; screening at least one candidate node from the nodes according to the attribute network and the query queue; and counting candidate statistic values of attribute value combinations corresponding to the candidate nodes in the data set to be processed to obtain data labels of at least one candidate attribute subset in the attribute subsets.

Optionally, in some embodiments, the statistics unit may be specifically configured to screen a target element from the query queue based on the sorting result; identifying a target node corresponding to the target element in the attribute network; traversing at least one sub-node corresponding to the target node in the attribute network to obtain the candidate node, wherein the attribute subset corresponding to the target node is contained in the attribute subset corresponding to the sub-node.

Optionally, in some embodiments, the statistics unit may be specifically configured to calculate, in the data set to be processed, a candidate statistical value of an attribute value combination corresponding to the candidate node; screening at least one candidate attribute subset from the attribute subsets based on the candidate statistics; and determining the data label of the candidate attribute subset according to the statistical value corresponding to the candidate attribute subset.

Optionally, in some embodiments, the statistics unit may be specifically configured to screen, from the attribute subsets, at least one attribute subset whose candidate statistical value does not exceed a preset number threshold, to obtain an initial candidate attribute subset; add the initial candidate attribute subset as an element to the query queue, and delete the target element from the query queue to obtain an updated query queue; and take the updated query queue as the query queue and return to the step of screening at least one candidate node from the nodes according to the attribute network and the query queue, until no further initial candidate attribute subset exists, and take the initial candidate attribute subsets as the candidate attribute subsets.
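The queue-driven screening described in the preceding paragraphs can be sketched roughly as follows. This is a minimal illustration, not the claimed implementation. It assumes that the "candidate statistical value" of a node is the number of distinct attribute value combinations observed for that attribute subset, that attributes are sorted by name, and that child nodes add exactly one attribute. The names `count_combinations`, `search_candidate_subsets` and `max_combinations`, as well as the sample rows, are hypothetical.

```python
from collections import deque


def count_combinations(rows, subset):
    """Number of distinct attribute value combinations observed for `subset`."""
    return len({tuple(row[a] for a in sorted(subset)) for row in rows})


def search_candidate_subsets(rows, attributes, max_combinations):
    """Walk the attribute network breadth-first, pruned by a size threshold."""
    order = sorted(attributes)                    # assumed sort key: attribute name
    queue = deque(frozenset([a]) for a in order)  # initial query queue
    seen = set(queue)
    candidates = []
    while queue:
        target = queue.popleft()                  # target element, removed from the queue
        for extra in order:
            if extra in target:
                continue
            child = target | {extra}              # child node contains the target node
            if child in seen:
                continue
            seen.add(child)
            if count_combinations(rows, child) <= max_combinations:
                queue.append(child)               # initial candidate attribute subset
                candidates.append(child)
    return candidates


# Hypothetical sample data, not taken from Table 1.
rows = [
    {"working_years": "<2", "job_level": "first", "enterprise": "A"},
    {"working_years": "<2", "job_level": "first", "enterprise": "B"},
    {"working_years": "2-5", "job_level": "second", "enterprise": "A"},
    {"working_years": "2-5", "job_level": "second", "enterprise": "B"},
]
candidates = search_candidate_subsets(
    rows, ["working_years", "job_level", "enterprise"], max_combinations=2
)
```

Because adding an attribute can only keep or increase the number of distinct combinations, a subset that exceeds the threshold is not expanded further in this sketch.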

Optionally, in some embodiments, the statistics unit may be specifically configured to screen a combination statistics value corresponding to each attribute value combination in the candidate attribute subset from the candidate statistics values, to obtain a combination statistics value set; counting attribute statistics values of attribute values corresponding to each attribute in the data set to be processed to obtain an attribute statistics value set; and constructing the data labels of the candidate attribute subsets based on the combined statistic set and the attribute statistic set.
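As an illustration of the data label described above, the sketch below stores two things for a candidate attribute subset: the counts of its attribute value combinations (the combined statistic set) and the per-attribute value counts over the whole data set (the attribute statistic set). The dictionary layout, the function name `build_data_label` and the sample rows are assumptions made for this example only.

```python
from collections import Counter


def build_data_label(rows, subset):
    """Build a label for `subset`; `rows` is assumed to be a non-empty list of dicts."""
    key = sorted(subset)
    # Combined statistic set: count of every observed value combination of `subset`.
    combination_counts = Counter(tuple(row[a] for a in key) for row in rows)
    # Attribute statistic set: count of every value of every attribute.
    attribute_counts = {
        attr: Counter(row[attr] for row in rows) for attr in rows[0]
    }
    return {
        "attributes": key,
        "combination_counts": combination_counts,
        "attribute_counts": attribute_counts,
    }


# Hypothetical sample data, not taken from Table 1.
rows = [
    {"working_years": "<2", "job_level": "first", "enterprise": "A"},
    {"working_years": "<2", "job_level": "first", "enterprise": "B"},
    {"working_years": "2-5", "job_level": "second", "enterprise": "A"},
]
label = build_data_label(rows, {"working_years", "job_level"})
```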

Optionally, in some embodiments, the prediction unit may be specifically configured to determine, based on the preset attribute value combination, a target attribute value combination and a target attribute value corresponding to the candidate attribute subset, where the target attribute value includes at least one attribute value in the preset attribute value combination except for the target attribute value combination; extracting a target combination statistical value corresponding to the target attribute value combination and a target attribute statistical value set corresponding to the target attribute value from the data tag; and predicting the statistical value of the preset attribute combination based on the target combination statistical value and the target attribute statistical value set to obtain the predicted statistical value.

Optionally, in some embodiments, the prediction unit may be specifically configured to compare the preset attribute value combination with an attribute value combination corresponding to the candidate attribute subset; screening a target attribute value combination containing attribute values in the preset attribute value combination from the attribute value combination based on a comparison result; and screening at least one attribute value except the attribute value in the target attribute value combination from the preset attribute value combination to obtain a target attribute value.

Optionally, in some embodiments, the prediction unit may be specifically configured to screen a statistics value corresponding to the target attribute value from the target attribute statistics value set to obtain a target attribute statistics value; fusing the statistical values in the target attribute statistical value set to obtain a fused attribute statistical value corresponding to the target attribute value; and determining a ratio between the target attribute statistic value and the corresponding fusion attribute statistic value, and fusing the ratio with the target combination statistic value to obtain the prediction statistic value.
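One possible reading of the prediction step above is sketched below. It assumes that "fusing" the statistics in the target attribute statistic set means summing them, and that "fusing" the resulting ratio with the target combination statistic means multiplying; the source does not fix either operator. The label layout, the function name `predict_count` and the numbers are hypothetical.

```python
def predict_count(pattern, label):
    """Estimate the count of `pattern` (attribute -> value) from a data label."""
    key = label["attributes"]
    # Target attribute value combination: the part of the pattern covered by the label.
    target_combination = tuple(pattern[a] for a in key)
    estimate = float(label["combination_counts"].get(target_combination, 0))
    for attr, value in pattern.items():
        if attr in key:
            continue
        counts = label["attribute_counts"][attr]
        fused = sum(counts.values())             # assumed fusion: sum of the value counts
        estimate *= counts.get(value, 0) / fused  # ratio fused by multiplication (assumed)
    return estimate


# Hypothetical label for the candidate subset {job_level, working_years}.
label = {
    "attributes": ["job_level", "working_years"],
    "combination_counts": {("first", "<2"): 2, ("second", "2-5"): 2},
    "attribute_counts": {"enterprise": {"A": 3, "B": 1}},
}
pattern = {"working_years": "<2", "job_level": "first", "enterprise": "A"}
predicted = predict_count(pattern, label)  # 2 * (3 / 4)
```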

Optionally, in some embodiments, the screening unit may be specifically configured to query the attribute network for an inclusion relationship between the candidate attribute subsets; when no inclusion relationship exists, obtain a labeling statistical value corresponding to the preset attribute value combination; compare the labeling statistical value with the prediction statistical value to obtain a prediction loss corresponding to each candidate attribute subset; and screen the target attribute subset from the candidate attribute subsets according to the prediction loss.

Optionally, in some embodiments, the screening unit may be specifically configured to determine ratios between the labeling statistical value and the prediction statistical value, taking the labeling statistical value and the prediction statistical value as the numerator in turn, to obtain a ratio pair; screen out a target ratio from the ratio pair to obtain the prediction error corresponding to the candidate attribute subset; and fuse the prediction errors to obtain the prediction loss corresponding to the candidate attribute subset.
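The ratio-based error above can be sketched as follows for the case in which no inclusion relationship exists between the candidate attribute subsets. Several choices here are assumptions that the source leaves open: the target ratio is taken to be the larger of the two ratios, the errors are fused by averaging, the subset with the smallest loss is selected, and values are floored at 1 to avoid division by zero. The function names and the numbers are hypothetical.

```python
def prediction_error(labeled, predicted, floor=1.0):
    """Larger of the two ratios labeled/predicted and predicted/labeled (assumed)."""
    labeled = max(labeled, floor)      # floor avoids division by zero (assumption)
    predicted = max(predicted, floor)
    ratio_pair = (labeled / predicted, predicted / labeled)
    return max(ratio_pair)


def prediction_loss(pairs):
    """Fuse the per-pattern errors by averaging them (assumed fusion)."""
    errors = [prediction_error(labeled, predicted) for labeled, predicted in pairs]
    return sum(errors) / len(errors)


# Hypothetical (labeled, predicted) counts for two candidate attribute subsets.
losses = {
    "S1": prediction_loss([(4, 2.0), (3, 6.0), (5, 5.0)]),
    "S2": prediction_loss([(4, 4.0), (3, 2.0), (5, 5.0)]),
}
target_subset = min(losses, key=losses.get)  # smallest loss wins (assumption)
```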

Optionally, in some embodiments, the screening unit may be specifically configured to, when the inclusion relationship exists, classify the candidate attribute subsets based on the inclusion relationship to obtain a first attribute subset and a second attribute subset, where the first attribute subset is contained in the second attribute subset; determine an error relationship in which the prediction error of the second attribute subset is smaller than the prediction error of the first attribute subset; and screen the target attribute subset from the candidate attribute subsets based on the error relationship.

Optionally, in some embodiments, the acquisition unit may be specifically configured to determine, based on the attribute type of each attribute, an attribute relationship between the attributes in the attribute set; and when no correlation exists between the attributes, count the current statistical value of the attribute value corresponding to each attribute in the data set to be processed, and adjust the attribute data based on the current statistical value to obtain the target data set. In this case, constructing the attribute network by taking the attribute subset in the attribute set as a node comprises: when a correlation exists between the attributes, constructing the attribute network by taking the attribute subset in the attribute set as a node.

Optionally, in some embodiments, the data processing apparatus may further include an adjustment unit, where the adjustment unit may be specifically configured to obtain a candidate attribute value combination, where the candidate attribute value combination includes a plurality of candidate attribute values; extract a first statistical value of each candidate attribute value and a second statistical value of the attribute corresponding to that candidate attribute value from the current statistical values; determine a statistical value ratio between the first statistical value and the second statistical value, and fuse the current statistical values to obtain a current fused statistical value; and fuse the current fused statistical value with the statistical value ratio to obtain a statistical value corresponding to the candidate attribute value combination, and adjust the attribute data based on the statistical value corresponding to the candidate attribute value combination to obtain the target data set.

In addition, the embodiment of the invention also provides electronic equipment, which comprises a processor and a memory, wherein the memory stores application programs, and the processor is used for running the application programs in the memory to realize the data processing method provided by the embodiment of the invention.

In addition, the embodiment of the invention also provides a computer readable storage medium, which stores a plurality of instructions, wherein the instructions are suitable for being loaded by a processor to execute the steps in any data processing method provided by the embodiment of the invention.

In addition, the embodiment of the application also provides a computer program product, which comprises a computer program or instructions, and the computer program or instructions realize the steps in the data processing method provided by the embodiment of the application when being executed by a processor.

According to the embodiment of the invention, after a data set to be processed is obtained, an attribute network is constructed by taking the attribute subsets of the attribute set corresponding to the data set as nodes. Then, based on the attribute network, statistical values of the attribute value combinations corresponding to the nodes are counted in the data set to be processed to obtain a data label of at least one candidate attribute subset among the attribute subsets. Next, the statistical value of a preset attribute value combination in the data set to be processed is predicted according to the data label, where the attribute subset corresponding to the preset attribute value combination contains the candidate attribute subset. A target attribute subset is then screened out from the candidate attribute subsets based on the predicted statistical value, and the attribute data corresponding to each attribute in the data set to be processed is adjusted according to the data label of the target attribute subset to obtain the target data set. Because the scheme constructs the attribute network, uses it as a guide to screen the target attribute subset out of the attribute subsets, and uses the statistical values of the limited number of attribute value combinations in the target attribute subset as the data label for adjusting the attribute data, data suitability can be determined more accurately and data deviation can be eliminated under limited computing power resources, thereby improving the accuracy of data processing.

Drawings

In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings that are needed in the description of the embodiments will be briefly described below, it being obvious that the drawings in the following description are only some embodiments of the present invention, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.

FIG. 1 is a schematic diagram of a scenario of a data processing method according to an embodiment of the present invention;

FIG. 2 is a flow chart of a data processing method according to an embodiment of the present invention;

FIG. 3 is a network schematic diagram of an attribute network provided by an embodiment of the present invention;

FIG. 4 is another flow chart of a data processing method according to an embodiment of the present invention;

FIG. 5 is a schematic diagram of a data processing apparatus according to an embodiment of the present invention;

FIG. 6 is a schematic diagram of another structure of a data processing apparatus according to an embodiment of the present invention;

FIG. 7 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.

Detailed Description

The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to fall within the scope of the invention.

The embodiment of the application provides a data processing method, a data processing device, electronic equipment and a computer readable storage medium. The data processing device may be integrated in an electronic device, which may be a server or a device such as a terminal.

The server may be an independent physical server, a server cluster or distributed system formed by a plurality of physical servers, or a cloud server that provides basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, content delivery network (CDN) services, and big data and artificial intelligence platforms. The terminal may be, but is not limited to, a smart phone, a tablet computer, a notebook computer, a desktop computer, a smart speaker, a smart watch, etc. The terminal and the server may be directly or indirectly connected through wired or wireless communication, and the present application is not limited herein.

For example, referring to FIG. 1, take the case in which the data processing apparatus is integrated in an electronic device. After the electronic device obtains a data set to be processed, it constructs an attribute network by taking the attribute subsets of the attribute set corresponding to the data set as nodes. Based on the attribute network, it then counts statistical values of the attribute value combinations corresponding to the nodes in the data set to be processed to obtain a data label of at least one candidate attribute subset among the attribute subsets. According to the data label, it predicts the statistical value of a preset attribute value combination in the data set to be processed, where the attribute subset corresponding to the preset attribute value combination contains the candidate attribute subset. Based on the predicted statistical value, it screens out a target attribute subset from the candidate attribute subsets, and adjusts the attribute data corresponding to each attribute in the data set to be processed according to the data label of the target attribute subset to obtain the target data set, thereby improving the accuracy of data processing.

The data processing method provided by the embodiment of the application relates to the machine learning direction of artificial intelligence. The embodiment of the application can process the data in the data set to be processed to obtain target data from which data deviation has been eliminated; training data associated with the target data can then be used for model training, so that the deviation of a machine learning model or decision system can be corrected quickly.

Artificial intelligence (AI) is the theory, method, technology and application system that uses a digital computer or a machine controlled by a digital computer to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge and use knowledge to obtain optimal results. In other words, artificial intelligence is a comprehensive technology of computer science that attempts to understand the essence of intelligence and to produce a new kind of intelligent machine that can react in a way similar to human intelligence. Artificial intelligence studies the design principles and implementation methods of various intelligent machines, so that the machines have the functions of perception, reasoning and decision making.

Artificial intelligence is a comprehensive discipline that covers a wide range of fields, including both hardware-level and software-level technologies. Artificial intelligence infrastructure technologies generally include sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technologies, operation/interaction systems, mechatronics, and the like. Artificial intelligence software technologies mainly include computer vision, speech processing, natural language processing, and machine learning/deep learning.

Machine learning (ML) is a multi-disciplinary field involving probability theory, statistics, approximation theory, convex analysis, algorithm complexity theory and other disciplines. It studies how a computer can simulate or implement human learning behavior in order to acquire new knowledge or skills and reorganize existing knowledge structures to continuously improve its own performance. Machine learning is the core of artificial intelligence and the fundamental way to make computers intelligent, and it is applied throughout the various fields of artificial intelligence. Machine learning and deep learning typically include techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning and learning from demonstration.

It will be appreciated that, in the specific embodiment of the present application, related data such as attribute data, attribute sets, attribute subsets, etc. are referred to, and when the following embodiments of the present application are applied to specific products or technologies, permission or agreement is required, and collection, use, and processing of related data are required to comply with related laws and regulations and standards of related countries and regions.

Detailed descriptions are given below. The order in which the following embodiments are described is not intended to limit the preferred order of the embodiments.

The present embodiment will be described from the point of view of a data processing apparatus, which may be integrated in an electronic device, and the electronic device may be a server or a device such as a terminal; the terminal may include a tablet computer, a notebook computer, a personal computer (PC), a wearable device, a virtual reality device, or other devices capable of performing data processing.

A data processing method, comprising:

acquiring a data set to be processed, wherein the data set to be processed comprises attribute data corresponding to each attribute in an attribute set; constructing an attribute network by taking the attribute subsets in the attribute set as nodes, wherein the attribute network indicates the inclusion relation among the nodes; counting, based on the attribute network, statistical values of the attribute value combinations corresponding to the nodes in the data set to be processed, so as to obtain a data label of at least one candidate attribute subset among the attribute subsets; predicting the statistical value of a preset attribute value combination in the data set to be processed according to the data label, wherein the attribute subset corresponding to the preset attribute value combination contains the candidate attribute subset; and screening a target attribute subset from the candidate attribute subsets based on the predicted statistical value, and adjusting the attribute data according to the data label of the target attribute subset to obtain a target data set.
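The attribute network in the method above can be pictured as the subset lattice of the attribute set. The following is a minimal sketch, assuming that every non-empty attribute subset is a node and that an edge links a subset to each superset with exactly one additional attribute; the function name `build_attribute_network` and the example attribute names are hypothetical.

```python
from itertools import combinations


def build_attribute_network(attributes):
    """Return {node: [child nodes]}, where each node is a frozenset of attributes.

    Assumption: a child is a superset with exactly one extra attribute,
    so the edges encode the inclusion relation between nodes.
    """
    attrs = sorted(attributes)
    nodes = [
        frozenset(combo)
        for size in range(1, len(attrs) + 1)
        for combo in combinations(attrs, size)
    ]
    network = {node: [] for node in nodes}
    for node in nodes:
        for extra in attrs:
            if extra not in node:
                network[node].append(node | {extra})
    return network


network = build_attribute_network(["enterprise_name", "job_level", "working_years"])
```

For n attributes this lattice has 2^n - 1 nodes, which is why a real system would likely generate child nodes lazily during the search rather than materialise the whole network.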

As shown in FIG. 2, the specific flow of the data processing method is as follows:

101. Acquire a data set to be processed.

The data set to be processed may include attribute data corresponding to each attribute in the attribute set. The attribute set may include a plurality of attributes of the data in the data set to be processed. An attribute is a basic variable that describes a characteristic of the data in a data set, typically some aspect or feature of the data, such as the height of an object or the name of an enterprise. Attributes may be used in data analysis operations such as statistical analysis, classification and clustering. The attribute data corresponding to an attribute is the data recorded for that attribute in the data set to be processed; for example, when the attribute is an identifier, the attribute data may be the specific identification data of an object.

The manner of acquiring the data set to be processed may be various, and specifically may be as follows:

For example, the data set to be processed uploaded by a terminal may be obtained directly. Alternatively, attribute data of a plurality of attributes may be screened out from a database to obtain the data set to be processed. Alternatively, at least one training data set may be obtained, a target training data set that may be at risk of data offset is screened out from the training data sets, and the target training data set is taken as the data set to be processed. Alternatively, a plurality of attribute data of at least one object may be obtained from a network or a database to obtain the data set to be processed. Alternatively, when the data set to be processed occupies a large amount of memory or contains a large amount of attribute data, a data processing request carrying a storage address of the data set to be processed may be received, and the data set to be processed is obtained based on the storage address.

After the data set to be processed is obtained, determining attribute relationships among attributes in the attribute set based on attribute types of the attributes, counting current statistical values of attribute values corresponding to each attribute in the data set to be processed when no correlation exists among the attributes, and adjusting the attribute data based on the current statistical values to obtain a target data set; when there is a correlation between attributes, an attribute network is constructed with a subset of attributes in the attribute set as nodes.

The attribute relationship may be either that a correlation exists between the attributes or that no correlation exists between them. The correlation characterizes how the attributes are associated with each other. For example, taking working years and job level as the attributes, the longer the working years, the higher or more senior the corresponding job level tends to be; in this case a correlation exists between the two attributes. The attribute relationship between the attributes in the attribute set can be determined based on the attribute types in various ways. For example, the attribute type of each attribute in the attribute set may be determined, and the target attribute corresponding to the attribute type may be screened out from a preset associated attribute set; when the target attribute exists in the attribute set, it is determined that a correlation exists between the attributes in the attribute set, and when the target attribute does not exist in the attribute set, it is determined that no correlation exists. Alternatively, attribute information such as the attribute type and the attribute identifier may be acquired, and an association relationship identification model may be used to identify, based on the attribute information, associated attributes in the attribute set; when associated attributes exist, it is determined that a correlation exists between the attributes, and when no associated attribute exists, it is determined that no correlation exists.

When no correlation exists between the attributes, the current statistical value of the attribute value corresponding to each attribute can be counted in the data set to be processed, and the attribute data is adjusted based on the current statistical value to obtain the target data set. The current statistical value may include statistics such as the number of times or the frequency with which an attribute value occurs in the data set to be processed. An attribute value is a possible value of an attribute in the data set to be processed. The current statistical value can be counted in various ways; for example, the attribute data corresponding to each attribute in the data set to be processed may be traversed, and the number of times or the frequency with which each attribute value occurs in the attribute data may be counted to obtain the current statistical value. Taking the data shown in Table 1 as the data set to be processed and the number of occurrences as the statistical value, the current statistical value may be 9 when the attribute value is enterprise name = enterprise B, and may likewise be 9 when the attribute value is enterprise name = enterprise A.

TABLE 1

After the current statistical value is counted, the attribute data can be adjusted based on it to obtain the target data set. The adjustment can be carried out in various manners. For example, a candidate attribute value combination may be obtained; a first statistical value of each candidate attribute value and a second statistical value of the attribute corresponding to that candidate attribute value are extracted from the current statistical values; the statistical value ratio between the first statistical value and the second statistical value is determined; the current statistical values are fused to obtain a current fusion statistical value; the current fusion statistical value is fused with the statistical value ratios to obtain the statistical value corresponding to the candidate attribute value combination; and the attribute data is adjusted based on that statistical value to obtain the target data set.

The candidate attribute value combination may include a plurality of candidate attribute values, and may be at least one attribute value combination used to evaluate the data quality and reliability of the data set to be processed. An attribute value combination (which may also be referred to as an attribute combination) is a combination of different attribute values in the data set to be processed. In data analysis, characteristic information of the data set, such as data distribution and correlation, may be obtained through attribute value combinations. In data labeling and classification, attribute value combinations are also important label information that can be used to detect data bias and algorithmic bias in predictive models, and so on. An attribute value combination may be regarded as a set of attribute values, also referred to as a pattern (p). A pattern refers to a set of data in a database made up of different combinations of attribute values. These combinations can be used to describe specific situations in the data, such as the value range of a certain attribute or the relationships between multiple attributes, and also help in understanding the data better and in discovering the patterns and rules it contains. Let $D$ be a data set to be processed with attributes $A = \{A_1, \ldots, A_n\}$, and let $Dom(A_i)$ be the attribute value space of $A_i$, where $i \in [1 \ldots n]$. A pattern is then $p = \{A_{i_1} = a_1, \ldots, A_{i_k} = a_k\}$, where $\{A_{i_1}, \ldots, A_{i_k}\} \subseteq A$ and $a_j \in Dom(A_{i_j})$ for each $A_{i_j}$ in $p$. $Attr(p)$ may be used to denote the set of attributes in $p$. Taking the data in the data set to be processed as Table 1 as an example, when the pattern is p = {working years = less than 2 years, job level = first level}, the corresponding attribute set is Attr(p) = {working years, job level}.

The candidate attribute value combination can be obtained in various manners. For example, a candidate attribute value combination uploaded by a terminal may be obtained directly, or at least one attribute value combination for evaluating the data quality and reliability of the data set to be processed may be determined based on the data type of the data set to be processed, thereby obtaining the candidate attribute value combination.

After the candidate attribute value combination is obtained, a first statistical value of each candidate attribute value and a second statistical value of the attribute corresponding to that candidate attribute value can be extracted from the current statistical values. For example, taking the candidate attribute value area = area C, the first statistical value $c_D(\{\text{area} = \text{area C}\})$ may be 6, and the second statistical value may be the sum (18) of the statistical values of all attribute values of the attribute area.

After the first statistical value and the second statistical value are extracted, the statistical value ratio between them may be determined. For example, taking the first statistical value as 6 and the second statistical value as 18, the corresponding statistical value ratio may be 6/18 = 1/3. The statistical value ratio corresponding to each candidate attribute value in the candidate attribute value combination can be determined in the same way.

In some embodiments, the current statistical values may also be fused to obtain the current fusion statistical value, which can be regarded as the total number of all attribute value combinations in the data set to be processed. For example, given $n$ binary attributes $A_1, A_2, \ldots, A_n$ for which each value combination $(b_1, \ldots, b_n)$, $b_i \in \{0, 1\}$, occurs exactly once, the statistics over all patterns cover $2^n$ pieces of data, that is, the current fusion statistical value may be $2^n$.

After the statistical value ratios and the current fusion statistical value are determined, they can be fused to obtain the statistical value corresponding to the candidate attribute value combination. The fusion can be performed in various manners; for example, the current fusion statistical value can be multiplied by the statistical value ratio corresponding to each candidate attribute value in the combination. Taking value combinations with $b_i \in \{0, 1\}$ and the candidate attribute value combination (pattern) $p = \{A_{i_1} = b_{i_1}, \ldots, A_{i_k} = b_{i_k}\}$ as an example, the statistical value of the candidate attribute value combination may be as shown in formula (1):

$$C_D(p) = |D| \cdot \prod_{j=1}^{k} \frac{c_D(\{A_{i_j} = b_{i_j}\})}{c_D(\{A_{i_j} = 0\}) + c_D(\{A_{i_j} = 1\})} \tag{1}$$

where $C_D(p)$ is the statistical value of the candidate attribute value combination (pattern $p$), $D$ is the data set to be processed, $|D|$ is the current fusion statistical value corresponding to the data set to be processed, $c_D(\{A_{i_j} = b_{i_j}\})$ is the first statistical value of the attribute value $b_{i_j}$, and $c_D(\{A_{i_j} = 0\}) + c_D(\{A_{i_j} = 1\})$ is the second statistical value of the corresponding attribute.

For example, taking the candidate attribute value combination $p = \{A_1 = 0, A_2 = 0, A_3 = 0\}$ and assuming that the statistical values of the individual attribute values are known, the statistical value of the candidate attribute value combination can be estimated by referring to formula (1), as shown in formula (2):

$$C_D(p) = |D| \cdot \prod_{i=1}^{3} \frac{c_D(\{A_i = 0\})}{c_D(\{A_i = 0\}) + c_D(\{A_i = 1\})} \tag{2}$$

where $C_D(p)$ is the statistical value of the candidate attribute value combination, $c_D(\{A_i = 0\})$ is the first statistical value corresponding to the candidate attribute value, and $c_D(\{A_i = 0\}) + c_D(\{A_i = 1\})$ is the second statistical value of the corresponding attribute.
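As a minimal sketch of the estimate described above (the current fusion statistical value multiplied by one statistical value ratio per candidate attribute value), the following may be considered. The function name `estimate_count`, the nested-dictionary layout of `value_counts` and the example counts are assumptions made for illustration; the sum over all values of an attribute generalizes the two-valued denominator used for binary attributes.

```python
def estimate_count(pattern, value_counts, total):
    """Estimate the count of `pattern` as total * product of (first / second) ratios.

    pattern:      {attribute: value} for the candidate attribute value combination
    value_counts: {attribute: {value: count}} of single-attribute statistics
    total:        current fusion statistical value |D|
    """
    estimate = float(total)
    for attribute, value in pattern.items():
        first = value_counts[attribute][value]          # first statistical value
        second = sum(value_counts[attribute].values())  # second statistical value
        estimate *= first / second
    return estimate

# Hypothetical counts for three binary attributes over a data set with |D| = 8.
value_counts = {"A1": {0: 4, 1: 4}, "A2": {0: 2, 1: 6}, "A3": {0: 4, 1: 4}}
print(estimate_count({"A1": 0, "A2": 0, "A3": 0}, value_counts, total=8))
```

The estimate is exact only when the attributes are independent of one another, which is why this manner is tied to the case in which no correlation exists between the attributes.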

After the current fusion statistical value and the statistical value ratios are fused, the attribute data can be adjusted based on the resulting statistical value of the candidate attribute value combination to obtain the target data set. The target data set may then be a data set from which data deviation has been eliminated. The attribute data may be adjusted in various manners. For example, the adjustment manner may be determined based on the data type of the data set to be processed: when the data set to be processed is training data, data correction is performed on it based on the statistical value corresponding to the candidate attribute value combination to obtain the target data set; when the data set to be processed is financial data, data cleaning is performed on it to obtain the target data set; and when the data set to be processed is an object recommendation data set, at least one target object corresponding to a target attribute value combination is screened out from it based on the statistical value corresponding to the candidate attribute value combination to obtain the target data set.

When the data set to be processed is training data, data correction can be performed in various manners. For example, data deviation evaluation may be performed on the data set to be processed based on the statistical value corresponding to the candidate attribute value combination, and when data deviation exists, downsampling, oversampling or data augmentation may be applied to the data set to be processed to obtain the target data set.
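One possible form of the oversampling mentioned above is sketched below, under the assumption that a data deviation simply means that some attribute value combinations occur less often than the most frequent one. The function `oversample`, its `seed` parameter and the example rows are hypothetical and are not taken from the embodiment.

```python
import random
from collections import defaultdict

def oversample(rows, attributes, seed=0):
    """Duplicate randomly chosen rows of rarer patterns until every pattern
    over `attributes` occurs as often as the most frequent one."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for row in rows:
        groups[tuple(row[a] for a in attributes)].append(row)
    target = max(len(group) for group in groups.values())
    balanced = []
    for group in groups.values():
        balanced.extend(group)
        balanced.extend(rng.choice(group) for _ in range(target - len(group)))
    return balanced

# Hypothetical training rows: "first" occurs three times, "second" once.
rows = [{"job_level": "first"}] * 3 + [{"job_level": "second"}]
balanced = oversample(rows, ["job_level"])
print(len(balanced))
```

Downsampling would follow the same grouping step but would trim each group to the size of the smallest one instead.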

After data correction, the target data set can be used as target training data and the model can be trained on it, so that the model learns the target training data better and the estimation deviation of the model is reduced.

When the data set to be processed is financial data, it can be cleaned in various manners. For example, abnormal data may be identified in the data set to be processed based on the statistical values corresponding to the candidate attribute value combinations, and the abnormal data may be cleaned from the data set to obtain the target data set. For example, the abnormal data may include object clusters with abnormal interaction behavior, and these abnormal object clusters are deleted from the data set to be processed to obtain the target data set.

When the data set to be processed is an object recommendation data set, at least one target object corresponding to the target attribute value combination can be screened out in various manners based on the statistical value corresponding to the candidate attribute value combination. For example, at least one piece of target attribute data whose candidate attribute value combination has a statistical value exceeding a preset statistical value threshold may be screened out from the data set to be processed, and at least one target object may be identified in the target attribute data to obtain the target data set. For example, taking the data set to be processed as a data set of historical object interaction behavior, the click-through rates or conversion rates of clients with different attributes are predicted from the statistical values, at least one target attribute value combination whose click-through rate or conversion rate is higher than a preset threshold is selected, and the target objects corresponding to the target attribute value combination are identified in the data set to be processed to obtain the target data set.
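The threshold-based screening described above can be illustrated with the following sketch. The function `select_target_objects`, the record layout and the click-through rates are hypothetical values chosen only to show the filtering step.

```python
def select_target_objects(rows, attributes, pattern_stats, threshold):
    """Keep the rows whose value combination over `attributes` has a
    statistic strictly above `threshold`."""
    targets = {p for p, stat in pattern_stats.items() if stat > threshold}
    return [row for row in rows if tuple(row[a] for a in attributes) in targets]

# Hypothetical interaction records and predicted click-through rate per pattern.
rows = [
    {"user": "u1", "area": "A", "job_level": "first"},
    {"user": "u2", "area": "B", "job_level": "second"},
    {"user": "u3", "area": "A", "job_level": "first"},
]
ctr = {("A", "first"): 0.12, ("B", "second"): 0.03}
selected = select_target_objects(rows, ["area", "job_level"], ctr, threshold=0.05)
print([row["user"] for row in selected])
```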

After the target objects corresponding to the target attribute value combination are screened out from the data set to be processed, the content to be recommended can be obtained and recommended to each target object in the target data set.

The content to be recommended can take various forms, for example advertisements, commodities or other recommendable content.

In some embodiments, when a correlation exists between the attributes, the attribute subsets in the attribute set may be used as nodes to construct an attribute network, as described in more detail below.

102. Construct an attribute network by taking the attribute subsets in the attribute set as nodes.

The attribute network may indicate the inclusion relationships between the nodes. An inclusion relationship may be either containing or being contained in. For example, if the attribute set includes an attribute subset $S_1$ and an attribute subset $S_2$ and $S_1$ is a subset of $S_2$, then $S_1$ is contained in $S_2$ ($S_1 \subset S_2$) or, equivalently, $S_2$ contains $S_1$ ($S_2 \supset S_1$). The nodes in the attribute network may be all the attribute subsets of the attribute set, and an edge in the attribute network may be written as $\{S_1, S_2\}$.

The attribute network may be constructed with the attribute subsets in the attribute set as nodes in various manners, specifically as follows:

For example, all attribute subsets may be extracted from the attribute set, the inclusion relationships between the attribute subsets may be determined based on the elements of the attribute subsets, and a graph network may be constructed with the attribute subsets as nodes based on the inclusion relationships to obtain the attribute network.

The attribute network may also be referred to as a label network, that is, the network may guide the screening of the optimal target attribute subset from the attribute subsets and the determination of the label of the target attribute subset. The attribute network may be a graph network. Given a data set $D$ with attribute set $A$, let $A^*$ be the set of all possible subsets of $A$; the attribute network of $D$ is then a graph $G = (V, E)$ with $V = A^*$, where $E$ is the set of edges $\{S_1, S_2\}$ between nodes that have an inclusion relationship. In addition, if there is an edge $\{S_1, S_2\}$ with $S_1 \subset S_2$ (or $S_2 \supset S_1$), and $S_2$ can be obtained from $S_1$ by adding one attribute $a \in A \setminus S_1$ (that is, $a$ belongs to the set $A$ minus the set $S_1$), then $S_1$ is the parent attribute node (i.e., parent node) of $S_2$. Taking the data set D as the data corresponding to Table 1, and denoting the enterprise name by g, the working years by a, the area by r and the job level by m, the attribute network corresponding to the data set D may be as shown in fig. 3.
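A minimal sketch of constructing such an attribute network is given below. It assumes that the empty subset is included as a node and that an edge is stored only between a parent node and a child node that differ by exactly one attribute; both choices, and the name `build_attribute_network`, are assumptions made for illustration.

```python
from itertools import combinations

def build_attribute_network(attributes):
    """Return (nodes, edges): nodes are all subsets of `attributes`;
    each edge (parent, child) adds exactly one attribute to the parent."""
    nodes = [frozenset(c)
             for k in range(len(attributes) + 1)
             for c in combinations(attributes, k)]
    edges = [(s1, s1 | {a}) for s1 in nodes for a in attributes if a not in s1]
    return nodes, edges

# g = enterprise name, a = working years, r = area, m = job level
nodes, edges = build_attribute_network(["g", "a", "r", "m"])
print(len(nodes), len(edges))
```

For four attributes this enumerates every subset explicitly; for larger attribute sets the number of nodes doubles with each added attribute, so generating child nodes on demand is usually preferable to materializing the whole graph.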

103. Based on the attribute network, count the statistical values of the attribute value combinations corresponding to the nodes in the data set to be processed, to obtain the data label of at least one candidate attribute subset among the attribute subsets.

The data label may be a form of information used to describe the statistics of the different patterns (attribute value combinations) in the data set to be processed; put simply, it can be understood as the statistical values of the different attribute value combinations in the data set to be processed.

A candidate attribute subset may be understood as an attribute subset that may be used to estimate or evaluate risk deviations in the data set to be processed. The data label of a candidate attribute subset may include the set (PC) of statistical values of the different attribute value combinations corresponding to the candidate attribute subset, and may further include the set (VC) of statistical values of each individual attribute value in the data set to be processed.

The statistical values of the attribute value combinations corresponding to the nodes may be counted in the data set to be processed based on the attribute network in various manners, specifically as follows:

For example, the attributes in the attribute set may be sorted and, based on the sorting result, each attribute is added as an element to a query queue; at least one candidate node is screened out from the nodes according to the attribute network and the query queue; and the candidate statistical values of the attribute value combinations corresponding to the candidate nodes are counted in the data set to be processed to obtain the data label of at least one candidate attribute subset among the attribute subsets. Specifically:

S1. Sort the attributes in the attribute set, and add each attribute as an element to an initial query queue based on the sorting result to obtain the query queue.

The initial query queue may be an initialized query queue that does not yet contain any element, that is, the query queue before any attribute has been added to it as an element based on the sorting result. A query queue can be understood as a queue of nodes to be queried (parent/child nodes).

The manner of sorting each attribute in the attribute set may be various, and specifically may be as follows:

For example, the attributes in the attribute set may be sorted based on the importance of each attribute in the attribute set to obtain the sorting result corresponding to the attribute set, or may be sorted based on the frequency of each attribute in the attribute set to obtain the sorting result, and so on.

After the attributes in the attribute set are sorted, each attribute may be added as an element to the initial query queue to obtain the query queue. For example, if the attribute set includes the four attributes g (enterprise name), a (working years), r (area) and m (job level) and the sorting result is (g, a, r, m), the four attributes can be added as elements to the initial query queue in that order, and the resulting query queue may be [{g}, {a}, {r}, {m}].

S2. Screen out at least one candidate node from the nodes according to the attribute network and the query queue.

For example, based on the sorting result, the target element may be screened out from the query queue, the target node corresponding to the target element may be identified in the attribute network, and at least one child node corresponding to the target node may be traversed in the attribute network to obtain the candidate node.

The method for screening the target element in the query queue based on the sorting result may be various, for example, a first element may be screened out in the query queue to obtain the target element, or an element corresponding to a preset sorting position may be screened out in the query queue to obtain the target element.

After the target element is screened out from the query queue, the target node corresponding to the target element can be identified in the attribute network in various manners. For example, taking the target element {g} as an example, the node whose attribute subset is {g} is identified in the attribute network to obtain the target node.

After the target node is identified, at least one child node corresponding to the target node may be traversed in the attribute network to obtain the candidate nodes. The attribute subset corresponding to the target node is contained in the attribute subset corresponding to each child node. The child nodes may be traversed in various manners. For example, the attributes other than the target attributes in the attribute subset corresponding to the target node can be screened out from the attribute set, and the nodes corresponding to the attribute subsets composed of the target attributes and any one or more of the other attributes are traversed in the attribute network to obtain at least one child node, which is used as a candidate node.

The nodes corresponding to the attribute subsets composed of the target attributes and any one or more of the other attributes can be traversed in the attribute network in various manners; for example, a gen() operator can be used to traverse at least one child node of the target node in the attribute network to obtain the candidate nodes.

For the gen() operator, assume that the data set $D$ to be processed has the attributes $A = \{A_1, \ldots, A_n\}$ and that the attributes are ordered. For a given attribute subset $S \subseteq A$, let $idx(S)$ denote the largest attribute index in $S$, i.e., $idx(S) = \max\{i \mid A_i \in S\}$. gen(S) may then be defined as shown in formula (3):

$$gen(S) = \{S \cup \{A_j\} \mid idx(S) < j \le n\} \tag{3}$$

Here, gen(S) is the set of attribute subsets, generated by the gen() operator, that correspond to child nodes of the target node, and $S$ is the attribute subset corresponding to the target node. For a given attribute subset $S$, $gen(S) \subseteq child(S)$, where $child(S)$ is the set of attribute subsets corresponding to all child nodes of $S$ in the attribute network of the data set $D$ to be processed. For example, taking the data of the data set D to be processed as the data of Table 1 and the attribute subset S = {enterprise name, area}, gen(S) may be {{enterprise name, area, job level}}; {enterprise name, working years, area} is also a child node of S in the attribute network, but its attribute subset is not included in gen(S).
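The behavior of the gen() operator in the example above can be sketched as follows. The rule used here, adding only attributes ranked after the highest-ranked attribute already in S, is inferred from the worked examples in this description; the list `order` and the attribute identifiers are assumptions made for illustration.

```python
def gen(subset, ordered_attributes):
    """Return the child subsets obtained by adding one attribute that is
    ranked after idx(subset), the highest-ranked attribute in `subset`."""
    idx = max((ordered_attributes.index(a) for a in subset), default=-1)
    return [subset | {a} for a in ordered_attributes[idx + 1:]]

# Assumed ordering (g, a, r, m) of the four attributes.
order = ["enterprise", "working_years", "area", "job_level"]
children = gen(frozenset({"enterprise", "area"}), order)
print(children)
```

Because each subset is reachable only from the parent that lacks its highest-ranked attribute, a top-down traversal built on this operator produces every node at most once.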

When at least one child node of the target node is traversed in the attribute network, the nodes are naturally scanned from top to bottom. The traversed nodes do not require an explicit representation, because child nodes can be generated from their parent nodes as needed.

S3. Count the candidate statistical values of the attribute value combinations corresponding to the candidate nodes in the data set to be processed, to obtain the data label of at least one candidate attribute subset among the attribute subsets.

For example, candidate statistics values of attribute value combinations corresponding to candidate nodes can be counted in the data set to be processed, at least one candidate attribute subset is screened out from the attribute subsets based on the candidate statistics values, and data tags of the candidate attribute subset are determined according to the statistics values corresponding to the candidate attribute subset, specifically, the following can be adopted:

(1) Count the candidate statistical values of the attribute value combinations corresponding to the candidate nodes in the data set to be processed.

For example, the number of data or the number of occurrences corresponding to the attribute value combinations corresponding to the candidate nodes may be counted in the data set to be processed, so as to obtain corresponding candidate statistic values, or the frequency of the attribute value combinations corresponding to the candidate nodes may be counted in the data set to be processed, so as to obtain corresponding candidate statistic values, or the like.

Taking the data in the data set to be processed as the data in Table 1, the statistical value as the number of pieces of data having an attribute value combination, and the candidate attribute value combination {enterprise name = enterprise A, working years = less than 2 years} as an example, the number of pieces of data having this combination in the data set to be processed may be 3, so the candidate statistical value of the candidate attribute value combination may be 3.
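The counting of candidate statistical values for one candidate node can be sketched as a group-by over the attributes of that node. Table 1 is not reproduced in this text, so the 18 rows below are a hypothetical stand-in; `rows` and `pattern_counts` are names introduced for this sketch.

```python
from collections import Counter
from itertools import product

# Hypothetical stand-in for Table 1 (assumption: 18 rows, working years tied to job level).
rows = [
    {"enterprise": e, "area": r, "job_level": m,
     "working_years": "<2" if m == "first" else "2-5"}
    for e, r, m in product(["A", "B"], ["A", "B", "C"], ["first", "second", "third"])
]

def pattern_counts(rows, attributes):
    """Return {pattern: count}, where a pattern is the tuple of values a row
    takes on `attributes`, in the order given."""
    return Counter(tuple(row[a] for a in attributes) for row in rows)

pc = pattern_counts(rows, ["enterprise", "working_years"])
print(pc[("A", "<2")])  # rows with enterprise A and less than 2 working years
print(len(pc))          # number of distinct attribute value combinations
```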

(2) At least one candidate attribute subset is screened out of the attribute subsets based on the candidate statistics.

For example, at least one attribute subset whose number of candidate statistical values does not exceed a preset number threshold may be screened out from the attribute subsets to obtain an initial candidate attribute subset. The initial candidate attribute subset is added to the query queue as an element and the target element is deleted from the query queue to obtain an updated query queue. The updated query queue is then used as the query queue, and the process returns to the step of screening out at least one candidate node from the nodes according to the attribute network and the query queue, until no initial candidate attribute subset exists; the initial candidate attribute subsets obtained are used as the candidate attribute subsets.

The preset number threshold may be a preset limit $B_S$: the number of statistical values of the different attribute value combinations contained in a candidate attribute subset does not exceed $B_S$. Each data set to be processed may correspond to one limit $B_S$. Taking the data of the data set to be processed as the data of Table 1 and the attribute subset {g (enterprise name), a (working years)} as an example, the attribute value combinations corresponding to the attribute subset may include {enterprise name = enterprise A, working years = less than 2 years}, {enterprise name = enterprise B, working years = less than 2 years}, {enterprise name = enterprise A, working years = 2-5 years} and {enterprise name = enterprise B, working years = 2-5 years}. The number of statistical values for the attribute subset {g, a} may therefore be 4. If the preset number threshold is 5, the attribute subset {g, a} may be an initial candidate attribute subset and is added to the candidate attribute subset list (cands); if the preset number threshold is 3, the attribute subset {g, a} is not an initial candidate attribute subset.

After the initial candidate attribute subset is screened out from the attribute subsets, it can be added to the query queue as an element, and the target element can be deleted from the query queue to obtain the updated query queue. For example, taking the query queue [{g}, {a}, {r}, {m}] and the initial candidate attribute subset {g, a}, {g, a} is appended to the query queue, giving [{g}, {a}, {r}, {m}, {g, a}], and the target element, that is, the parent node {g} of {g, a}, is deleted from the query queue, so that the updated query queue Q is [{a}, {r}, {m}, {g, a}].

After the query queue is updated, the updated query queue is used as the query queue, which means that one iteration has ended. The process then returns to the step of screening out at least one candidate node from the nodes according to the attribute network and the query queue, until no initial candidate attribute subset exists, and the initial candidate attribute subsets obtained are used as the candidate attribute subsets. Taking the updated query queue Q = [{a}, {r}, {m}, {g, a}], the data of Table 1 as the data set to be processed, and the limit $B_S$ as an example: in the next iteration, {a} is extracted from Q as the target element and gen({a}) = {{a, r}, {a, m}}. The number of candidate statistical values corresponding to {a, r} is counted as 3 and the number corresponding to {a, m} is counted as 6, so in this iteration {a, r} can be an initial candidate attribute subset and is added to cands, which becomes [{g, a}, {a, r}]. In the following iterations no other attribute subset has a number of statistical values within the limit, that is, no further initial candidate attribute subset exists and no other attribute subset can generate a data label of a suitable size. Therefore, after all elements in Q have been extracted, the loop terminates, the final cands may be [{g, a}, {a, r}], and {g, a} and {a, r} can be taken as the candidate attribute subsets.
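The iteration over the query queue described in steps S1 to S3 can be sketched as below. The rows are a hypothetical stand-in for Table 1 (which is not reproduced in this text), so the result on this stand-in need not match the worked example above; `label_size`, `candidate_subsets` and the bound of 5 are assumptions made for illustration.

```python
from collections import deque
from itertools import product

# Hypothetical stand-in for Table 1 (assumption: 18 rows, working years tied to job level).
rows = [
    {"enterprise": e, "area": r, "job_level": m,
     "working_years": "<2" if m == "first" else "2-5"}
    for e, r, m in product(["A", "B"], ["A", "B", "C"], ["first", "second", "third"])
]

def label_size(rows, subset):
    """Number of distinct attribute value combinations over `subset`."""
    return len({tuple(row[a] for a in sorted(subset)) for row in rows})

def candidate_subsets(rows, ordered_attributes, bound):
    """Top-down search: keep every generated subset whose label size is <= bound."""
    def gen(subset):
        idx = max(ordered_attributes.index(a) for a in subset)
        return [subset | {a} for a in ordered_attributes[idx + 1:]]

    queue = deque(frozenset({a}) for a in ordered_attributes)  # S1: initial queue
    cands = []
    while queue:
        target = queue.popleft()               # S2: take and remove the target element
        for child in gen(target):              # S2: children as candidate nodes
            if label_size(rows, child) <= bound:   # S3: check against the limit
                cands.append(child)
                queue.append(child)            # expand this subset in a later iteration
    return cands

order = ["enterprise", "working_years", "area", "job_level"]
cands = candidate_subsets(rows, order, bound=5)
print(cands)
```

On this stand-in data and with a bound of 5, the search keeps {enterprise, working_years} and {working_years, job_level}; children of a subset that already exceeds the bound are never generated, which is what keeps the traversal small.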

It should be noted that there may be one or more candidate attribute subsets. When there are a plurality of candidate attribute subsets, the optimal target attribute subset needs to be screened out from them; when there is only one candidate attribute subset, it may be used directly as the target attribute subset.

When the nodes of the attribute network are traversed from top to bottom with the gen() operator, the algorithm generates each node at most once, that is, the initial candidate attribute subset corresponding to a given node is output by the gen() operator only once. Furthermore, the generated nodes are limited to attribute subsets of the attribute set of the data set to be processed whose number of statistical values (label size) is within the given limit $B_S$ and, in the worst case, the child nodes of such subsets.

(3) Determine the data labels of the candidate attribute subsets according to the statistical values corresponding to the candidate attribute subsets.

For example, the combination statistical value corresponding to each attribute value combination of the candidate attribute subset may be screened out from the candidate statistical values to obtain a combination statistical value set; the attribute statistical value of the attribute values corresponding to each attribute is counted in the data set to be processed to obtain an attribute statistical value set; and the data label of the candidate attribute subset is constructed based on the combination statistical value set and the attribute statistical value set.

The combination statistical value set (PC) may include the statistical value corresponding to each attribute value combination of the candidate attribute subset. For example, taking the data in the data set to be processed as the data in Table 1 and the candidate attribute subset {enterprise name, working years}, the attribute value combinations corresponding to the candidate attribute subset may include the four combinations {enterprise name = enterprise A, working years = less than 2 years}, {enterprise name = enterprise B, working years = less than 2 years}, {enterprise name = enterprise A, working years = 2-5 years} and {enterprise name = enterprise B, working years = 2-5 years}, and the combination statistical value set may include the statistical values corresponding to these 4 attribute value combinations.

The attribute statistical values may include the statistical values, in the data set to be processed, of the different attribute values of each single attribute in the attribute set. For example, taking the data set to be processed as the data in Table 1, the attribute set may be {enterprise name, working years, area, job level}, and the attribute statistical values may be the statistical values of the different attribute values of enterprise name, working years, area and job level. The corresponding attribute statistical value set (VC) may include the statistical values of the different attribute values of enterprise name, of working years, of area and of job level, and so on.

After the combination statistical value set and the attribute statistical value set are obtained, the data label of the candidate attribute subset may be constructed from them. The data label may be constructed in various manners; for example, the two sets may be directly fused to obtain the data label.

Given a data set $D$ with attribute set $A = \{A_1, A_2, \ldots, A_n\}$, the data label is defined for a subset $S$ of $A$. It contains a pattern count (PC) for each possible pattern (attribute value combination) on $S$ and a value count (VC) for each attribute value appearing in $D$. Given an attribute subset $S \subseteq A$, $P_S$ denotes the set of all possible patterns on $S$ (i.e., patterns $p$ with $Attr(p) = S$) such that $c_D(p) > 0$, where $c_D(p)$ can be understood as the statistical value (i.e., the number of occurrences) of pattern $p$ in data set $D$. The maximum number of patterns in $P_S$ may be $\prod_{A_i \in S} |Dom(A_i)|$, where $Dom(A_i)$ is the attribute value space, which can be understood as the range of values of the attribute, for example $Dom(A_i) = \{0, 1\}$.

If the candidate attribute subset is $S$ with $S \subseteq A$, the data label corresponding to $S$ may be $L_S(D)$. The data label $L_S(D)$ may include the set $PC = \{(p_i, c_D(p_i)) \mid p_i \in P_S\}$ and may also include the set $VC = \{(\{A_i = a_j\}, c_D(\{A_i = a_j\})) \mid A_i \in A, a_j \in Dom(A_i)\}$. For example, taking the data set D to be processed as the data corresponding to Table 1 and the candidate attribute subset S = {working years, job level}, the data label of the candidate attribute subset S may include the following:

PC = {({working years = less than 2 years, job level = first level}, 6), ({working years = 2-5 years, job level = second level}, 6), ({working years = 2-5 years, job level = third level}, 6)};

vc= { ({ business name=business a, 9), ({ business name=business B, 9), ({ business year=less than 2 years, 6), ({ business year=2-5 years, 12), ({ area=area a, 6), ({ area=area B, 6), ({ area=area C, 6), ({ job grade=first grade, 6), ({ job grade=second grade, 6), ({ job grade=third grade, 6) }.

For a given data set D to be processed, the VC set is the same in the data labels of all candidate attribute subsets, because the statistic of each attribute value depends only on D; only the PC set differs. For example, where the data set D to be processed includes the data of Table 1 and the candidate attribute subset is S' = {business name, working years}, the VC of S' is the same as the VC of the candidate attribute subset S above, and the PC of S' may include the following:

PC = { ({business name = business A, working years = less than 2 years}, 3), ({business name = business B, working years = less than 2 years}, 3), ({business name = business A, working years = 2-5 years}, 6), ({business name = business B, working years = 2-5 years}, 6) }.
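The construction of the PC and VC sets can be sketched in Python as follows. This is only a minimal sketch under stated assumptions: the 18 rows are a hypothetical stand-in built so that their counts reproduce the PC and VC sets quoted above (they are not the actual rows of Table 1), and `build_label` is an illustrative name rather than part of the described method.

```python
from collections import Counter
from itertools import product

ATTRS = ("business name", "working years", "area", "job level")

# Hypothetical stand-in for Table 1: one row per (business, level, area),
# with working years derived from the job level so the quoted counts hold.
YEARS_BY_LEVEL = {"first": "<2", "second": "2-5", "third": "2-5"}
ROWS = [
    {"business name": b, "working years": YEARS_BY_LEVEL[lvl],
     "area": ar, "job level": lvl}
    for b, lvl, ar in product(("A", "B"), ("first", "second", "third"),
                              ("A", "B", "C"))
]

def build_label(rows, subset):
    """Return (PC, VC): pattern counts on `subset`, value counts on all attributes."""
    pc = Counter(tuple((a, r[a]) for a in subset) for r in rows)
    vc = Counter((a, r[a]) for r in rows for a in ATTRS)
    return pc, vc

pc, vc = build_label(ROWS, ("working years", "job level"))
pc2, vc2 = build_label(ROWS, ("business name", "working years"))
```

Only patterns that actually occur in the rows receive a PC entry, which matches the requirement that $c_D(p) > 0$; the VC set is identical for both subsets.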

104. Predict the statistical value of a preset attribute value combination in the data set to be processed according to the data label.

The attribute subset corresponding to the preset attribute value combination includes the candidate attribute subset; that is, the candidate attribute subset may be a subset of the attribute subset corresponding to the preset attribute value combination. For example, if the attribute set corresponding to the data set to be processed is $A$, the candidate attribute subset is $S_1$, and the attribute subset corresponding to the preset attribute value combination is $S_2$, then $S_1 \subseteq S_2 \subseteq A$.

The manner of predicting the statistical value of the preset attribute value combination in the data set to be processed according to the data tag may be various, and may specifically be as follows:

For example, a target attribute value combination and a target attribute value corresponding to the candidate attribute subset may be determined based on the preset attribute value combination; a target combination statistical value corresponding to the target attribute value combination and a target attribute statistical value set corresponding to the target attribute value are extracted from the data tag; and the statistical value of the preset attribute value combination is predicted based on the target combination statistical value and the target attribute statistical value set, so as to obtain a predicted statistical value.

The target attribute value combination may include some or all of the attribute values in the preset attribute value combination; that is, every attribute value in the target attribute value combination belongs to the preset attribute value combination. For example, where the preset attribute value combination is {business name = business B, working years = less than 2 years, job level = second level} and the candidate attribute subset is {business name, working years}, the target attribute value combination may be {business name = business B, working years = less than 2 years}. The target attribute value includes at least one attribute value of the preset attribute value combination that is not in the target attribute value combination. In the same example, the target attribute value may be {job level = second level}.

There are various ways to determine, based on the preset attribute value combination, the target attribute value combination and the target attribute value corresponding to the candidate attribute subset. For example, the preset attribute value combination may be compared with the attribute value combinations corresponding to the candidate attribute subset; based on the comparison result, the target attribute value combination consisting of attribute values in the preset attribute value combination is screened out, and at least one attribute value of the preset attribute value combination other than those in the target attribute value combination is screened out as the target attribute value. Alternatively, the current set of attribute value combinations corresponding to the candidate attribute subset may be determined, the attribute value combination that is a subset of the preset attribute value combination is screened out of the current set as the target attribute value combination, and at least one attribute value outside the target attribute value combination is screened out of the preset attribute value combination as the target attribute value.
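The split of a preset attribute value combination into a target attribute value combination and the remaining target attribute values can be illustrated with a short Python sketch. The function name `split_pattern` and the dictionary representation of a pattern are assumptions made for illustration only.

```python
def split_pattern(pattern, candidate_subset):
    """Split a preset pattern by membership of each attribute in the candidate subset.

    Returns (target combination, target attribute values): the first part
    holds the attribute values covered by the candidate subset, the second
    part holds the remaining attribute values of the pattern.
    """
    target_combo = {a: v for a, v in pattern.items() if a in candidate_subset}
    target_values = {a: v for a, v in pattern.items() if a not in candidate_subset}
    return target_combo, target_values

preset = {"business name": "B", "working years": "<2", "job level": "second"}
combo, rest = split_pattern(preset, {"business name", "working years"})
```

With the example above, `combo` holds the business name and working years values and `rest` holds the job level value.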

After the target attribute value combination and the target attribute value are determined, a target combination statistical value corresponding to the target attribute value combination and a target attribute statistical value set corresponding to the target attribute value can be extracted from the data tag. There are various ways to extract them; for example, the combination statistical value corresponding to the target attribute value combination can be extracted from the PC set of the data tag to obtain the target combination statistical value, and the statistical values of the attribute to which the target attribute value belongs can be extracted from the VC set of the data tag to obtain the target attribute statistical value set. For example, where the target attribute value is {job level = second level}, the statistical values of the different attribute values of the attribute {job level}, namely {job level = first level}, {job level = second level} and {job level = third level}, may be extracted from the VC set and used as the target attribute statistical value set.

After the target combination statistical value and the target attribute statistical value set are extracted, the statistical value of the preset attribute value combination may be predicted from them in various ways. For example, the statistical value corresponding to the target attribute value may be screened out of the target attribute statistical value set to obtain the target attribute statistical value; the statistical values in the target attribute statistical value set are fused to obtain the fused attribute statistical value corresponding to the target attribute value; the ratio between the target attribute statistical value and the corresponding fused attribute statistical value is determined; and the ratio is fused with the target combination statistical value to obtain the predicted statistical value.

The target attribute statistic may be a statistic of the target attribute value in the data set to be processed. Because the target attribute statistic value set contains the statistic value of each attribute value in the attribute subset corresponding to the target attribute value, the statistic value corresponding to the target attribute value can be directly screened out from the target attribute statistic value set to obtain the target attribute statistic value.

There are various ways to fuse the statistical values in the target attribute statistical value set; for example, the statistical values in the set may be accumulated to obtain the fused statistical value of the attribute to which the target attribute value belongs. For example, where the attribute to which the target attribute value belongs is {business name}, the fused attribute statistical value may be the sum of the statistical values of the two attribute values {business name = business A} and {business name = business B}.

After the statistical values in the target attribute statistical value set are fused, the ratio between the target attribute statistical value and the corresponding fused attribute statistical value can be determined, and the ratio is fused with the target combination statistical value to obtain the predicted statistical value, as shown in formula (4):

$$\mathrm{Est}(p, l) = c_D(p|_{S_1}) \cdot \prod_{A_i \in S_2 \setminus S_1} \frac{c_D(\{A_i = p.A_i\})}{\sum_{a_j \in \mathrm{Dom}(A_i)} c_D(\{A_i = a_j\})} \qquad (4)$$

where $\mathrm{Est}(p, l)$ is the predicted statistical value, $p$ is the preset attribute value combination, $l$ is the data label of the candidate attribute subset, $S_1$ is the candidate attribute subset, $S_2$ is the attribute subset corresponding to the preset attribute value combination, D is the data set to be processed, $p|_{S_1}$ is the target attribute value combination (the pattern $p$ restricted to the attributes in $S_1$), $A_i$ is the target attribute corresponding to a target attribute value, $p.A_i$ is the target attribute value that $p$ assigns to $A_i$, $a_j$ is an attribute value of the target attribute, $c_D(\{A_i = a_j\})$ is the statistical value of that attribute value in the data set to be processed, and $\mathrm{Dom}(A_i)$ is the value range space of attribute $A_i$.

Take as an example the case where the data set D to be processed includes the data of Table 1, the candidate attribute subset is $S_1$ = {working years, job level}, and the preset attribute value combination (pattern) is p = {business name = business A, working years = 2-5 years, job level = second level}. The statistic of pattern p predicted using the data label $l$ corresponding to the candidate attribute subset $S_1$ may be as shown in formula (5):

$$\mathrm{Est}(p, l) = c_D(\{\text{working years} = \text{2-5 years}, \text{job level} = \text{second level}\}) \cdot \frac{c_D(\{\text{business name} = \text{business A}\})}{\sum_{a_j \in \mathrm{Dom}(\text{business name})} c_D(\{\text{business name} = a_j\})} = 6 \cdot \frac{9}{9 + 9} = 3 \qquad (5)$$

where $\mathrm{Est}(p, l)$ is the predicted statistical value of pattern p (the preset attribute value combination), Dom(business name) is the value range space of the attribute {business name}, $a_j$ is an attribute value and may be {business name = business A} or {business name = business B}, and D is the data set to be processed.

When the statistic of the same preset attribute value combination (pattern p) is predicted using the data labels of different candidate attribute subsets, the predicted statistical values may be the same or different. Again take the pattern p = {business name = business A, working years = 2-5 years, job level = second level} as an example, and replace the candidate attribute subset with S' = {business name, working years}, whose data label may be $l' = L_{S'}(D)$. The preset attribute value combination (pattern p) may also be predicted using $l'$, and the prediction may be as shown in formula (6):

$$\mathrm{Est}(p, l') = c_D(\{\text{business name} = \text{business A}, \text{working years} = \text{2-5 years}\}) \cdot \frac{c_D(\{\text{job level} = \text{second level}\})}{\sum_{a_j \in \mathrm{Dom}(\text{job level})} c_D(\{\text{job level} = a_j\})} = 6 \cdot \frac{6}{6 + 6 + 6} = 2 \qquad (6)$$

where $\mathrm{Est}(p, l')$ is the statistical value of pattern p predicted based on the data label $l'$, Dom(job level) is the value range space of the attribute {job level}, $a_j$ is an attribute value and may be {job level = first level}, {job level = second level} or {job level = third level}, and D is the data set to be processed. It can thus be seen that predicting the statistic of the same pattern p with different candidate attribute subsets may yield different predicted statistical values, so the different candidate attribute subsets may have different prediction errors (estimation errors). An optimal target attribute subset therefore needs to be selected from the candidate attribute subsets; the specific procedure is described below.
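The two predictions can be checked with a small Python sketch. It assumes, following the description of formula (4), that the estimate is the PC count of the pattern restricted to the candidate subset multiplied, for each attribute outside the subset, by the ratio of that value's VC count to the sum of the VC counts of its attribute. The counts are the PC and VC values quoted earlier for S = {working years, job level} and S' = {business name, working years}; the function name `estimate` and the data layout are illustrative assumptions.

```python
from fractions import Fraction

# Value counts (VC) quoted in the text for the example data set.
VC = {
    "business name": {"A": 9, "B": 9},
    "working years": {"<2": 6, "2-5": 12},
    "area": {"A": 6, "B": 6, "C": 6},
    "job level": {"first": 6, "second": 6, "third": 6},
}
# Pattern counts (PC) for S = {working years, job level}.
PC_S = {
    frozenset({("working years", "<2"), ("job level", "first")}): 6,
    frozenset({("working years", "2-5"), ("job level", "second")}): 6,
    frozenset({("working years", "2-5"), ("job level", "third")}): 6,
}
# Pattern counts (PC) for S' = {business name, working years}.
PC_S2 = {
    frozenset({("business name", "A"), ("working years", "<2")}): 3,
    frozenset({("business name", "B"), ("working years", "<2")}): 3,
    frozenset({("business name", "A"), ("working years", "2-5")}): 6,
    frozenset({("business name", "B"), ("working years", "2-5")}): 6,
}

def estimate(pattern, subset, pc, vc):
    """PC count of the restricted pattern times one VC ratio per uncovered attribute."""
    restricted = frozenset((a, v) for a, v in pattern.items() if a in subset)
    est = Fraction(pc.get(restricted, 0))
    for attr, value in pattern.items():
        if attr not in subset:
            est *= Fraction(vc[attr][value], sum(vc[attr].values()))
    return est

p = {"business name": "A", "working years": "2-5", "job level": "second"}
est_s = estimate(p, {"working years", "job level"}, PC_S, VC)        # 6 * 9/18
est_s2 = estimate(p, {"business name", "working years"}, PC_S2, VC)  # 6 * 6/18
```

Exact fractions are used instead of floating point so that the two estimates compare cleanly against integer counts.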

In this scheme, the number of occurrences of each attribute value combination in the data set to be processed is recorded according to the different attribute value combinations, and the number of occurrences of any combination in the data set to be processed is then predicted (estimated) through the estimation function of formula (4), thereby realizing automatic data label generation.

105. Screen out the target attribute subset from the candidate attribute subsets based on the prediction statistic value, and adjust the attribute data according to the data label of the target attribute subset to obtain a target data set.

The target attribute subset may be understood as the optimal attribute subset among the candidate attribute subsets, i.e. the one whose data label predicts the statistics of arbitrary or preset attribute value combinations (patterns) in the data set with the highest accuracy.

The method for obtaining the target data set by screening the target attribute subset from the candidate attribute subset based on the prediction statistic value and adjusting the attribute data according to the data label of the target attribute subset specifically comprises the following steps:

C1. Screen out a target attribute subset from the candidate attribute subsets based on the prediction statistic.

For example, the inclusion relationship between the candidate attribute subsets may be queried in the attribute network, when the inclusion relationship does not exist, a labeling statistical value corresponding to the preset attribute value combination is obtained, the labeling statistical value is compared with a prediction statistical value, so as to obtain a prediction loss corresponding to the candidate attribute subset, and the target attribute subset is screened out from the candidate attribute subset according to the prediction loss.

The inclusion relationship may include relationships such as "contains" and "is contained in". There are various ways to query the inclusion relationship between the candidate attribute subsets in the attribute network. For example, the node relationships between the nodes corresponding to the candidate attribute subsets may be traversed in the attribute network. When a parent node and a child node exist among the nodes corresponding to the candidate attribute subsets, it is determined that an inclusion relationship exists between the candidate attribute subsets; when the nodes corresponding to the candidate attribute subsets are all located in the same layer of the attribute network, that is, when no parent-child relationship exists between the nodes, it is determined that no inclusion relationship exists between the candidate attribute subsets. For example, where the attribute set is {g, a, r, m}: when the candidate attribute subsets are {g, a} and {a, r}, no inclusion relationship exists between them; when the candidate attribute subsets are {g, a} and {g, a, r}, the attribute subset {g, a} is contained in {g, a, r}, so it is determined that an inclusion relationship exists between the candidate attribute subsets.
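The inclusion check can be sketched in Python using proper-subset comparison on sets. This is a minimal sketch that compares the subsets directly rather than traversing an attribute network; `has_inclusion` is an illustrative name.

```python
from itertools import combinations

def has_inclusion(candidates):
    """Return True if some candidate subset is a proper subset of another."""
    sets = [frozenset(c) for c in candidates]
    return any(a < b or b < a for a, b in combinations(sets, 2))

same_layer = has_inclusion([{"g", "a"}, {"a", "r"}])
parent_child = has_inclusion([{"g", "a"}, {"g", "a", "r"}])
```

Here `same_layer` is false because neither subset contains the other, while `parent_child` is true because {g, a} is contained in {g, a, r}.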

When it is determined that no inclusion relationship exists between the candidate attribute subsets, the labeling statistical value corresponding to the preset attribute value combination is obtained. The labeling statistical value may be understood as the true statistical value of the preset attribute value combination in the data set to be processed.

After the labeling statistical value corresponding to the preset attribute value combination is obtained, the labeling statistical value and the prediction statistical value can be compared to obtain the prediction loss information corresponding to the candidate attribute subset. The prediction loss information characterizes the difference, or error, between the statistical value of the preset attribute value combination (pattern p) predicted from the data label of the candidate attribute subset and its true statistical value. For pattern p, the error of the label $l = L_S(D)$ may be as shown in formula (7):

$$\mathrm{Err}(l, p) = |c_D(p) - \mathrm{Est}(p, l)| \qquad (7)$$

where $\mathrm{Err}(l, p)$ is the error of the prediction, $c_D(p)$ is the labeling statistical value, and $\mathrm{Est}(p, l)$ is the prediction statistical value. The maximum prediction error may also be determined through an error index, so the labeling statistical value and the prediction statistical value may be compared in various ways. For example, the labeling statistical value and the prediction statistical value may each be used in turn as the numerator to determine the ratios between them, giving a ratio pair; a target ratio is screened out of the ratio pair to obtain the prediction error corresponding to the candidate attribute subset; and the prediction errors are fused to obtain the prediction loss corresponding to the candidate attribute subset.

There are various ways to screen the target ratio out of the ratio pair. For example, the larger ratio in the ratio pair may be screened out as the target ratio, and the target ratio is used as the prediction error of the candidate attribute subset for the preset attribute value combination, as shown in formula (8):

$$\text{q-error}(p) = \max\left(\frac{c_D(p)}{\mathrm{est}(p)}, \frac{\mathrm{est}(p)}{c_D(p)}\right) \qquad (8)$$

where q-error(p) is an error index that refers to the error rate of the tag, that is, the ratio of errors in the data items contained in the tag. The prediction error corresponding to the error index may be the maximum error of the prediction, $c_D(p)$ is the labeling statistical value, and $\mathrm{est}(p)$ is the prediction statistical value. It should be noted that the q-error is relative and symmetric, and is a commonly preferred index for penalizing errors in selectivity estimation. In addition, taking the maximum of the two ratios accounts for both overestimated and underestimated predictions. Selectivity estimation is used for query optimization and is related to the quality of the query plan. In this scheme, the absolute maximum error (rather than the average) is of primary concern, because such an error definition is stricter and gives an upper bound on the error over a large number of patterns of the data set.
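The two error measures can be sketched in Python as follows. The sketch assumes, as the surrounding text describes, that the error index takes the larger of the two ratios formed by using the labeling statistical value and the prediction statistical value in turn as numerator; the function names and the handling of non-positive counts are illustrative assumptions.

```python
def abs_error(true_count, est):
    """Formula (7): absolute difference between the true and predicted counts."""
    return abs(true_count - est)

def q_error(true_count, est):
    """Larger of true/est and est/true; assumes both counts are positive."""
    if true_count <= 0 or est <= 0:
        raise ValueError("q-error is only defined here for positive counts")
    return max(true_count / est, est / true_count)

over = q_error(2, 3)   # prediction too high
under = q_error(3, 2)  # prediction too low
```

Because the larger ratio is always taken, overestimating and underestimating by the same factor give the same q-error, which is the symmetry noted above.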

After the target ratio is screened out, the prediction errors corresponding to the target ratios can be fused to obtain the prediction loss corresponding to the candidate attribute subset. There may be one or more preset attribute value combinations. For each preset attribute value combination, the prediction error corresponding to the candidate attribute subset can be obtained, and the prediction errors of the same candidate attribute subset for the different preset attribute value combinations are then fused to obtain the prediction loss information of that candidate attribute subset. In addition, the prediction error herein may be regarded as an absolute error. There are various ways to fuse the prediction errors; for example, an average value of the prediction errors may be determined to obtain the prediction loss, or a mean square error (standard deviation) of the prediction errors may be determined to obtain the prediction loss.

After the labeling statistical values are compared with the prediction statistical values, a target attribute subset may be screened out of the candidate attribute subsets based on the obtained prediction loss. There are various ways to screen out the target attribute subset. For example, the candidate attribute subset with the smallest prediction loss may be screened out of the candidate attribute subsets as the target attribute subset. Alternatively, prediction weights corresponding to the candidate attribute subsets may be determined based on the attribute types in the candidate attribute subsets, the prediction losses are weighted by the prediction weights to obtain weighted prediction losses, and the candidate attribute subset with the smallest weighted prediction loss is screened out of the candidate attribute subsets as the target attribute subset.
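Selecting the target attribute subset by prediction loss can be sketched in Python as follows. The per-pattern errors and the weight used below are hypothetical numbers chosen for illustration, the function names are not from the source, and the "max" fusion option is an assumption added alongside the averaging described above.

```python
from statistics import mean

def prediction_loss(errors, how="mean"):
    """Fuse per-pattern prediction errors into one loss ('mean' or 'max')."""
    return max(errors) if how == "max" else mean(errors)

def pick_target(errors_by_subset, how="mean", weights=None):
    """Return the candidate subset with the smallest (optionally weighted) loss."""
    weights = weights or {}
    losses = {
        subset: weights.get(subset, 1.0) * prediction_loss(errs, how)
        for subset, errs in errors_by_subset.items()
    }
    return min(losses, key=losses.get)

S = ("working years", "job level")
S2 = ("business name", "working years")
# Hypothetical per-pattern errors for the two candidate subsets.
errors = {S: [0, 1], S2: [1, 2]}

unweighted = pick_target(errors)
weighted = pick_target(errors, weights={S: 4.0})
```

With these numbers the unweighted choice is S (mean loss 0.5 against 1.5), while a weight of 4 on S raises its weighted loss to 2.0 and shifts the choice to S2.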

In some embodiments, when the inclusion relationship exists, classifying the candidate attribute subsets based on the inclusion relationship to obtain a first attribute subset and a second attribute subset, wherein the first attribute subset is included in the second attribute subset, determining that the error relationship is that the prediction error of the second attribute subset is smaller than the prediction error of the first attribute subset, and screening the target attribute subset from the candidate attribute subsets based on the error relationship.

The error relationship may be determined as follows. Let the data set D to be processed have attribute set $A$, let $S$ be an attribute subset, i.e. $S \subseteq A$, and let $l = L_S(D)$ be the data label generated using $S$. For a given pattern $p$: if $\mathrm{Est}(p, l) = c_D(p)$, the prediction (estimate) of $p$ by $l$ is accurate; if $\mathrm{Est}(p, l) > c_D(p)$, it is an overestimate; if $\mathrm{Est}(p, l) < c_D(p)$, it is an underestimate. Clearly, for each pattern $p$, if $\mathrm{Attr}(p) \subseteq S$, the prediction of $p$ by $l$ is accurate. Furthermore, it can be proved that, given two attribute subsets $S_1 \subseteq S_2 \subseteq A$ and $l_i = L_{S_i}(D)$ ($i = 1$ or $2$), for each pattern $p$ such that […], letting $p' = p|_{\mathrm{Attr}(p) \cap S_2}$ be the pattern obtained by restricting $p$ to the attributes that appear in $S_2$: if the prediction of $p'$ by $l_1$ is an overestimate (or underestimate) and the prediction of $p$ by $l_2$ is an overestimate (or underestimate), then $\mathrm{Err}(l_2, p) \le \mathrm{Err}(l_1, p)$, i.e. the prediction error of $S_2$ is not greater than the prediction error of $S_1$. Thus, for two attribute subsets $S_1$ and $S_2$, if $S_1 \subseteq S_2$, it can be determined that the data label generated using $S_2$ is more detailed or accurate than the data label generated using $S_1$.

After determining the error relationship, a target attribute subset may be selected from the candidate attribute subsets. There are various ways to screen the target attribute subset, for example, a candidate attribute subset with the smallest prediction error may be screened from the candidate attribute subsets, so as to obtain the target attribute subset. For example, taking the candidate attribute subset including { g }, { g, a } and { g, a, r } as examples, based on the error relationship, it can be determined that the prediction error of { g, a, r } is smaller than { g, a }, and the prediction error of { g, a } is smaller than { g }, so that the candidate attribute subset with the smallest prediction error can be determined as { g, a, r }, and then { g, a, r } is taken as the target attribute subset.
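When inclusion relationships exist, the selection above amounts to keeping the candidate subsets that are not contained in any other candidate. A minimal Python sketch of that rule follows; `most_detailed` is an illustrative name, and the sketch relies only on the error relationship stated above rather than on computed prediction errors.

```python
def most_detailed(candidates):
    """Keep the candidate subsets that are not a proper subset of another candidate."""
    sets = [frozenset(c) for c in candidates]
    return [s for s in sets if not any(s < t for t in sets)]

chain = most_detailed([{"g"}, {"g", "a"}, {"g", "a", "r"}])
mixed = most_detailed([{"g", "a"}, {"a", "r"}, {"g"}])
```

For the chain {g}, {g, a}, {g, a, r} only {g, a, r} remains; in the second call {g} is dropped and the two remaining subsets would still have to be compared by prediction loss.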

It should be noted that, a top-down algorithm may be adopted, and based on the attribute network, the target attribute subset is screened out from the attribute subsets, and the code of the algorithm may be as follows:

For the top-down algorithm, take as an example the case where the data set D to be processed includes the data of Table 1, the pattern set P contains the patterns of all the data in D, and the bound is $B_s = 5$. The specific flow of the top-down algorithm may include the following. Q is initialized to [{g}, {a}, {r}, {m}] and cands is set to the empty set. In the first iteration, {g} is extracted from Q and its children gen({g}) = {{g, a}, {g, r}, {g, m}} are generated; of these, only {g, a} produces a label whose size (number of statistics) is less than 5, so it is added to Q and cands. In the next iteration, {a} is extracted from Q and the elements of gen({a}) = {{a, r}, {a, m}} are checked by the algorithm: {a, r} produces a label of size 3 and {a, m} produces a label of size 6, so {a, r} is added to Q and cands. In the following iterations no other subset produces a label of acceptable size, so the while loop terminates once all elements of Q have been extracted.
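The walkthrough above can be reproduced with a short Python sketch of a top-down search. This is a hedged reconstruction from the prose, not the original listing: `gen` is assumed to extend a subset only with attributes that come later in the attribute order (which matches gen({g}) and gen({a}) above), the size test uses "does not exceed the bound", and the label sizes other than 3 and 6 for {a, r} and {a, m} are hypothetical values chosen to be consistent with the walkthrough.

```python
from collections import deque

def gen(subset, attrs):
    """Children of `subset`: add one attribute that comes after its last attribute."""
    last = max(attrs.index(a) for a in subset)
    return [subset + (a,) for a in attrs[last + 1:]]

def top_down(attrs, label_size, bound):
    """Collect the subsets reachable top-down whose label size is within `bound`."""
    queue = deque((a,) for a in attrs)
    cands = []
    while queue:
        curr = queue.popleft()
        for child in gen(curr, attrs):
            if label_size(child) <= bound:
                queue.append(child)
                cands.append(child)
    return cands

# Hypothetical label sizes (number of PC entries); unlisted subsets are too large.
SIZES = {
    frozenset("ga"): 4, frozenset("gr"): 6, frozenset("gm"): 6,
    frozenset("ar"): 3, frozenset("am"): 6, frozenset("rm"): 6,
}

def label_size(subset):
    return SIZES.get(frozenset(subset), 99)

cands = top_down(("g", "a", "r", "m"), label_size, bound=5)
```

The resulting candidates are {g, a} and {a, r}; the target attribute subset would then be chosen among them by prediction loss.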

In the scheme, based on a top-down algorithm, the generated tag is optimized by constructing an attribute network (tag network), so that the accuracy of the data tag and the data query efficiency can be improved. Moreover, the algorithm has good scalability, can process large-scale data sets, and meanwhile, the generated labels are more accurate than the traditional direct search algorithm.

C2. Adjust the attribute data according to the data label of the target attribute subset to obtain a target data set.

For example, a candidate attribute value combination may be obtained, statistics of the candidate attribute value combination may be predicted based on data labels of the target attribute subset, and attribute data may be adjusted based on statistics corresponding to the candidate attribute value combination, to obtain the target data set.

The candidate attribute value combination may include a plurality of candidate attribute values. The candidate attribute value combinations may include at least one attribute value combination used to evaluate the data quality and reliability of the data to be processed. The statistical value of a candidate attribute value combination is predicted from the data label of the target attribute subset in a manner similar to the prediction for the preset attribute value combination, which is not repeated here.

After the statistics of the candidate attribute value combinations are predicted, the attribute data can be adjusted based on the statistics corresponding to the candidate attribute value combinations. The manner in which the attribute data is adjusted is described in detail above, and will not be described in detail here.

As can be seen from the foregoing, after the data set to be processed is obtained, the embodiment of the present application constructs an attribute network by using the attribute subsets in the attribute set corresponding to the data set to be processed as nodes. Based on the attribute network, the statistical values of the attribute value combinations corresponding to the nodes are counted in the data set to be processed to obtain the data label of at least one candidate attribute subset among the attribute subsets. The statistical value of a preset attribute value combination in the data set to be processed is then predicted according to the data label, where the attribute subset corresponding to the preset attribute value combination includes the candidate attribute subset. Based on the predicted statistical value, a target attribute subset is screened out from the candidate attribute subsets, and the attribute data corresponding to each attribute in the data set to be processed is adjusted according to the data label of the target attribute subset to obtain the target data set. Because the scheme constructs the attribute network, uses it as a guide to screen the target attribute subset out of the attribute subsets, and uses the statistical values of the limited attribute value combinations in the target attribute subset as the data label for adjusting the attribute data, the applicability of the data can be determined more accurately and data deviation can be eliminated under limited computing resources, so that the accuracy of data processing can be improved.

According to the method described in the above embodiments, examples are described in further detail below.

In this embodiment, the data processing apparatus is specifically integrated in an electronic device, and the electronic device is exemplified as a server.

As shown in fig. 4, a data processing method specifically includes the following steps:

201. The server obtains a data set to be processed.

For example, the server may directly obtain the data set to be processed uploaded by a terminal; or may screen attribute data of a plurality of attributes from a database to obtain the data set to be processed; or may obtain at least one training data set, screen out of the training data sets a target training data set that may carry a risk of data migration, and use the target training data set as the data set to be processed; or may obtain a plurality of attribute data of at least one object from a network or a database to obtain the data set to be processed. Alternatively, when the data to be processed occupies a large amount of memory or contains a large number of attribute data items, the server may receive a data processing request that carries the storage address of the data to be processed, and obtain the data set to be processed based on the storage address.

202. The server determines attribute relationships between attributes in the set of attributes based on the attribute types of the attributes.

For example, the server may determine an attribute type of each attribute in the attribute set, screen out a target attribute corresponding to the attribute type from a preset association attribute set, determine that there is a correlation between attributes in the attribute set when there is a target attribute in the attribute set, determine that there is no correlation between attributes in the attribute set when there is no target attribute in the attribute set, or may also obtain attribute information such as a type of an attribute and an attribute identifier, identify an association attribute having an association relationship in the attribute set by using an association relationship identification model based on the attribute information, determine that there is a correlation between attributes when there is an association attribute, determine that there is no correlation between attributes when there is no association attribute, and so on.

203. When the correlation exists between the attributes, the server constructs an attribute network by taking the attribute subset in the attribute set as the node.

For example, when there is a correlation between attributes, the server may extract all attribute subsets from the attribute set, determine inclusion relationships between the attribute subsets based on elements in the attribute subsets, and construct a graph network using the attribute subsets as nodes based on the inclusion relationships, thereby obtaining an attribute network.
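The construction of the attribute network can be sketched in Python as follows. This is a minimal sketch under the assumption that an edge links each attribute subset to every superset with exactly one more attribute (a direct parent-child relationship); the function name and the adjacency-dictionary representation are illustrative.

```python
from itertools import combinations

def build_attribute_network(attrs):
    """Nodes are all non-empty attribute subsets; edges point to one-larger supersets."""
    nodes = [
        frozenset(c)
        for k in range(1, len(attrs) + 1)
        for c in combinations(attrs, k)
    ]
    edges = {
        n: [m for m in nodes if len(m) == len(n) + 1 and n < m]
        for n in nodes
    }
    return nodes, edges

nodes, edges = build_attribute_network(("g", "a", "r", "m"))
children_of_g = edges[frozenset({"g"})]
```

For four attributes the network has 15 nodes, and the node {g} has the three children {g, a}, {g, r} and {g, m}.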

204. The server sorts each attribute in the attribute set, and based on the sorting result, adds each attribute as an element to the initial query queue to obtain the query queue.

For example, the server may sort each attribute in the attribute set based on the importance of the attribute in the attribute set, so as to obtain a sorting result corresponding to the attribute set, or may sort the attribute set based on the frequency of the attribute in the attribute set, so as to obtain a sorting result corresponding to the attribute set, or the like.

After ordering each attribute in the attribute set, the server can add each attribute as an element to the initial query queue, thereby obtaining the query queue after adding the element.
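Building the query queue from sorted attributes can be sketched in Python as follows. The importance scores are hypothetical, and `build_query_queue` is an illustrative name; a frequency-based ordering would only change the scores passed in.

```python
from collections import deque

def build_query_queue(attrs, importance):
    """Sort attributes by descending importance and enqueue them in that order."""
    return deque(sorted(attrs, key=lambda a: importance[a], reverse=True))

# Hypothetical importance scores for the attributes g, a, r and m.
importance = {"g": 0.4, "a": 0.3, "r": 0.2, "m": 0.1}
queue = build_query_queue(["m", "r", "g", "a"], importance)
first = queue[0]
```

A double-ended queue is used so that the first element can later be removed cheaply when it is selected as the target element.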

205. The server screens out at least one candidate node from the nodes according to the attribute network and the query queue.

For example, the server may screen the first element in the query queue to obtain the target element, or may screen an element corresponding to the preset ordering position in the query queue to obtain the target element. And identifying the target node corresponding to the target element in the attribute network. And screening out other attributes except the target attribute in the attribute subset corresponding to the target node from the attribute set, traversing the nodes corresponding to at least one attribute subset consisting of the target attribute and any one or more other attributes in the attribute network, obtaining at least one child node, and taking the at least one child node as a candidate node.

206. The server counts, in the data set to be processed, the candidate statistical values of the attribute value combinations corresponding to the candidate nodes, so as to obtain the data tag of at least one candidate attribute subset among the attribute subsets.

For example, the server may count the number of data or the number of occurrences corresponding to the attribute value combinations corresponding to the candidate nodes in the data set to be processed, thereby obtaining corresponding candidate statistics values, or may count the frequency of the attribute value combinations corresponding to the candidate nodes in the data set to be processed, thereby obtaining corresponding candidate statistics values, and so on.

The server may screen, from the attribute subsets, at least one attribute subset whose number of candidate statistic values does not exceed a preset number threshold to obtain an initial candidate attribute subset, add the initial candidate attribute subset to the query queue as an element, and delete the target element from the query queue to obtain an updated query queue. The server then takes the updated query queue as the query queue and returns to the step of screening at least one candidate node from the nodes according to the attribute network and the query queue, until no further initial candidate attribute subset exists, and takes the initial candidate attribute subsets as the candidate attribute subsets.
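The iterative screening loop can be sketched as a breadth-first search over attribute subsets. This is an illustrative sketch under assumptions: children add one attribute at a time, `max_combos` stands in for the preset number threshold, and the names are hypothetical.

```python
from collections import Counter, deque

def search_candidates(rows, attributes, max_combos):
    """Collect subsets whose number of distinct value combinations stays within max_combos."""
    queue = deque(frozenset([a]) for a in attributes)
    seen = set(queue)
    candidates = []
    while queue:
        target = queue.popleft()  # the target element is removed from the queue
        for attr in attributes:
            if attr in target:
                continue
            child = target | {attr}
            if child in seen:
                continue
            seen.add(child)
            cols = sorted(child)
            counts = Counter(tuple(r[c] for c in cols) for r in rows)
            if len(counts) <= max_combos:
                candidates.append(child)
                queue.append(child)  # qualifying subsets are expanded later
    return candidates

# Illustrative data only.
rows = [
    {"gender": "F", "city": "SZ", "age": "20s"},
    {"gender": "M", "city": "BJ", "age": "30s"},
    {"gender": "F", "city": "GZ", "age": "20s"},
    {"gender": "M", "city": "SZ", "age": "20s"},
]
candidates = search_candidates(rows, ["gender", "age", "city"], max_combos=3)
```

The loop ends when the queue is empty, that is, when no newly qualifying subset remains to be expanded.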

The server may screen out, from the candidate statistic values, the combination statistic value corresponding to each attribute value combination in the candidate attribute subset to obtain a combination statistic value set PC, count in the data set to be processed the attribute statistic values of the attribute values corresponding to each attribute to obtain an attribute statistic value set VC, and fuse the combination statistic value set PC with the attribute statistic value set VC to obtain a data tag, as described above, which is not described in detail herein.
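As a rough illustration of a data tag, the sketch below stores PC and VC together in one dictionary. The dictionary layout and the name `build_label` are assumptions; the description only states that PC and VC are fused into the tag.

```python
from collections import Counter

def build_label(rows, attributes, candidate_subset):
    """Build a data tag: PC for the candidate subset, VC for every single attribute."""
    cols = tuple(sorted(candidate_subset))
    pc = Counter(tuple(row[c] for c in cols) for row in rows)
    vc = {a: Counter(row[a] for row in rows) for a in attributes}
    return {"attrs": cols, "PC": pc, "VC": vc}

# Illustrative data only.
rows = [
    {"gender": "F", "city": "SZ", "age": "20s"},
    {"gender": "M", "city": "BJ", "age": "30s"},
    {"gender": "F", "city": "GZ", "age": "20s"},
    {"gender": "M", "city": "SZ", "age": "20s"},
]
label = build_label(rows, ["gender", "city", "age"], {"gender", "age"})
```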

207. The server predicts the statistic value of the preset attribute value combination in the data set to be processed according to the data tag.

For example, the server may compare the preset attribute value combination with the attribute value combinations corresponding to the candidate attribute subset, screen out, based on the comparison result, a target attribute value combination containing attribute values of the preset attribute value combination, and screen out at least one attribute value in the preset attribute value combination other than those in the target attribute value combination to obtain a target attribute value. Alternatively, the server may determine the current set of attribute value combinations corresponding to the candidate attribute subset, screen out from the current set the attribute value combination that is a subset of the preset attribute value combination to obtain the target attribute value combination, and screen out at least one attribute value in the preset attribute value combination other than the target attribute value combination to obtain the target attribute value.

The server can extract a combination statistical value corresponding to the target attribute value combination from the PC of the data tag to obtain a target combination statistical value, and extract a statistical value set of the attribute corresponding to the target attribute value from the VC set of the data tag to obtain a target attribute statistical value set.

The server can screen out the statistic value corresponding to the target attribute value from the target attribute statistic value set to obtain the target attribute statistic value. The server accumulates the statistic values in the target attribute statistic value set to obtain the fused attribute statistic value of the attribute corresponding to the target attribute value. The server then determines the ratio between the target attribute statistic value and the corresponding fused attribute statistic value, and fuses the ratio with the target combination statistic value to obtain the predicted statistic value, as shown in formula (4).
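The prediction step can be sketched as below. Formula (4) is not reproduced in this passage, so the sketch assumes that "fusing" means multiplying the target combination statistic value by each ratio; the tag layout (`attrs`, `PC`, `VC`) and all values are hypothetical.

```python
def predict_count(label, pattern):
    """Estimate the count of `pattern` from a data tag (assumed multiplicative fusion)."""
    cols = label["attrs"]
    # Target attribute value combination: the part of the pattern covered by PC.
    combo = tuple(pattern[c] for c in cols)
    estimate = float(label["PC"].get(combo, 0))
    for attr, value in pattern.items():
        if attr in cols:
            continue
        # Target attribute value: scale by its share within its own attribute.
        counts = label["VC"][attr]
        fused = sum(counts.values())
        estimate *= counts.get(value, 0) / fused
    return estimate

# Illustrative tag for the candidate subset {age, gender}.
label = {
    "attrs": ("age", "gender"),
    "PC": {("20s", "F"): 2, ("20s", "M"): 1, ("30s", "M"): 1},
    "VC": {
        "age": {"20s": 3, "30s": 1},
        "gender": {"F": 2, "M": 2},
        "city": {"SZ": 2, "BJ": 1, "GZ": 1},
    },
}
predicted = predict_count(label, {"age": "20s", "gender": "F", "city": "SZ"})
print(predicted)
```

In this example the pattern extends the candidate subset by the attribute `city`, so the stored count of 2 is scaled by the share 2/4 of the value `SZ`.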

208. The server screens out a target attribute subset from the candidate attribute subsets based on the predicted statistic value.

For example, the server may traverse the node relationships between the nodes corresponding to the candidate attribute subsets in the attribute network. When there are parent nodes and child nodes among the nodes corresponding to the candidate attribute subsets, the server determines that an inclusion relationship exists between the candidate attribute subsets. When the nodes corresponding to the candidate attribute subsets are all located at the same layer of the attribute network, i.e., there is neither a parent node nor a child node among them, the server determines that no inclusion relationship exists between the candidate attribute subsets.

When the server determines that no inclusion relationship exists between the candidate attribute subsets, it acquires the labeling statistic value corresponding to the preset attribute value combination. Taking the labeling statistic value and the predicted statistic value in turn as the numerator, the server determines the ratios between the labeling statistic value and the predicted statistic value to obtain a ratio pair. The server screens out the maximum ratio from the ratio pair to obtain a target ratio, and takes the target ratio as the prediction error of the candidate attribute subset for the preset attribute value combination.

The server determines the average value of the prediction errors to obtain the prediction loss, or may determine the mean square error (standard deviation) of the prediction errors to obtain the prediction loss. The server then screens out the candidate attribute subset with the minimum prediction loss from the candidate attribute subsets to obtain the target attribute subset. Alternatively, the server may determine the prediction weight corresponding to each candidate attribute subset based on the attribute types in the candidate attribute subset, weight the prediction losses by the prediction weights to obtain weighted prediction losses, and screen out the candidate attribute subset with the minimum weighted prediction loss from the candidate attribute subsets to obtain the target attribute subset, and so on.
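The ratio-pair error and the loss-based selection can be sketched as follows. Clamping values below 1 to 1 is an assumption added here to avoid division by zero; the subset names `S1` and `S2` and all numbers are illustrative.

```python
from statistics import mean

def prediction_error(labeled, predicted):
    """Larger of labeled/predicted and predicted/labeled (the ratio pair)."""
    # Assumption: values below 1 are clamped to 1 to avoid division by zero.
    a = max(float(labeled), 1.0)
    p = max(float(predicted), 1.0)
    return max(a / p, p / a)

def prediction_loss(pairs):
    """Average prediction error over (labeled, predicted) pairs."""
    return mean(prediction_error(a, p) for a, p in pairs)

def pick_target(losses):
    """Return the candidate subset with the smallest prediction loss."""
    return min(losses, key=losses.get)

# Hypothetical candidate subsets with (labeled, predicted) statistic pairs.
losses = {
    "S1": prediction_loss([(10, 5), (4, 8)]),
    "S2": prediction_loss([(10, 10), (4, 6)]),
}
target = pick_target(losses)
print(losses, target)
```

A perfect prediction yields an error of 1, so lower losses indicate tags that reproduce the labeled statistic values more closely.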

When the server determines that an inclusion relationship exists between the candidate attribute subsets, it classifies the candidate attribute subsets based on the inclusion relationship to obtain a first attribute subset and a second attribute subset, wherein the first attribute subset is contained in the second attribute subset, and determines the error relationship to be that the prediction error of the second attribute subset is smaller than that of the first attribute subset. The server then screens out the candidate attribute subset with the minimum prediction error from the candidate attribute subsets, thereby obtaining the target attribute subset.
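The inclusion check can be sketched with set comparison. Under the stated error relationship, a contained subset never has a smaller error than the subset containing it, so this sketch simply discards contained subsets; the function names are hypothetical, and any remaining subsets would still have to be compared by prediction loss.

```python
def has_inclusion(subsets):
    """True if some candidate subset is strictly contained in another."""
    return any(a < b for a in subsets for b in subsets)

def keep_containing_subsets(subsets):
    """Drop every subset that is strictly contained in another candidate."""
    return [s for s in subsets if not any(s < t for t in subsets)]

# Illustrative candidate attribute subsets.
subsets = [
    frozenset({"age"}),
    frozenset({"age", "city"}),
    frozenset({"gender"}),
]
remaining = keep_containing_subsets(subsets)
```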

209. The server adjusts the attribute data according to the data tag of the target attribute subset to obtain a target data set.

For example, the server may obtain the candidate attribute value combinations and predict the statistic values of the candidate attribute value combinations based on the data tag of the target attribute subset. The manner of predicting the statistic values of the candidate attribute value combinations is similar to that used for the preset attribute value combination, which is described in detail above and is not repeated here.

The server may determine an adjustment manner corresponding to the data set to be processed based on the data type of the data set to be processed. When the data set to be processed is training data, the server performs data correction on the data set to be processed based on the statistic values corresponding to the candidate attribute value combinations to obtain the target data set; when the data set to be processed is financial data, the server performs data cleaning on the data set to be processed to obtain the target data set; and when the data set to be processed is an object recommendation data set, the server screens out at least one target object corresponding to the target attribute value combination from the data set to be processed based on the statistic values corresponding to the candidate attribute value combinations to obtain the target data set. This may specifically be as follows:

When the data set to be processed is training data, the server can evaluate the data deviation of the data set to be processed based on the statistic values corresponding to the candidate attribute value combinations, and when data deviation exists in the data set to be processed, perform downsampling, oversampling or data enhancement on the data set to be processed, so as to obtain the target data set.

When the data set to be processed is financial data, the server can identify abnormal data in the data set to be processed through the statistic values corresponding to the candidate attribute value combinations, and clean the abnormal data in the data set to be processed to obtain the target data set. For example, the abnormal data may include object clusters with abnormal interaction behavior, and the abnormal object clusters are deleted from the data set to be processed, so as to obtain the target data set.

When the data set to be processed is an object recommendation data set, the server may screen out, from the data set to be processed, at least one piece of target attribute data whose statistic value corresponding to the candidate attribute value combination exceeds a preset statistic value threshold, and identify at least one target object in the target attribute data, thereby obtaining the target data set. For example, taking the data set to be processed as a historical interaction behavior data set of objects, the server predicts, from the statistic values, the differences in click rate or conversion rate among clients with different attributes, selects at least one target attribute value combination whose click rate or conversion rate is higher than a preset threshold, and identifies the target objects corresponding to the target attribute value combination in the data set to be processed, so as to obtain the target data set.
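Two of the adjustment manners above, oversampling for training data and threshold screening for recommendation data, can be sketched as follows. This is illustrative only: group sizes stand in for the statistic values in `oversample`, and the `stats` mapping, threshold and function names are hypothetical.

```python
import random
from collections import defaultdict

def oversample(rows, attrs, seed=0):
    """Random oversampling so every attribute value combination is equally frequent."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for row in rows:
        groups[tuple(row[a] for a in attrs)].append(row)
    target = max(len(g) for g in groups.values())
    balanced = []
    for group in groups.values():
        balanced.extend(group)
        # Duplicate randomly chosen rows of under-represented combinations.
        balanced.extend(rng.choice(group) for _ in range(target - len(group)))
    return balanced

def select_targets(rows, attrs, stats, threshold):
    """Keep rows whose attribute value combination has a statistic above threshold."""
    return [r for r in rows if stats.get(tuple(r[a] for a in attrs), 0) > threshold]

# Illustrative data: three "F" objects and one "M" object.
rows = [{"id": i, "gender": g} for i, g in enumerate(["F", "F", "F", "M"])]
balanced = oversample(rows, ["gender"])
# Hypothetical predicted click rates per attribute value combination.
stats = {("F",): 0.12, ("M",): 0.03}
targets = select_targets(rows, ["gender"], stats, threshold=0.05)
```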

After the target objects corresponding to the target attribute values are screened out from the data set to be processed, the server can acquire content to be recommended, such as advertisements, commodities or other recommendable content, and recommend the content to be recommended to each target object in the target data set.

210. When there is no correlation among the attributes, the server counts the current statistical value of the attribute value corresponding to each attribute in the data set to be processed, and adjusts the attribute data based on the current statistical value to obtain a target data set.

For example, when there is no correlation between attributes, the server may traverse attribute data corresponding to each attribute in the data set to be processed, and count the number or frequency of occurrence of each attribute value in the attribute data, thereby obtaining the current statistical value.

The server may obtain a candidate attribute value combination, and extract a first statistical value of each candidate attribute value and a second statistical value of the attribute corresponding to the candidate attribute value from the current statistical values. The server determines the statistical value ratio between the first statistical value and the second statistical value, and fuses the current statistical values to obtain a current fused statistical value. The server then fuses the current fused statistical value with the statistical value ratio corresponding to each candidate attribute value, thereby obtaining the statistical value corresponding to the candidate attribute value combination. Taking $b_i \in \{0,1\}$ as an example, for a candidate attribute value combination (pattern) $p = \{A_{i_1} = b_{i_1}, \ldots, A_{i_k} = b_{i_k}\}$, the statistical value of the candidate attribute value combination may be as shown in formula (1).
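For the uncorrelated case, the estimate can be sketched as below. Formula (1) is not reproduced in this passage, so the sketch assumes the usual independence form: the total number of records (taken here as the current fused statistical value) multiplied by the ratio of each candidate attribute value. The names and data are illustrative.

```python
from collections import Counter
from math import prod

def independent_estimate(rows, pattern):
    """Estimate the count of `pattern` assuming the attributes are independent."""
    total = len(rows)  # assumed to play the role of the current fused statistical value
    ratios = []
    for attr, value in pattern.items():
        counts = Counter(row[attr] for row in rows)
        first = counts[value]          # statistic of the candidate attribute value
        second = sum(counts.values())  # statistic of the whole attribute
        ratios.append(first / second)
    return total * prod(ratios)

# Illustrative binary attributes A1 and A2.
rows = [
    {"A1": 1, "A2": 1},
    {"A1": 1, "A2": 0},
    {"A1": 0, "A2": 1},
    {"A1": 0, "A2": 0},
]
estimate = independent_estimate(rows, {"A1": 1, "A2": 1})
print(estimate)
```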

After the server fuses the current fusion statistic value and the statistic value ratio, the attribute data can be adjusted based on the statistic value corresponding to the fused candidate attribute value, so that a target data set is obtained. The manner of adjusting the attribute data may be referred to above, and will not be described in detail herein.

As can be seen from the foregoing, after obtaining a data set to be processed, the server in this embodiment uses the attribute subsets in the attribute set corresponding to the data set to be processed as nodes to construct an attribute network, and then, based on the attribute network, counts the statistic values of the attribute value combinations corresponding to the nodes in the data set to be processed to obtain a data tag of at least one candidate attribute subset among the attribute subsets. The server then predicts, according to the data tag, the statistic value of a preset attribute value combination in the data set to be processed, where the attribute subset corresponding to the preset attribute value combination includes the candidate attribute subset, screens out a target attribute subset from the candidate attribute subsets based on the predicted statistic value, and adjusts the attribute data corresponding to each attribute in the data set to be processed according to the data tag of the target attribute subset to obtain the target data set. In this scheme, the attribute network is constructed and used as a guide to screen out the target attribute subset from the attribute subsets, and the statistic values of a limited number of attribute value combinations in the target attribute subset are used as the data tag to adjust the attribute data, so that data applicability can be determined more accurately and data deviation can be eliminated under limited computing resources, and the accuracy of data processing can therefore be improved.

In order to better implement the above method, the embodiment of the present invention further provides a data processing apparatus, where the data processing apparatus may be integrated into an electronic device, such as a server or a terminal, where the terminal may include a tablet computer, a notebook computer, and/or a personal computer.

For example, as shown in fig. 5, the data processing apparatus may include an acquisition unit 301, a construction unit 302, a statistics unit 303, a prediction unit 304, and a screening unit 305, as follows:

(1) An acquisition unit 301;

The acquisition unit 301 is configured to obtain a data set to be processed, where the data set to be processed includes attribute data corresponding to each attribute in the attribute set.

For example, the acquisition unit 301 may specifically be configured to directly obtain a data set to be processed uploaded by the terminal; or to screen attribute data of a plurality of attributes from a database, thereby obtaining the data set to be processed; or to obtain at least one training data set, screen out from the training data sets a target training data set that may be at risk of data deviation, and use the target training data set as the data set to be processed; or to obtain a plurality of attribute data of at least one object from a network or a database, thereby obtaining the data set to be processed; or, when the data to be processed occupies a large amount of memory or contains a large number of attribute data, to receive a data processing request carrying a storage address of the data to be processed, and obtain the data set to be processed based on the storage address.

(2) A construction unit 302;

The construction unit 302 is configured to construct an attribute network with the attribute subsets of the attribute set as nodes, where the attribute network indicates a containment relationship between the nodes.

For example, the construction unit 302 may specifically be configured to extract all attribute subsets from the attribute set, determine, based on the elements in the attribute subsets, a containment relationship between the attribute subsets, and construct a graph network with the attribute subsets as nodes based on the containment relationship, thereby obtaining the attribute network.
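A minimal sketch of such an attribute network is shown below, assuming an adjacency mapping from each attribute subset to the subsets that contain it plus exactly one more attribute. The representation and function name are assumptions, and enumerating all subsets grows exponentially with the number of attributes.

```python
from itertools import combinations

def build_attribute_network(attributes):
    """Map every non-empty attribute subset to its direct child subsets."""
    attrs = sorted(attributes)
    nodes = [
        frozenset(combo)
        for k in range(1, len(attrs) + 1)
        for combo in combinations(attrs, k)
    ]
    network = {node: [] for node in nodes}
    for node in nodes:
        for attr in attrs:
            if attr not in node:
                # Edge parent -> child encodes the containment relationship.
                network[node].append(node | {attr})
    return network

network = build_attribute_network({"age", "city", "gender"})
print(len(network))
```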

(3) A statistics unit 303;

The statistics unit 303 is configured to count, based on the attribute network, the statistic values of the attribute value combinations corresponding to the nodes in the data set to be processed, so as to obtain the data tag of at least one candidate attribute subset among the attribute subsets.

For example, the statistics unit 303 may be specifically configured to sort each attribute in the attribute set, add each attribute as an element to the query queue based on the sorting result, screen at least one candidate node from the nodes according to the attribute network and the query queue, and count candidate statistics of attribute value combinations corresponding to the candidate node in the data set to be processed, so as to obtain a data tag of at least one candidate attribute subset in the attribute subsets.

(4) A prediction unit 304;

The prediction unit 304 is configured to predict, according to the data tag, a statistic value of a preset attribute value combination in the data set to be processed, where the attribute subset corresponding to the preset attribute value combination includes a candidate attribute subset.

For example, the prediction unit 304 may specifically be configured to determine, based on a preset attribute value combination, a target attribute value combination and a target attribute value corresponding to the candidate attribute subset, extract from the data tag a target combination statistic value corresponding to the target attribute value combination and a target attribute statistic value set corresponding to the target attribute value, and predict, based on the target combination statistic value and the target attribute statistic value set, the statistic value of the preset attribute value combination, so as to obtain a predicted statistic value.

(5) A screening unit 305;

The screening unit 305 is configured to screen out a target attribute subset from the candidate attribute subsets based on the predicted statistic value, and adjust the attribute data according to the data tag of the target attribute subset, so as to obtain a target data set.

For example, the screening unit 305 may specifically be configured to query the attribute network for inclusion relationships between the candidate attribute subsets, obtain a labeling statistic value corresponding to the preset attribute value combination when no inclusion relationship exists, compare the labeling statistic value with the predicted statistic value to obtain a prediction loss corresponding to each candidate attribute subset, screen out the target attribute subset from the candidate attribute subsets according to the prediction loss, and adjust the attribute data according to the data tag of the target attribute subset to obtain the target data set.

Optionally, in some embodiments, the data processing apparatus may further include an adjustment unit 306, as shown in fig. 6, which may specifically be as follows:

The adjustment unit 306 is configured to, when there is no correlation between the attributes, count a current statistical value of the attribute values corresponding to each attribute in the data set to be processed, and adjust the attribute data based on the current statistical value to obtain the target data set.

For example, the adjustment unit 306 may be specifically configured to obtain a candidate attribute value combination, where the candidate attribute value combination includes a plurality of candidate attribute values, extract a first statistical value of the candidate attribute values and a second statistical value of the attribute corresponding to the candidate attribute values from the current statistical values, determine a statistical value ratio between the first statistical value and the second statistical value, fuse the current statistical values to obtain a current fused statistical value, fuse the current fused statistical value with the statistical value ratio to obtain a statistical value corresponding to the candidate attribute value combination, and adjust the attribute data based on the statistical value corresponding to the candidate attribute value combination to obtain the target data set.

In specific implementation, each of the above units may be implemented as an independent entity, or the units may be combined arbitrarily and implemented as the same entity or as several entities. For the specific implementation of each unit, refer to the foregoing method embodiment, which is not repeated here.

As can be seen from the foregoing, in this embodiment, after the acquisition unit 301 obtains the data set to be processed, the construction unit 302 uses the attribute subsets in the attribute set corresponding to the data set to be processed as nodes to construct an attribute network, and the statistics unit 303 then counts, based on the attribute network, the statistic values of the attribute value combinations corresponding to the nodes in the data set to be processed, so as to obtain the data tag of at least one candidate attribute subset among the attribute subsets. The prediction unit 304 then predicts, according to the data tag, the statistic value of the preset attribute value combination in the data set to be processed, where the attribute subset corresponding to the preset attribute value combination includes the candidate attribute subset. The screening unit 305 screens out the target attribute subset from the candidate attribute subsets based on the predicted statistic value, and adjusts the attribute data corresponding to each attribute in the data set to be processed according to the data tag of the target attribute subset, so as to obtain the target data set. In this scheme, the attribute network is constructed and used as a guide to screen out the target attribute subset from the attribute subsets, and the statistic values of a limited number of attribute value combinations in the target attribute subset are used as the data tag to adjust the attribute data, so that data applicability can be determined more accurately and data deviation can be eliminated under limited computing resources, and the accuracy of data processing can therefore be improved.

The embodiment of the invention also provides an electronic device. Fig. 7 shows a schematic structural diagram of the electronic device according to the embodiment of the invention. Specifically:

The electronic device may include a processor 401 having one or more processing cores, a memory 402 comprising one or more computer-readable storage media, a power supply 403, an input unit 404, and other components. It will be appreciated by those skilled in the art that the electronic device structure shown in fig. 7 does not limit the electronic device, which may include more or fewer components than shown, may combine certain components, or may use a different arrangement of components. Wherein:

the processor 401 is a control center of the electronic device, connects various parts of the entire electronic device using various interfaces and lines, and performs various functions of the electronic device and processes data by running or executing software programs and/or modules stored in the memory 402, and calling data stored in the memory 402. Optionally, processor 401 may include one or more processing cores; preferably, the processor 401 may integrate an application processor and a modem processor, wherein the application processor mainly processes an operating system, a user interface, an application program, etc., and the modem processor mainly processes wireless communication. It will be appreciated that the modem processor described above may not be integrated into the processor 401.

The memory 402 may be used to store software programs and modules, and the processor 401 executes various functional applications and data processing by running the software programs and modules stored in the memory 402. The memory 402 may mainly include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function (such as a sound playing function, an image playing function, etc.), and the like; the storage data area may store data created according to the use of the electronic device, etc. In addition, the memory 402 may include high-speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, a flash memory device, or another non-volatile solid-state storage device. Accordingly, the memory 402 may also include a memory controller to provide the processor 401 with access to the memory 402.

The electronic device further comprises a power supply 403 for supplying power to the various components, preferably the power supply 403 may be logically connected to the processor 401 by a power management system, so that functions of managing charging, discharging, and power consumption are performed by the power management system. The power supply 403 may also include one or more of any of a direct current or alternating current power supply, a recharging system, a power failure detection circuit, a power converter or inverter, a power status indicator, and the like.

The electronic device may further comprise an input unit 404, which input unit 404 may be used for receiving input digital or character information and generating keyboard, mouse, joystick, optical or trackball signal inputs in connection with user settings and function control.

Although not shown, the electronic device may further include a display unit or the like, which is not described herein. In particular, in this embodiment, the processor 401 in the electronic device loads executable files corresponding to the processes of one or more application programs into the memory 402 according to the following instructions, and the processor 401 executes the application programs stored in the memory 402, so as to implement various functions as follows:

For example, the electronic device may obtain a data set to be processed including attribute data corresponding to each attribute in the attribute set, and determine an attribute relationship between the attributes in the attribute set based on the attribute types of the attributes. When there is a correlation between the attributes, the electronic device constructs an attribute network with the attribute subsets in the attribute set as nodes. It sorts each attribute in the attribute set, and adds each attribute as an element to the initial query queue based on the sorting result to obtain a query queue. It screens out at least one candidate node from the nodes according to the attribute network and the query queue, and counts the candidate statistic values of the attribute value combinations corresponding to the candidate nodes in the data set to be processed to obtain the data tag of at least one candidate attribute subset among the attribute subsets. It determines a target attribute value combination and a target attribute value corresponding to a candidate attribute subset based on a preset attribute value combination, extracts from the data tag a target combination statistic value corresponding to the target attribute value combination and a target attribute statistic value set corresponding to the target attribute value, and predicts the statistic value of the preset attribute value combination based on the target combination statistic value and the target attribute statistic value set to obtain a predicted statistic value. It queries the attribute network for an inclusion relationship between the candidate attribute subsets. When no inclusion relationship exists, it acquires a labeling statistic value corresponding to the preset attribute value combination, compares the labeling statistic value with the predicted statistic value to obtain a prediction loss corresponding to each candidate attribute subset, and screens the candidate attribute subsets according to the prediction loss. When an inclusion relationship exists, it classifies the candidate attribute subsets based on the inclusion relationship to obtain a first attribute subset and a second attribute subset, wherein the first attribute subset is contained in the second attribute subset, determines the error relationship to be that the prediction error of the second attribute subset is smaller than that of the first attribute subset, and screens out the target attribute subset from the candidate attribute subsets based on the error relationship. It then adjusts the attribute data according to the data tag of the target attribute subset to obtain a target data set. When there is no correlation between the attributes, the electronic device counts the current statistical value of the attribute value corresponding to each attribute in the data set to be processed, obtains a candidate attribute value combination including a plurality of candidate attribute values, extracts from the current statistical values the first statistical value of each candidate attribute value and the second statistical value of the attribute corresponding to the candidate attribute value, determines the statistical value ratio between the first statistical value and the second statistical value, fuses the current statistical values to obtain a current fused statistical value, fuses the current fused statistical value with the statistical value ratio to obtain the statistical value corresponding to the candidate attribute value combination, and adjusts the attribute data based on the statistical value corresponding to the candidate attribute value combination to obtain the target data set.

For the specific implementation of each of the above operations, refer to the previous embodiments, which are not repeated here.

Those of ordinary skill in the art will appreciate that all or a portion of the steps of the various methods of the above embodiments may be performed by instructions, or by instructions controlling associated hardware, which may be stored in a computer-readable storage medium and loaded and executed by a processor.

To this end, an embodiment of the present invention provides a computer readable storage medium having stored therein a plurality of instructions capable of being loaded by a processor to perform the steps of any of the data processing methods provided by the embodiments of the present invention. For example, the instructions may perform the steps of:

The specific implementation of each operation above may be referred to the previous embodiments, and will not be described herein.

The computer-readable storage medium may include: a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disk, and the like.

Because the instructions stored in the computer-readable storage medium can execute the steps in any data processing method provided by the embodiments of the present application, they can achieve the beneficial effects that any data processing method provided by the embodiments of the present application can achieve, which are detailed in the previous embodiments and are not repeated here.

According to an aspect of the application, a computer program product or a computer program is provided, the computer program product or computer program comprising computer instructions stored in a computer-readable storage medium. A processor of a computer device reads the computer instructions from the computer-readable storage medium and executes them, causing the computer device to perform the methods provided in the various alternative implementations of the data processing aspect or the data deviation elimination aspect described above.

The foregoing has described in detail a data processing method, apparatus, electronic device and computer readable storage medium according to embodiments of the present invention, and specific examples have been applied to illustrate the principles and embodiments of the present invention, where the foregoing examples are provided to assist in understanding the method and core idea of the present invention; meanwhile, as those skilled in the art will have variations in the specific embodiments and application scope in light of the ideas of the present invention, the present description should not be construed as limiting the present invention.

Claims

1. A method of data processing, comprising:

2. The data processing method according to claim 1, wherein the counting, based on the attribute network, statistical values of attribute value combinations corresponding to the nodes in the data set to be processed to obtain the data label of at least one candidate attribute subset of the attribute subsets comprises:

sorting each attribute in the attribute set, and adding each attribute as an element to an initial query queue based on a sorting result to obtain a query queue;

screening at least one candidate node from the nodes according to the attribute network and the query queue;

and counting candidate statistic values of attribute value combinations corresponding to the candidate nodes in the data set to be processed to obtain data labels of at least one candidate attribute subset in the attribute subsets.

3. The data processing method according to claim 2, wherein the screening at least one candidate node among the nodes according to the attribute network and the query queue comprises:

screening out a target element from the query queue based on the sorting result;

identifying a target node corresponding to the target element in the attribute network;

traversing at least one sub-node corresponding to the target node in the attribute network to obtain the candidate node, wherein the attribute subset corresponding to the target node is contained in the attribute subset corresponding to the sub-node.
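
The node lookup and sub-node traversal recited above can be sketched as follows. This is a minimal illustration only: it assumes the attribute network is a subset lattice in which each sub-node adds exactly one attribute to its parent, and the names `build_attribute_network` and `candidate_nodes` are hypothetical and do not appear in the specification.

```python
from itertools import combinations

def build_attribute_network(attributes):
    """Map every non-empty attribute subset to its sub-nodes.

    Assumption: a sub-node is a subset with exactly one more attribute,
    so the parent's attribute subset is contained in the sub-node's.
    """
    nodes = [frozenset(c)
             for r in range(1, len(attributes) + 1)
             for c in combinations(attributes, r)]
    return {n: [n | {a} for a in attributes if a not in n] for n in nodes}

def candidate_nodes(network, target_element):
    """Identify the target node and return its sub-nodes as candidate nodes."""
    target_node = frozenset(target_element)
    return network[target_node]

# Example attributes are illustrative, not taken from the specification.
network = build_attribute_network(["age", "city", "sex"])
children = candidate_nodes(network, ["age"])
```

Under these assumptions, the candidate nodes for the target element `age` are the nodes `{age, city}` and `{age, sex}`, each of which contains the target node's attribute subset.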

4. The data processing method according to claim 2, wherein the counting candidate statistical values of attribute value combinations corresponding to the candidate nodes in the data set to be processed to obtain the data label of at least one candidate attribute subset of the attribute subsets comprises:

counting candidate statistical values of attribute value combinations corresponding to the candidate nodes in the data set to be processed;

screening at least one candidate attribute subset from the attribute subsets based on the candidate statistics;

and determining the data label of the candidate attribute subset according to the statistical value corresponding to the candidate attribute subset.

5. The data processing method of claim 4, wherein the screening at least one candidate attribute subset from among the attribute subsets based on the candidate statistics comprises:

screening, from the attribute subsets, at least one attribute subset whose number of candidate statistical values does not exceed a preset number threshold, to obtain an initial candidate attribute subset;

adding the initial candidate attribute subset as an element to the query queue, and deleting the target element in the query queue to obtain an updated query queue;

and taking the updated query queue as the query queue, and returning to the step of screening at least one candidate node from the nodes according to the attribute network and the query queue, until no further initial candidate attribute subset exists, and taking the initial candidate attribute subsets obtained as the candidate attribute subsets.
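
The iterative screening described above can be sketched as a queue-driven search. This is a hedged sketch, not the claimed implementation: it assumes rows are dictionaries, that the "number of candidate statistical values" of a subset is the number of distinct attribute value combinations observed for it, and that sub-nodes add one attribute at a time. The names `count_combinations`, `search_candidate_subsets` and `max_size` are hypothetical.

```python
from collections import deque

def count_combinations(rows, subset):
    """Count how often each attribute value combination of `subset` occurs."""
    cols = sorted(subset)
    counts = {}
    for row in rows:
        key = tuple(row[c] for c in cols)
        counts[key] = counts.get(key, 0) + 1
    return counts

def search_candidate_subsets(rows, attributes, max_size):
    """Collect subsets whose number of combination counts is <= max_size."""
    # Initial query queue: one element per attribute, in sorted order.
    queue = deque(frozenset([a]) for a in sorted(attributes))
    seen = set(queue)
    candidates = []
    while queue:
        target = queue.popleft()  # remove the target element from the queue
        for a in attributes:
            child = target | {a}
            if a in target or child in seen:
                continue
            seen.add(child)
            if len(count_combinations(rows, child)) <= max_size:
                candidates.append(child)
                queue.append(child)  # re-queue so its sub-nodes are visited
    return candidates

# Example rows are illustrative, not taken from the specification.
rows = [
    {"sex": "F", "city": "X", "age": "young"},
    {"sex": "F", "city": "X", "age": "old"},
    {"sex": "M", "city": "Y", "age": "young"},
    {"sex": "M", "city": "Y", "age": "old"},
]
found = search_candidate_subsets(rows, ["sex", "city", "age"], max_size=2)
```

The loop ends when the queue is empty, that is, when no newly screened subset remains to be expanded. Subsets that exceed the threshold are not expanded further, because adding an attribute cannot reduce the number of distinct combinations.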

6. The method according to claim 4, wherein the determining the data label of the candidate attribute subset according to the statistical value corresponding to the candidate attribute subset comprises:

screening out a combination statistical value corresponding to each attribute value combination in the candidate attribute subset from the candidate statistical values to obtain a combination statistical value set;

counting attribute statistics values of attribute values corresponding to each attribute in the data set to be processed to obtain an attribute statistics value set;

and constructing the data label of the candidate attribute subset based on the combination statistical value set and the attribute statistical value set.
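
One possible shape for such a data label is sketched below, assuming the statistical values are simple occurrence counts and that rows are non-empty dictionaries sharing the same keys. The function name `build_data_label` and the dictionary layout are assumptions made for illustration.

```python
from collections import Counter

def build_data_label(rows, subset):
    """Build a label from combination counts plus per-attribute value counts."""
    cols = sorted(subset)
    # Combination statistical value set: one count per value combination.
    combo_counts = Counter(tuple(r[c] for c in cols) for r in rows)
    # Attribute statistical value set: value counts for every attribute.
    attr_counts = {a: Counter(r[a] for r in rows) for a in rows[0]}
    return {"attributes": cols,
            "combo_counts": combo_counts,
            "attr_counts": attr_counts}

# Example rows are illustrative, not taken from the specification.
rows = [
    {"sex": "F", "city": "X", "age": "young"},
    {"sex": "F", "city": "X", "age": "old"},
    {"sex": "M", "city": "Y", "age": "young"},
    {"sex": "M", "city": "Y", "age": "old"},
]
label = build_data_label(rows, {"sex", "city"})
```

Here the label for the subset `{city, sex}` stores two combination counts and the value counts of all three attributes, which is much smaller than storing the count of every full combination.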

7. The method according to any one of claims 1 to 6, wherein the predicting, based on the data label, a statistical value of a preset attribute value combination in the data set to be processed comprises:

determining a target attribute value combination and a target attribute value corresponding to the candidate attribute subset based on the preset attribute value combination, wherein the target attribute value comprises at least one attribute value in the preset attribute value combination other than the attribute values in the target attribute value combination;

extracting, from the data label, a target combination statistical value corresponding to the target attribute value combination and a target attribute statistical value set corresponding to the target attribute value;

and predicting the statistical value of the preset attribute value combination based on the target combination statistical value and the target attribute statistical value set to obtain the prediction statistical value.

8. The method of claim 7, wherein determining the target attribute value combination and the target attribute value corresponding to the candidate attribute subset based on the preset attribute value combination comprises:

comparing the preset attribute value combination with the attribute value combinations corresponding to the candidate attribute subset;

screening, from the attribute value combinations and based on a comparison result, a target attribute value combination that contains attribute values of the preset attribute value combination;

and screening at least one attribute value except the attribute value in the target attribute value combination from the preset attribute value combination to obtain a target attribute value.

9. The method according to claim 7, wherein the predicting the statistical value of the preset attribute value combination based on the target combination statistical value and the target attribute statistical value set to obtain the prediction statistical value comprises:

screening out the statistical value corresponding to the target attribute value from the target attribute statistical value set to obtain a target attribute statistical value;

fusing the statistical values in the target attribute statistical value set to obtain a fused attribute statistical value corresponding to the target attribute value;

and determining a ratio between the target attribute statistical value and the corresponding fused attribute statistical value, and fusing the ratio with the target combination statistical value to obtain the prediction statistical value.
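
The prediction step can be illustrated with a short sketch. It assumes that "fusing" the attribute statistical values means summing them and that "fusing" the ratio with the combination statistical value means multiplying, which the claim does not state explicitly. The label layout and the name `predict_count` are hypothetical.

```python
def predict_count(label, pattern):
    """Estimate the count of `pattern` from a data label.

    Assumption: estimate = combination count * product of
    (value count / total count) over attributes outside the labelled subset.
    """
    cols = label["attributes"]
    key = tuple(pattern[c] for c in cols)
    estimate = float(label["combo_counts"].get(key, 0))
    for attr, value in pattern.items():
        if attr in cols:
            continue  # already covered by the target combination count
        counts = label["attr_counts"][attr]
        fused = sum(counts.values())            # fused attribute statistic
        estimate *= counts.get(value, 0) / fused  # ratio times combination count
    return estimate

# Illustrative label for the subset {city, sex}; values are made up.
label = {
    "attributes": ["city", "sex"],
    "combo_counts": {("X", "F"): 2, ("Y", "M"): 2},
    "attr_counts": {"age": {"young": 2, "old": 2}},
}
estimate = predict_count(label, {"city": "X", "sex": "F", "age": "young"})
```

With these example values the estimate is 2 × (2 / 4) = 1.0: the labelled combination `(X, F)` occurs twice, and half of all rows have `age = young`.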

10. The method according to any one of claims 1 to 6, wherein the screening out a target attribute subset from the candidate attribute subsets based on the prediction statistics comprises:

querying the attribute network for inclusion relationships between candidate attribute subsets;

when the inclusion relationship does not exist, obtaining a labeling statistical value corresponding to the preset attribute value combination;

comparing the labeling statistical value with the prediction statistical value to obtain a prediction loss corresponding to the candidate attribute subset;

and screening the target attribute subset from the candidate attribute subset according to the prediction loss.

11. The method according to claim 10, wherein comparing the labeling statistic with the prediction statistic to obtain the prediction loss corresponding to the candidate attribute subset comprises:

taking the labeling statistical value and the prediction statistical value in turn as the numerator, and determining the ratios between the labeling statistical value and the prediction statistical value to obtain a ratio pair;

screening out target ratio values in the ratio pairs to obtain prediction errors corresponding to the candidate attribute subsets;

and fusing the prediction errors to obtain the prediction loss corresponding to the candidate attribute subset.
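
A minimal sketch of this loss computation follows. The claim does not say which ratio in the pair is the target ratio or how the errors are fused; the sketch assumes the larger ratio is selected and that the errors are fused by taking their maximum. The handling of zero values and the function names are likewise assumptions.

```python
def prediction_error(labeled, predicted):
    """Return the larger of labeled/predicted and predicted/labeled."""
    if labeled == predicted:
        return 1.0  # a perfect prediction, including the 0 == 0 case
    if labeled <= 0 or predicted <= 0:
        return float("inf")  # assumption: a zero on one side is unbounded error
    ratio_pair = (labeled / predicted, predicted / labeled)
    return max(ratio_pair)  # assumption: the target ratio is the larger one

def prediction_loss(pairs):
    """Fuse per-combination errors; here the worst case is taken (assumption)."""
    return max(prediction_error(l, p) for l, p in pairs)

# (labeling statistical value, prediction statistical value) pairs, made up.
loss = prediction_loss([(10, 5), (4, 4), (3, 12)])
```

Taking the larger ratio makes the error symmetric: predicting 5 for a true count of 10 and predicting 10 for a true count of 5 both give an error of 2.0.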

12. The data processing method according to claim 10, wherein after querying the attribute network for inclusion relationships between candidate attribute subsets, further comprising:

when the inclusion relationship exists, classifying the candidate attribute subsets based on the inclusion relationship to obtain a first attribute subset and a second attribute subset, wherein the first attribute subset is included in the second attribute subset;

determining, as an error relationship, that the prediction error of the second attribute subset is less than the prediction error of the first attribute subset;

and screening the target attribute subset from the candidate attribute subsets based on the error relation.

13. The data processing method according to claim 1, wherein after the acquisition of the data set to be processed, further comprising:

determining attribute relationships between attributes in the attribute set based on attribute types of the attributes;

when no correlation exists between the attributes, counting the current statistical value of the attribute value corresponding to each attribute in the data set to be processed, and adjusting the attribute data based on the current statistical value to obtain the target data set;

and the constructing an attribute network by taking the attribute subset in the attribute set as a node comprises: when a correlation exists between the attributes, constructing the attribute network by taking the attribute subset in the attribute set as a node.

14. The data processing method according to claim 13, wherein said adjusting the attribute data based on the current statistics to obtain the target data set includes:

obtaining a candidate attribute value combination, wherein the candidate attribute value combination comprises a plurality of candidate attribute values;

extracting a first statistical value of the candidate attribute value and a second statistical value of the attribute corresponding to the candidate attribute value from the current statistical value;

determining a statistic value ratio between the first statistic value and the second statistic value, and fusing the current statistic value to obtain a current fused statistic value;

and fusing the current fusion statistic value with the statistic value ratio to obtain a statistic value corresponding to the candidate attribute value combination, and adjusting the attribute data based on the statistic value corresponding to the candidate attribute value combination to obtain the target data set.
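
For the case in which no correlation exists between the attributes, the estimate above can be sketched as follows. The sketch assumes that fusing the current statistical values yields the total row count and that the statistic value ratios are multiplied into it. It covers only the estimation of the statistical value, not the subsequent adjustment of the attribute data, and the name `estimate_independent` is hypothetical.

```python
from collections import Counter

def estimate_independent(rows, pattern):
    """Estimate a combination count assuming independent attributes."""
    estimate = float(len(rows))  # current fused statistic: total row count
    for attr, value in pattern.items():
        counts = Counter(r[attr] for r in rows)
        first = counts[value]          # count of the candidate attribute value
        second = sum(counts.values())  # count over the whole attribute
        estimate *= first / second     # fuse the ratio into the estimate
    return estimate

# Example rows are illustrative, not taken from the specification.
rows = [
    {"sex": "F", "age": "young"},
    {"sex": "F", "age": "old"},
    {"sex": "M", "age": "young"},
    {"sex": "M", "age": "old"},
]
estimate = estimate_independent(rows, {"sex": "F", "age": "young"})
```

With these rows the estimate is 4 × (2 / 4) × (2 / 4) = 1.0, which matches the actual count because the two attributes are independent in the example data.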

15. A data processing apparatus, comprising:

16. An electronic device comprising a processor and a memory, the memory storing an application, the processor being configured to run the application in the memory to perform the steps in the data processing method of any of claims 1 to 14.

17. A computer program product comprising computer programs/instructions which, when executed by a processor, implement the steps of the data processing method of any of claims 1 to 14.

18. A computer readable storage medium storing a plurality of instructions adapted to be loaded by a processor to perform the steps of the data processing method of any of claims 1 to 14.