CN115686597A - Data processing method and device, electronic equipment and storage medium - Google Patents

Data processing method and device, electronic equipment and storage medium

Info

Publication number
CN115686597A
Authority
CN
China
Prior art keywords
item set
frequent item
json
data
determining
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110836606.9A
Other languages
Chinese (zh)
Inventor
尚保林
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sangfor Technologies Co Ltd
Original Assignee
Sangfor Technologies Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sangfor Technologies Co Ltd filed Critical Sangfor Technologies Co Ltd
Priority to CN202110836606.9A priority Critical patent/CN115686597A/en
Publication of CN115686597A publication Critical patent/CN115686597A/en
Pending legal-status Critical Current

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The embodiment of the invention is suitable for the technical field of computers, and provides a data processing method, a data processing device, electronic equipment and a storage medium, wherein the data processing method comprises the following steps: determining at least one frequent item set based on hierarchical data of each of the at least two Json data; each frequent item set in the at least one frequent item set corresponds to one of at least two levels corresponding to the at least two Json data; representing value sets of fields which appear in corresponding levels at the same time by a frequent item set; the hierarchical data represents fields and values of each hierarchy of corresponding Json data; determining the dependency relationship among fields in each frequent item set of at least one frequent item set; and generating a Json template based on the determined dependency relationship, and verifying the Json data to be verified based on the Json template.

Description

Data processing method and device, electronic equipment and storage medium
Technical Field
The present invention relates to the field of computer technologies, and in particular, to a data processing method and apparatus, an electronic device, and a storage medium.
Background
JavaScript Object Notation (Json) is a lightweight data exchange format, and Json templates can be used to verify Json data. A Json template generated by the related technology contains only information such as data types. Before using the Json template, a user must determine the dependency relationships among the fields in the Json data and fill them into the Json template manually. If the user is not skilled enough, the determined dependency relationships may be inaccurate, and the resulting Json template does not help the user check the Json data.
Disclosure of Invention
In order to solve the above problem, embodiments of the present invention provide a data processing method, an apparatus, an electronic device, and a storage medium, so as to at least solve a problem that a Json template generated in a related technology is not beneficial for a user to verify Json data.
The technical scheme of the invention is realized as follows:
in a first aspect, an embodiment of the present invention provides a data processing method, where the method includes:
determining at least one frequent item set based on hierarchical data of each of the at least two Json data; each frequent item set in the at least one frequent item set corresponds to one of at least two levels corresponding to the at least two Json data; the frequent item set represents a value set of fields which appear in a corresponding hierarchy at the same time; the hierarchical data represents fields and values of each hierarchy of corresponding Json data;
determining a dependency relationship between fields in each frequent item set of the at least one frequent item set;
generating a Json template based on the determined dependency relationship;
and checking the Json data to be checked based on the Json template.
In the foregoing solution, the generating a Json template based on the determined dependency includes:
and writing the dependency relationship into a set field of the Json template.
In the above solution, the determining at least one frequent item set based on hierarchical data of each of at least two Json data includes:
determining at least two candidate sets for each of the at least two tiers;
calculating a support degree of each candidate set in the at least two candidate sets; the support degree characterizes the frequency of occurrence of the corresponding candidate set in the at least two Json data;
determining a frequent item set for each of the at least two tiers based on the support.
In the foregoing solution, the determining a dependency relationship between fields in each frequent item set of the at least one frequent item set includes:
calculating information entropy gains corresponding to each field in each frequent item set of the at least one frequent item set;
determining the dependency based on the information entropy gain.
In the foregoing solution, the calculating an information entropy gain corresponding to each field in each frequent item set of the at least one frequent item set includes:
calculating the information entropy of each frequent item set of the at least one frequent item set;
calculating the conditional entropy corresponding to each field in each frequent item set of the at least one frequent item set;
and determining the information entropy gain corresponding to each field in the corresponding frequent item set based on the information entropy and the conditional entropy.
In the foregoing solution, determining the dependency relationship based on the information entropy gain includes:
and determining a field with the maximum information entropy gain in the frequent item set as a depended party in the dependency relationship.
In the foregoing solution, before determining at least one frequent item set, the method further includes:
and flattening each Json data of the at least two Json data to obtain hierarchical data of each Json data of the at least two Json data.
In a second aspect, an embodiment of the present invention provides a data processing apparatus, including:
the first determining module is used for determining at least one frequent item set based on the hierarchical data of each Json data in at least two Json data; each frequent item set in the at least one frequent item set corresponds to one of at least two levels corresponding to the at least two Json data; the frequent item set represents a value set of fields which appear in a corresponding hierarchy at the same time; the hierarchical data represents fields and values of each hierarchy of corresponding Json data;
a second determining module, configured to determine a dependency relationship between fields in each frequent item set of the at least one frequent item set;
the generating module is used for generating a Json template based on the determined dependency relationship;
and the checking module is used for checking the Json data to be checked based on the Json template.
In a third aspect, an embodiment of the present invention provides an electronic device, including a processor and a memory, where the processor and the memory are connected to each other, where the memory is used to store a computer program, and the computer program includes program instructions, and the processor is configured to call the program instructions to execute the steps of the data processing method provided in the first aspect of the embodiment of the present invention.
In a fourth aspect, an embodiment of the present invention provides a computer-readable storage medium, where the computer-readable storage medium stores a computer program which, when executed by a processor, performs the steps of the data processing method provided in the first aspect of the embodiment of the present invention.
The embodiment of the present invention determines at least one frequent item set from the hierarchical data of each Json data in at least two Json data, determines the dependency relationship among the fields in each frequent item set of the at least one frequent item set, generates a Json template based on the determined dependency relationship, and verifies the Json data to be verified based on the Json template. Each frequent item set in the at least one frequent item set corresponds to one of at least two levels corresponding to the at least two Json data, the frequent item set represents a value set of fields that appear together in the corresponding level, and the hierarchical data represents the fields and values of each level of the corresponding Json data. Because the dependency relationship among the fields is determined through the frequent item sets and the Json template is generated from it, the user does not need to determine and write the dependency relationship manually, as the related technology requires. The determined dependency relationship is more accurate, so the accuracy of the verification result can be improved when the Json template is used to verify the Json data to be verified.
Drawings
Fig. 1 is a schematic flow chart illustrating an implementation of a data processing method according to an embodiment of the present invention;
Fig. 2 is a schematic flow chart of another implementation of a data processing method according to an embodiment of the present invention;
Fig. 3 is a schematic flow chart of another implementation of a data processing method according to an embodiment of the present invention;
Fig. 4 is a schematic flow chart illustrating another implementation of a data processing method according to an embodiment of the present invention;
Fig. 5 is a flowchart of generating a Json schema according to an embodiment of the present invention;
Fig. 6 is a diagram of a data processing apparatus according to an embodiment of the present invention;
Fig. 7 is a schematic diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Data validity checking is an important protection mechanism for internet and IT services. It protects data security and helps keep servers secure, whereas a lack of data validity checking can cause many security problems, such as command injection and service downtime. Data validity checking is therefore particularly important for security in production.
The Json template may be used to check Json data, for example, json schema is a Json template, and the embodiment of the present invention is described by taking Json schema as an example.
The industry generally uses Json templates to check the validity of input data, so generating a high-quality Json template is a key step in data checking. In the related technology, a Json template generation tool parses the Json data it reads to obtain the data types present, maps them to the data types of the Json template, and finally obtains the Json template.
The Json template generated by the related technology is very rough. Before the Json template is used for checking Json data, a user needs to fill in a lot of additional information about the Json data to be checked, such as the dependency relationships among fields of the Json data, naming rules, and data sizes. For a user who is not familiar with this data, determining the dependency relationships and writing them into the Json template is difficult. Moreover, a manually determined dependency relationship may be inaccurate, which makes the generated Json template inaccurate, so that the verification result obtained when the user checks Json data with the Json template is also inaccurate.
In view of the foregoing disadvantages of the related art, embodiments of the present invention provide a data processing method, which can at least improve accuracy of a verification result of checking Json data using a Json template. In order to illustrate the technical means of the present invention, the following description is given by way of specific examples.
Fig. 1 is a schematic flow chart illustrating an implementation process of a data processing method according to an embodiment of the present invention, where an execution subject of the data processing method is an electronic device, and the electronic device includes a desktop computer, a notebook computer, a server, and the like. Referring to fig. 1, the data processing method includes:
S101, determining at least one frequent item set based on hierarchical data of each Json data in at least two Json data; each frequent item set in the at least one frequent item set corresponds to one of at least two levels corresponding to the at least two Json data; the frequent item set represents a value set of fields which appear in a corresponding hierarchy at the same time; the hierarchical data characterizes fields and values of each hierarchy of corresponding Json data.
Here, the at least two Json data are the basis for generating a Json template according to the embodiment of the present invention, and each Json data has a hierarchical structure. For example, one Json data is { "opr": "list", "module": "root/", "filter": { "start": "sss", "cf": "test", "success": "unkown" } }. This Json data comprises two levels, namely [ "opr", "module", "filter" ] and [ "start", "cf", "success" ].
The at least two Json data correspond to at least two levels in total, and the number of levels equals the number of levels of the most deeply nested Json data among the at least two Json data.
A frequent item set is determined for each level. A frequent item set is an item set that appears frequently in a data set, and an item set is a set of several items. In the embodiment of the present invention, at least one frequent item set may be mined for each level of the at least two Json data. Here, a frequent item set is a set of fields, together with the values of those fields, that appear together in one level. For example, if the first level of one Json data is [ "opr", "module", "filter" ] and the first level of another Json data is [ "opr", "module" ], then the frequent item set of the first level is [ "opr", "module" ].
In an embodiment, prior to determining the at least one frequent item set, the method further comprises:
and flattening each Json data in the at least two Json data to obtain the hierarchical data of each Json data in the at least two Json data.
The Json data has the nested and layered characteristic, some Json data have a deep hierarchical structure, and in order to acquire detailed information of each layer of the Json data, the Json data are subjected to flattening processing, so that the deep-level data are exposed. The flattening processing can obtain detailed information of each level of the Json data, including obtaining fields in each level and values corresponding to the fields.
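The flattening step can be sketched as follows. This is a minimal illustration and not the embodiment's actual implementation: the function name `flatten_by_level` is hypothetical, arrays are treated as plain values, and fields with the same name at the same depth overwrite one another.

```python
from collections import defaultdict


def flatten_by_level(doc, level=1, levels=None):
    """Collect the fields and values found at each nesting level of a Json object."""
    if levels is None:
        levels = defaultdict(dict)
    for key, value in doc.items():
        if isinstance(value, dict):
            # Nested object: record the field at this level, then descend one level.
            levels[level][key] = None
            flatten_by_level(value, level + 1, levels)
        else:
            # Leaf value (arrays are kept as-is in this simplified sketch).
            levels[level][key] = value
    return dict(levels)


# Example object from the description, with "filter" assumed to be the nested object.
sample = {
    "opr": "list",
    "module": "root/",
    "filter": {"start": "sss", "cf": "test", "success": "unkown"},
}
hierarchical = flatten_by_level(sample)
print(list(hierarchical[1]))  # ['opr', 'module', 'filter']
print(list(hierarchical[2]))  # ['start', 'cf', 'success']
```

Keying the result by depth keeps the data of each level separate, which is the form the frequent item set mining of each level operates on.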
Referring to fig. 2, in an embodiment, determining at least one set of frequent items based on hierarchical data of each of at least two Json data includes:
S201, determining at least two candidate sets of each of the at least two levels.
The candidate sets are used to obtain the frequent item set: candidate sets that meet the support degree condition are retained, and those that do not are discarded.
Many relatively mature algorithms exist for frequent item set mining, and the Apriori algorithm can be used to mine the frequent item sets in the embodiment of the present invention. The Apriori algorithm is a frequent item set mining algorithm for association rules, that is, it mines combinations of items that appear together in the transactions with high probability. An association rule expresses the probability of deriving one frequent item set given another frequent item set. Based on prior knowledge of the properties of frequent item sets, the Apriori algorithm searches iteratively, level by level from the bottom up, generating the candidate 1-item sets through the candidate K-item sets in turn.
The Apriori algorithm uses two prior principles: 1. if an item set is not a frequent item set, then none of its supersets is a frequent item set; 2. if an item set is a frequent item set, then every subset of it is also a frequent item set. Using these two principles greatly reduces the number of candidate sets.
S202, calculating the support degree of each candidate set in the at least two candidate sets; the support degree characterizes a frequency of occurrence of the corresponding candidate set in the at least two Json data.
The support degree refers to the frequency with which a given item set occurs across all transactions. In the embodiment of the present invention, it refers to the frequency, or the number of occurrences, of a candidate set in the at least two Json data.
S203, determining a frequent item set of each level of the at least two levels based on the support degree.
Here, candidate sets whose support degree is smaller than the minimum support degree may be deleted, and the remaining candidate sets are determined as frequent item sets. In practical applications, usually only one frequent item set is retained, and the candidate set with the highest support degree can be determined as the frequent item set.
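Steps S201 to S203 can be sketched with a compact Apriori loop. This is an illustrative sketch only: the function names, the sample transactions, and the threshold of 0.75 are assumptions, and each transaction is taken to be the set of field names at one level of one Json data (field values are left out for brevity).

```python
from itertools import combinations


def support(candidate, transactions):
    """Fraction of transactions that contain every item of the candidate set."""
    hits = sum(1 for t in transactions if candidate <= t)
    return hits / len(transactions)


def apriori(transactions, min_support):
    """Return {item set: support} for every item set meeting min_support."""
    transactions = [frozenset(t) for t in transactions]
    current = {frozenset([item]) for t in transactions for item in t}
    frequent = {}
    k = 1
    while current:
        # Keep the candidate k-item sets that meet the minimum support degree.
        kept = {}
        for candidate in current:
            s = support(candidate, transactions)
            if s >= min_support:
                kept[candidate] = s
        frequent.update(kept)
        # Join step: build candidate (k+1)-item sets from the frequent k-item sets.
        k += 1
        joined = {a | b for a in kept for b in kept if len(a | b) == k}
        # Prune step: drop any candidate that has an infrequent (k-1)-item subset.
        current = {
            c for c in joined
            if all(frozenset(s) in kept for s in combinations(c, k - 1))
        }
    return frequent


# Hypothetical first-level field sets of four Json data.
level_one = [
    {"opr", "module", "filter"},
    {"opr", "module"},
    {"opr", "module", "filter"},
    {"opr", "cf"},
]
result = apriori(level_one, min_support=0.75)
largest = max(result, key=len)
print(sorted(largest), result[largest])  # ['module', 'opr'] 0.75
```

The prune step applies the first prior principle: a candidate with any infrequent subset cannot itself be frequent, so its support degree is never computed.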
S102, determining the dependency relationship among the fields in each frequent item set of the at least one frequent item set.
Here, a dependency relationship means that the values of fields in the frequent item set depend on one another. For example, if a frequent item set includes a field A and a field B, and the value of field A depends on the value of field B, then there is a dependency relationship between field A and field B, in which field A is the dependent party and field B is the depended party.
The dependency relationship may refer only to an ordering of values; for example, the value of field A depending on the value of field B may mean that the value of field B is greater than the value of field A. The dependency relationship may also mean that the values of different fields satisfy a functional relationship; for example, the value of field B is a multiple of the value of field A.
Assuming that the value of B depends on the value of A, the principle of determining the dependency relationship from the information entropy gain is that when the data set is partitioned according to the value of A, the information entropy computed over the resulting partitions is lower, that is, the information entropy gain is larger. Therefore, the data set is partitioned by A and by B in turn, the information entropy gain of each partition is calculated, and the partition with the maximum gain gives the dependency relationship.
Referring to fig. 3, in an embodiment, the determining a dependency relationship between fields in each frequent item set of the at least one frequent item set includes:
S301, calculating information entropy gain corresponding to each field in each frequent item set of the at least one frequent item set.
Information entropy is computed from the probabilities of occurrence of specific information (the probabilities of discrete random events) and measures uncertainty: the larger the entropy, the larger the uncertainty.
Conditional entropy refers to the uncertainty of a random variable under a condition.
The information entropy gain is equal to the information entropy minus the conditional entropy, which represents the degree to which the information uncertainty decreases under a condition.
Referring to fig. 4, in an embodiment, the calculating an information entropy gain corresponding to each field in each frequent item set of the at least one frequent item set includes:
S401, calculating the information entropy of each frequent item set of the at least one frequent item set.
The information entropy of the frequent item set is calculated according to the following formula:
$$\mathrm{Ent}(D) = -\sum_{k} P_k \log_2 P_k$$
where D is the complete data set, i.e. all fields and corresponding values in the frequent item set, and $P_k$ is the proportion of the k-th value combination in D.
For example, the frequent item set includes fields A, B, and C, where A takes values a, B takes values b, and C takes values c. Assuming that the value combination (a1, b1, c1) occurs in 3 groups in the frequent item set, and the frequent item set includes 17 groups of values in total, then $P_k$ of this value combination is 3/17.
S402, calculating the conditional entropy corresponding to each field in each frequent item set of the at least one frequent item set.
For each frequent item set, a field is selected and D is divided by the values of that field into sub-data sets $D_v$; the information entropy of each sub-data set is calculated, and their weighted sum gives the conditional entropy.
For example, assuming that the frequent item set includes fields A, B, and C, the value combinations of the frequent item set include 3 groups of (a1, b1, c1), 10 groups of (a1, b2, c1), and 4 groups of (a2, b1, c2).
If D is the above 17 value combinations, the information entropy is:
$$\mathrm{Ent}(D) = -\left(\frac{3}{17}\log_2\frac{3}{17} + \frac{10}{17}\log_2\frac{10}{17} + \frac{4}{17}\log_2\frac{4}{17}\right)$$
The conditional entropy is then calculated with A, B, and C each taken as the first node, and the node with the maximum gain is selected as the root node.
For example, the conditional entropy corresponding to the B field is:
$$\mathrm{Ent}(Y \mid B) = \frac{7}{17}\,\mathrm{Ent}(D_{b1}) + \frac{10}{17}\,\mathrm{Ent}(D_{b2}) = \frac{7}{17}\left(-\frac{3}{7}\log_2\frac{3}{7} - \frac{4}{7}\log_2\frac{4}{7}\right) + \frac{10}{17}\cdot 0$$
Similarly, the conditional entropy corresponding to the A and C fields can be calculated.
S403, determining an information entropy gain corresponding to each field in the corresponding frequent item set based on the information entropy and the conditional entropy.
Continuing with the above example, the information entropy gain corresponding to the B field is Ent(D) - Ent(Y | B), and the information entropy gain corresponding to the A field is Ent(D) - Ent(Y | A).
By analogy, the information entropy gain corresponding to each field can be obtained.
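The calculation in S401 to S403 can be reproduced for the 17-group example with a short script. This is a sketch under assumptions: a base-2 logarithm is assumed, the function names are hypothetical, and each row is one value combination (A, B, C).

```python
from collections import Counter
from math import log2


def entropy(rows):
    """Information entropy of the value combinations in rows."""
    total = len(rows)
    return -sum((n / total) * log2(n / total) for n in Counter(rows).values())


def conditional_entropy(rows, index):
    """Weighted sum of the entropies of the sub-data sets split on one field."""
    groups = {}
    for row in rows:
        groups.setdefault(row[index], []).append(row)
    total = len(rows)
    return sum(len(g) / total * entropy(g) for g in groups.values())


def information_gain(rows, index):
    """Information entropy minus the conditional entropy for one field."""
    return entropy(rows) - conditional_entropy(rows, index)


# The value combinations from the example: 3 + 10 + 4 = 17 groups.
rows = (
    [("a1", "b1", "c1")] * 3
    + [("a1", "b2", "c1")] * 10
    + [("a2", "b1", "c2")] * 4
)
gains = {name: information_gain(rows, i) for i, name in enumerate("ABC")}
print(round(entropy(rows), 3))
print({name: round(g, 3) for name, g in gains.items()})
```

With these counts the entropy of D is about 1.383, the gain for B is about 0.977, and the gains for A and C are both about 0.787, so B has the largest information entropy gain in this example.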
S302, determining the dependency relationship based on the information entropy gain.
Based on the above example, the node with the largest information entropy gain is selected as the partition node; assuming that the B field is selected, A and C are considered to depend on B. The dependency relationship between A and C can also be calculated by the above method, but in practical applications it is usually unnecessary to mine overly complex dependency relationships, because they are not necessarily accurate.
In an embodiment, determining the dependency based on the information entropy gain comprises:
and determining a field with the maximum information entropy gain in the frequent item set as a depended party in the dependency relationship.
For example, in the above example, if the information entropy gain corresponding to the B field is the largest, the B field is determined as the depended party, and a and C depend on B.
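Selecting the depended party from the gains can be sketched as below. The function names and the dictionary layout are hypothetical; the text rendering imitates the arrow notation that appears in the JsonSchema title later in this description, and the gain values are the ones quoted there.

```python
def build_dependency(gains):
    """Pick the field with the largest gain as the depended party."""
    depended = max(gains, key=gains.get)
    # The remaining fields depend on it; list them by decreasing gain.
    dependents = sorted(
        (field for field in gains if field != depended),
        key=gains.get,
        reverse=True,
    )
    return {"depended": depended, "dependents": dependents}


def format_dependency(gains):
    """Render the relation as 'depended'(gain)<-['dependent'(gain),...]."""
    dep = build_dependency(gains)
    head = f"'{dep['depended']}'({gains[dep['depended']]})"
    tail = ",".join(f"'{f}'({gains[f]})" for f in dep["dependents"])
    return f"{head}<-[{tail}]"


example_gains = {"module": 1.8, "opr": 3.2, "cf": 0.6}
relation = build_dependency(example_gains)
text = format_dependency(example_gains)
print(text)  # 'opr'(3.2)<-['module'(1.8),'cf'(0.6)]
```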
S103, generating a Json template based on the determined dependency relationship.
For example, the Json template is Json schema, which is used to describe Json data and may be used to check the Json data.
In an embodiment, the generating a Json template based on the determined dependencies includes:
and writing the dependency relationship into a set field of the Json template.
Here, the set field may be a "description" field. The "description" field is used to describe field information and serves as a prompt to the user, making it convenient for the user to check the Json data.
For the process of generating the Json template, reference may be made to the related technology; for example, the data types found in the Json data may be mapped to the data types of the Json template to obtain the Json template.
For each level, a plurality of dependency relationships can be obtained, the dependency relationships can be written into the Json template, and a user can check the Json data according to the Json template when checking the Json data.
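A minimal sketch of generating a Json schema body and writing the dependency relationship into its "description" field is shown below. The function name `generate_schema` and the choice of a single top-level "description" are assumptions; the type names are the standard Json schema type keywords, and only one flat level is handled.

```python
import json

# Mapping from Python types (as parsed by the json module) to Json schema types.
TYPE_MAP = {
    str: "string",
    bool: "boolean",
    int: "integer",
    float: "number",
    list: "array",
    dict: "object",
    type(None): "null",
}


def generate_schema(sample, dependency_text):
    """Build a one-level schema from a sample and record the dependency relation."""
    properties = {key: {"type": TYPE_MAP[type(value)]} for key, value in sample.items()}
    return {
        "type": "object",
        # The dependency relationship is written into the set field as a prompt.
        "description": "possible dependency relation: " + dependency_text,
        "properties": properties,
    }


sample = json.loads('{"opr": 2, "start": 0, "limit": 50, "cf": "abc", "module": "/index"}')
schema = generate_schema(sample, "'opr'(3.2)<-['module'(1.8),'cf'(0.6)]")
print(json.dumps(schema, indent=2))
```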
In practical application, a plurality of Json schemas can be generated according to at least two Json data, each Json schema is compared, and the same Json schemas are merged.
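Merging identical schemas can be sketched as a deduplication over a canonical serialization. This is one possible reading of "comparing and merging"; the function name `merge_identical` is hypothetical, and two schemas are treated as the same only when they are equal key for key.

```python
import json


def merge_identical(schemas):
    """Keep one copy of each schema, comparing them by canonical Json text."""
    unique = {}
    for schema in schemas:
        # sort_keys makes the text independent of key order.
        key = json.dumps(schema, sort_keys=True)
        unique.setdefault(key, schema)
    return list(unique.values())


schemas = [
    {"type": "object", "properties": {"opr": {"type": "integer"}}},
    {"properties": {"opr": {"type": "integer"}}, "type": "object"},
    {"type": "object", "properties": {"cf": {"type": "string"}}},
]
merged = merge_identical(schemas)
print(len(merged))  # 2
```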
And S104, checking the Json data to be checked based on the Json template.
When the Json template is used for checking the Json data, the method comprises the following steps:
step one, obtaining each field of Json data to be checked and a corresponding value.
And step two, acquiring the dependency relationship in the Json template.
The first step and the second step can be executed simultaneously, or any one of the steps can be executed first.
And step three, judging whether each field of the Json data to be checked and the corresponding value meet the dependency relationship. If so, the Json data to be checked passes the check. If not, the Json data to be checked does not pass the check.
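The three steps above can be sketched as follows. This is an assumption-laden illustration: the dependency relationship is represented as a list of field names ordered from the largest expected value to the smallest, values are compared after conversion to numbers, and the function name `check` and the sample inputs are hypothetical.

```python
def check(data, ordered_fields):
    """Return (passed, reason) for Json data against an ordered dependency."""
    # Every field named in the dependency relationship must be present.
    missing = [field for field in ordered_fields if field not in data]
    if missing:
        return False, "missing fields: " + ", ".join(missing)
    # The values must be comparable as numbers in this simplified sketch.
    try:
        values = [float(data[field]) for field in ordered_fields]
    except (TypeError, ValueError):
        return False, "values are not comparable as numbers"
    # Each value must be strictly greater than the next one in the order.
    if any(a <= b for a, b in zip(values, values[1:])):
        return False, "values violate the size relationship"
    return True, "passed"


dependency = ["opr", "module", "cf"]  # assumed order: opr > module > cf
without_opr = {"start": 0, "limit": 50, "cf": "abc", "module": "/index"}
with_opr = {"opr": 2, "start": 0, "limit": 50, "cf": 0, "module": "1"}
print(check(without_opr, dependency))  # (False, 'missing fields: opr')
print(check(with_opr, dependency))     # (True, 'passed')
```

Returning a reason along with the result lets the caller report why the data was rejected, for example so that a request can be corrected and resubmitted.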
For example, assuming that a batch of HTTP request data is placed according to different URLs, multiple request data are placed under each URL, and each request data is in Json format, a unique Json schema capable of matching each URL is generated.
Firstly, the field information and field values of each level are obtained, and the frequent item sets of each level are mined by a frequent item set mining algorithm. For each frequent item set, the conditional entropy of the field values is calculated, the maximum information entropy gain is found, and the corresponding partition is taken as the dependency relationship between the fields. The JsonSchema body is then generated, and the dependency relationship is written into the JsonSchema.
Assume that the generated JsonSchema contains the following:
"title":"This is a schema that matches body of url:..cgibinaccess_saas.cgi.please pay much more attention to the following key properties:{'opr':0.9036402569593148,'module':0.9271948608137045,'cf':0.9400428265524625},values mean support degree.the possible dependency relation is‘opr’(3.2)<-[‘module’(1.8),‘cf’(0.6)],values mean information gain."
wherein the frequent item set is { 'opr': 0.9036402569593148, 'module': 0.9271948608137045, 'cf': 0.9400428265524625 }. The dependency relationship is 'opr'(3.2) <- [ 'module'(1.8), 'cf'(0.6) ], which represents that the values of the 'cf' and 'module' fields depend on the value of the 'opr' field. Here 3.2, 1.8, and 0.6 represent the size relationship of the field values, that is, the value of the 'opr' field is greater than the value of the 'module' field, which in turn is greater than the value of the 'cf' field; the specific values are not limited as long as the size relationship is met.
Suppose that there is a Json data to be checked as an HTTP request, and the body data segment of the HTTP request is:
"start": 0,
"limit": 50,
"cf": "c64da58266dbd3a1ea960596e94515ac",
"module": "/mod-monitor/abnormal-traffic/index"
When this Json data is checked with the JsonSchema, it does not pass the check, because the 'opr' field is missing and the Json data therefore does not conform to the dependency relationship.
After the HTTP request is modified according to the verification result, the body data segment of the modified HTTP request is as follows:
"opr": 2,
"start": 0,
"limit": 50,
"cf": 0,
"module": "1"
When this Json data is checked with the JsonSchema, the 'cf', 'module', and 'opr' fields are all present and their values conform to the dependency relationship, so the check is passed.
The validity of the request data is checked through the Json template; only request data that passes the check is allowed through, and request data that does not pass the check is intercepted, which enhances data security.
The technical scheme of the embodiment of the present invention is suitable for various data verification scenarios, such as HTTP request verification, web parameter verification, and background API verification.
According to the embodiment of the invention, at least one frequent item set is determined through the hierarchical data of each Json data in at least two Json data. Determining the dependency relationship among fields in each frequent item set of at least one frequent item set, generating a Json template based on the determined dependency relationship, and checking Json data to be checked based on the Json template. The Json template is used for describing Json data, each frequent item set in at least one frequent item set corresponds to one of at least two levels corresponding to at least two Json data, the frequent item sets represent value sets of fields appearing in the corresponding levels at the same time, and the level data represent the fields and values of each level of the corresponding Json data. According to the embodiment of the invention, the dependency relationship among the fields is determined through the frequent item set, and the Json template is generated based on the dependency relationship. The determined dependency relationship is more accurate, and when the Json template is used for verifying the Json data to be verified, the accuracy of the verification result can be improved.
Referring to fig. 5, fig. 5 is a flowchart of generating json schema according to an application embodiment of the present invention, where the process of generating json schema includes:
First, data loading: a batch of Json data is loaded and preprocessed in batches to obtain the fields and values of every level.
Next, key field mining: for each level, a frequent item set is calculated with a frequent item set mining algorithm and taken as the key fields. Here, the Apriori algorithm can be used to mine the frequent item set.
Next, dependency relationship mining: for each frequent item set, the information entropy gain corresponding to each field in the frequent item set is calculated, and a dependency relationship is determined according to the information entropy gain.
Finally, the JsonSchema is generated, and the dependency relationship is written into the description field of the JsonSchema.
The JsonSchema generated by the application embodiment of the invention can be directly used for checking Json data, so that a user is prevented from manually inputting information into the JsonSchema, and the operation experience of the user is improved. And the determined dependency relationship is more accurate, and the accuracy of the check result of the Json data checked by the Json template can be improved.
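The last step of the flow can be sketched as follows. This is only a minimal illustration under stated assumptions, not the implementation of the embodiment: the function names `build_schema` and `check`, the JSON encoding used inside the description field, and the rule that a dependent field may only appear together with the field it depends on are all hypothetical choices made for the example.

```python
import json
from typing import Any, Dict, List


def build_schema(fields: List[str], depended_upon: str) -> Dict[str, Any]:
    """Build a small JsonSchema-like template for one frequent item set.

    The dependency relationship is serialized into the "description" field.
    The encoding (a JSON string) is an assumption made for this sketch.
    """
    dependents = [f for f in fields if f != depended_upon]
    return {
        "$schema": "http://json-schema.org/draft-07/schema#",
        "type": "object",
        "properties": {f: {} for f in fields},
        "description": json.dumps(
            {"depended_upon": depended_upon, "dependents": dependents}
        ),
    }


def check(data: Dict[str, Any], schema: Dict[str, Any]) -> bool:
    """Check Json data against the dependency stored in the template.

    Assumed rule: a dependent field is only valid when the depended-upon
    field is also present in the data.
    """
    dependency = json.loads(schema["description"])
    if dependency["depended_upon"] in data:
        return True
    return not any(f in data for f in dependency["dependents"])


# Hypothetical field names, used only for illustration.
schema = build_schema(["type", "card_no", "cvv"], depended_upon="type")
ok = check({"type": "card", "card_no": "1234"}, schema)
rejected = check({"card_no": "1234"}, schema)
```

Storing the relationship in the description field keeps the template a plain JsonSchema document, so the same file can still be passed to an ordinary schema validator, while a checker that understands the encoding can apply the dependency rule on top.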
It should be understood that, the sequence numbers of the steps in the foregoing embodiments do not imply an execution sequence, and the execution sequence of each process should be determined by its function and inherent logic, and should not constitute any limitation to the implementation process of the embodiments of the present invention.
It will be understood that the terms "comprises" and/or "comprising," when used in this specification and the appended claims, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
The technical means described in the embodiments of the present invention may be arbitrarily combined without conflict.
In addition, in the embodiments of the present invention, "first", "second", and the like are used for distinguishing similar objects, and are not necessarily used for describing a specific order or a sequential order.
Referring to fig. 6, fig. 6 is a schematic diagram of a data processing apparatus according to an embodiment of the present invention. As shown in fig. 6, the apparatus includes a first determining module, a second determining module, a generating module, and a checking module.
The first determining module is used for determining at least one frequent item set based on the hierarchical data of each Json data in at least two Json data; each frequent item set in the at least one frequent item set corresponds to one of at least two levels corresponding to the at least two Json data; the frequent item set represents a value set of fields which appear in a corresponding hierarchy at the same time; the hierarchical data represents fields and values of each hierarchy of corresponding Json data;
a second determining module, configured to determine a dependency relationship between fields in each frequent item set of the at least one frequent item set;
the generating module is used for generating a Json template based on the determined dependency relationship;
and the checking module is used for checking the Json data to be checked based on the Json template.
In one embodiment, the generating module, when generating a Json template based on the determined dependency relationship, is configured to:
and writing the dependency relationship into a set field of the Json template.
In an embodiment, the first determining module, when determining the at least one frequent item set based on hierarchical data of each of the at least two Json data, is configured to:
determining at least two candidate item sets for each of the at least two levels;
calculating a support degree of each candidate item set of the at least two candidate item sets; the support degree characterizes the frequency of occurrence of the corresponding candidate item set in the at least two Json data;
determining a frequent item set for each of the at least two hierarchies based on the support.
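The candidate generation and support filtering described above can be sketched in an Apriori-like, level-wise form. The sketch makes several assumptions that are not stated in the text: each Json data contributes, for one level, a set of field names (a "transaction"); support is the fraction of transactions that contain a candidate item set; and the names `frequent_item_sets` and `min_support` are hypothetical.

```python
from itertools import combinations
from typing import Dict, FrozenSet, List, Set


def frequent_item_sets(
    transactions: List[Set[str]], min_support: float
) -> Dict[FrozenSet[str], float]:
    """Return every item set whose support is at least min_support."""
    n = len(transactions)
    items = sorted({item for t in transactions for item in t})
    candidates = [frozenset([item]) for item in items]
    frequent: Dict[FrozenSet[str], float] = {}
    size = 1
    while candidates:
        # Support: fraction of transactions that contain the candidate.
        support = {
            c: sum(1 for t in transactions if c <= t) / n for c in candidates
        }
        survivors = {c: s for c, s in support.items() if s >= min_support}
        frequent.update(survivors)
        size += 1
        # Join step: merge surviving sets into candidates one item larger.
        joined = {
            a | b for a, b in combinations(survivors, 2) if len(a | b) == size
        }
        # Prune step: every (size - 1)-subset must itself be frequent.
        candidates = [
            c
            for c in joined
            if all(frozenset(s) in survivors for s in combinations(c, size - 1))
        ]
    return frequent


# Fields seen at one level of four hypothetical Json data.
level_fields = [
    {"type", "card_no", "cvv"},
    {"type", "card_no", "cvv"},
    {"type", "account"},
    {"type", "card_no", "cvv"},
]
result = frequent_item_sets(level_fields, min_support=0.6)
```

In this example the field `account` appears in only one of four Json data and is filtered out, while `type`, `card_no` and `cvv` survive together as one frequent item set.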
In one embodiment, the second determining module, when determining the dependency between the fields in each of the at least one frequent item set, is configured to:
calculating information entropy gains corresponding to each field in each frequent item set of the at least one frequent item set;
determining the dependency based on the information entropy gain.
In an embodiment, the second determining module, when calculating the information entropy gain corresponding to each field in each frequent item set of the at least one frequent item set, is configured to:
calculating the information entropy of each frequent item set of the at least one frequent item set;
calculating the conditional entropy corresponding to each field in each frequent item set of the at least one frequent item set;
and determining the information entropy gain corresponding to each field in the corresponding frequent item set based on the information entropy and the conditional entropy.
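One common way to write these three quantities is given below. The notation is an assumption introduced for illustration, since the text above does not fix the exact definitions: S denotes the records covered by one frequent item set, p(s) the relative frequency of the value combination s among those records, A one field of the frequent item set, V(A) the set of values taken by A, and S_a the records in which A takes the value a.

```latex
\begin{align}
H(S) &= -\sum_{s} p(s)\,\log_2 p(s) \\
H(S \mid A) &= \sum_{a \in V(A)} \frac{|S_a|}{|S|}\, H(S_a) \\
\mathrm{Gain}(A) &= H(S) - H(S \mid A)
\end{align}
```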
In an embodiment, the second determining module, when determining the dependency relationship based on the information entropy gain, is configured to:
determining the field with the largest information entropy gain in the frequent item set as the depended-upon party in the dependency relationship.
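The entropy-based selection can be sketched as follows. This is an illustrative reading rather than the definitive method: it assumes that each Json data is reduced to a record mapping the fields of one frequent item set to their values, that the information entropy of the frequent item set is the entropy of the joint value combinations, and that the conditional entropy of a field is the weighted entropy of the records grouped by that field's value. All function and field names are hypothetical.

```python
import math
from collections import Counter, defaultdict
from typing import Any, Dict, List, Sequence, Tuple


def entropy(rows: Sequence[Tuple[Any, ...]]) -> float:
    """Shannon entropy (base 2) of the value combinations in rows."""
    total = len(rows)
    counts = Counter(rows)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def information_gains(
    records: List[Dict[str, Any]], fields: List[str]
) -> Dict[str, float]:
    """Information entropy gain of each field within one frequent item set."""
    rows = [tuple(record[f] for f in fields) for record in records]
    base = entropy(rows)
    gains: Dict[str, float] = {}
    for index, field in enumerate(fields):
        # Group the records by the value of this field.
        groups: Dict[Any, List[Tuple[Any, ...]]] = defaultdict(list)
        for row in rows:
            groups[row[index]].append(row)
        # Conditional entropy: weighted entropy of each group.
        conditional = sum(
            len(group) / len(rows) * entropy(group) for group in groups.values()
        )
        gains[field] = base - conditional
    return gains


def depended_upon_field(gains: Dict[str, float]) -> str:
    """The field with the largest gain is taken as the depended-upon party."""
    return max(gains, key=gains.get)


# Hypothetical records for a frequent item set with two fields.
records = [
    {"type": "card", "currency": "usd"},
    {"type": "card", "currency": "usd"},
    {"type": "bank", "currency": "usd"},
    {"type": "wallet", "currency": "usd"},
]
gains = information_gains(records, ["type", "currency"])
chosen = depended_upon_field(gains)
```

Here knowing `type` removes all uncertainty about the record, while `currency` is constant and removes none, so `type` is selected as the depended-upon party.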
In one embodiment, the apparatus further comprises:
the processing module is used for carrying out flattening processing on each Json data in the at least two Json data to obtain the hierarchical data of each Json data in the at least two Json data.
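A minimal sketch of such flattening is given below. It assumes that nested objects define the levels, that the top-level fields are level 1, and that a nested field is recorded under a dotted path; the function name `flatten_by_level` and these conventions are illustrative assumptions, and arrays are simply kept as leaf values.

```python
from collections import defaultdict
from typing import Any, Dict, Optional


def flatten_by_level(
    doc: Dict[str, Any],
    level: int = 1,
    prefix: str = "",
    out: Optional[Dict[int, Dict[str, Any]]] = None,
) -> Dict[int, Dict[str, Any]]:
    """Collect the fields and values of each level of one Json data."""
    if out is None:
        out = defaultdict(dict)
    for key, value in doc.items():
        path = f"{prefix}{key}"
        if isinstance(value, dict):
            # A nested object opens the next level.
            flatten_by_level(value, level + 1, path + ".", out)
        else:
            out[level][path] = value
    return out


# Hypothetical Json data with two levels.
sample = {"type": "card", "pay": {"card_no": "1234", "cvv": "999"}}
levels = flatten_by_level(sample)
```

The per-level dictionaries produced this way can serve as the hierarchical data from which candidate item sets are built for each level.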
In practical applications, the first determining module, the second determining module and the generating module may be implemented by a processor in an electronic device, such as a Central Processing Unit (CPU), a Digital Signal Processor (DSP), a Micro Control Unit (MCU), or a Field-Programmable Gate Array (FPGA).
It should be noted that: in the data processing apparatus provided in the foregoing embodiment, when performing data processing, only the division of the above modules is exemplified, and in practical applications, the processing may be distributed to different modules as needed, that is, the internal structure of the apparatus may be divided into different modules to complete all or part of the processing described above. In addition, the data processing apparatus and the data processing method provided by the above embodiments belong to the same concept, and specific implementation processes thereof are described in the method embodiments for details, which are not described herein again.
Based on the hardware implementation of the program module, in order to implement the method of the embodiment of the present application, an embodiment of the present application further provides an electronic device. Fig. 7 is a schematic diagram of a hardware structure of an electronic device according to an embodiment of the present application, where as shown in fig. 7, the electronic device includes:
the communication interface can carry out information interaction with other equipment such as network equipment and the like;
and the processor is connected with the communication interface to realize information interaction with other equipment, and is configured to execute, when running a computer program, the method provided by one or more of the foregoing technical solutions on the electronic device side. The computer program is stored in a memory of the electronic device.
Of course, in practice, the various components in the electronic device are coupled together by a bus system. It will be appreciated that the bus system is used to enable connection and communication between these components. In addition to a data bus, the bus system includes a power bus, a control bus, and a status signal bus. For clarity of illustration, however, the various buses are all labeled as the bus system in fig. 7.
The memory in the embodiments of the present application is used to store various types of data to support the operation of the electronic device. Examples of such data include: any computer program for operating on an electronic device.
It will be appreciated that the memory can be either volatile memory or nonvolatile memory, and can include both volatile and nonvolatile memory. The nonvolatile memory may be a Read-Only Memory (ROM), a Programmable Read-Only Memory (PROM), an Erasable Programmable Read-Only Memory (EPROM), an Electrically Erasable Programmable Read-Only Memory (EEPROM), a Ferromagnetic Random Access Memory (FRAM), a Flash Memory, a magnetic surface memory, an optical disc, or a Compact Disc Read-Only Memory (CD-ROM); the magnetic surface memory may be disk storage or tape storage. The volatile memory can be a Random Access Memory (RAM), which acts as an external cache. By way of illustration and not limitation, many forms of RAM are available, such as Static Random Access Memory (SRAM), Synchronous Static Random Access Memory (SSRAM), Dynamic Random Access Memory (DRAM), Synchronous Dynamic Random Access Memory (SDRAM), Double Data Rate Synchronous Dynamic Random Access Memory (DDRSDRAM), Enhanced Synchronous Dynamic Random Access Memory (ESDRAM), SyncLink Dynamic Random Access Memory (SLDRAM), and Direct Rambus Random Access Memory (DRRAM). The memories described in the embodiments of the present application are intended to comprise, without being limited to, these and any other suitable types of memory.
The method disclosed by the embodiment of the present application can be applied to a processor, or can be implemented by the processor. The processor may be an integrated circuit chip having signal processing capabilities. In implementation, the steps of the above method may be performed by integrated logic circuits of hardware in a processor or instructions in the form of software. The processor described above may be a general purpose processor, a DSP, or other programmable logic device, discrete gate or transistor logic device, discrete hardware components, or the like. The processor may implement or perform the methods, steps, and logic blocks disclosed in the embodiments of the present application. The general purpose processor may be a microprocessor or any conventional processor or the like. The steps of the method disclosed in the embodiments of the present application may be directly implemented by a hardware decoding processor, or implemented by a combination of hardware and software modules in the decoding processor. The software modules may be located in a storage medium located in a memory where a processor reads the programs in the memory and in combination with its hardware performs the steps of the method as previously described.
Optionally, when the processor executes the program, the corresponding process implemented by the electronic device in each method of the embodiment of the present application is implemented, and for brevity, is not described again here.
In an exemplary embodiment, the present application further provides a storage medium, specifically a computer storage medium, for example, a first memory storing a computer program, where the computer program is executable by a processor of an electronic device to perform the steps of the foregoing method. The computer readable storage medium may be Memory such as FRAM, ROM, PROM, EPROM, EEPROM, flash Memory, magnetic surface Memory, optical disk, or CD-ROM.
In the several embodiments provided in the present application, it should be understood that the disclosed apparatus, electronic device and method may be implemented in other ways. The above-described device embodiments are merely illustrative, for example, the division of the unit is only one logical function division, and there may be other division ways in actual implementation, such as: multiple units or components may be combined, or may be integrated into another system, or some features may be omitted, or not implemented. In addition, the coupling, direct coupling or communication connection between the components shown or discussed may be through some interfaces, and the indirect coupling or communication connection between the devices or units may be electrical, mechanical or in other forms.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, that is, may be located in one place, or may be distributed on a plurality of network units; some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, all functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may be separately regarded as one unit, or two or more units may be integrated into one unit; the integrated unit may be implemented in the form of hardware, or in the form of hardware plus a software functional unit.
Those of ordinary skill in the art will understand that: all or part of the steps of implementing the method embodiments may be implemented by hardware related to program instructions, and the program may be stored in a computer-readable storage medium, and when executed, executes the steps including the method embodiments; and the aforementioned storage medium includes: a removable storage device, a ROM, a RAM, a magnetic or optical disk, or various other media that can store program code.
Alternatively, the integrated units described above in the present application may be stored in a computer-readable storage medium if they are implemented in the form of software functional modules and sold or used as independent products. Based on such understanding, the technical solutions of the embodiments of the present application may be essentially implemented or portions thereof contributing to the prior art may be embodied in the form of a software product stored in a storage medium, and including several instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the methods described in the embodiments of the present application. And the aforementioned storage medium includes: a removable storage device, a ROM, a RAM, a magnetic or optical disk, or various other media that can store program code.
The technical means described in the embodiments of the present application may be arbitrarily combined without conflict.
In addition, in the examples of the present application, "first", "second", and the like are used for distinguishing similar objects, and are not necessarily used for describing a specific order or a sequential order.
The above description is only for the specific embodiments of the present application, but the scope of the present application is not limited thereto, and any person skilled in the art can easily conceive of the changes or substitutions within the technical scope of the present application, and shall be covered by the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (10)

1. A method of data processing, the method comprising:
determining at least one frequent item set based on hierarchical data of each of the at least two Json data; each frequent item set in the at least one frequent item set corresponds to one of at least two levels corresponding to the at least two Json data; the frequent item set represents a value set of fields which appear in a corresponding hierarchy at the same time; the hierarchical data represents fields and values of each hierarchy of corresponding Json data;
determining a dependency relationship between fields in each frequent item set of the at least one frequent item set;
generating a Json template based on the determined dependency relationship;
and checking the Json data to be checked based on the Json template.
2. The method of claim 1, wherein generating a Json template based on the determined dependencies comprises:
and writing the dependency relationship into a set field of the Json template.
3. The method of claim 1, wherein determining at least one frequent item set based on hierarchical data of each of at least two Json data comprises:
determining at least two candidate item sets for each of the at least two levels;
calculating a support degree of each candidate item set of the at least two candidate item sets; the support degree characterizes the frequency of occurrence of the corresponding candidate item set in the at least two Json data;
determining a frequent item set for each of the at least two hierarchies based on the support.
4. The method of claim 1, wherein said determining dependencies between fields in each of said at least one frequent item set comprises:
calculating information entropy gains corresponding to each field in each frequent item set of the at least one frequent item set;
determining the dependency based on the information entropy gain.
5. The method of claim 4, wherein said calculating information entropy gains for each field in each of said at least one set of frequent items comprises:
calculating the information entropy of each frequent item set of the at least one frequent item set;
calculating the conditional entropy corresponding to each field in each frequent item set of the at least one frequent item set;
and determining the information entropy gain corresponding to each field in the corresponding frequent item set based on the information entropy and the conditional entropy.
6. The method of claim 4, wherein determining the dependency based on the information entropy gain comprises:
and determining the field with the largest information entropy gain in the frequent item set as the depended-upon party in the dependency relationship.
7. The method of claim 1, wherein prior to determining at least one frequent item set, the method further comprises:
and flattening each Json data in the at least two Json data to obtain the hierarchical data of each Json data in the at least two Json data.
8. A data processing apparatus, comprising:
the first determining module is used for determining at least one frequent item set based on the hierarchical data of each Json data in at least two Json data; each frequent item set in the at least one frequent item set corresponds to one of at least two levels corresponding to the at least two Json data; the frequent item set represents a value set of fields which appear in a corresponding hierarchy at the same time; the hierarchical data represents fields and values of each hierarchy of corresponding Json data;
a second determining module, configured to determine a dependency relationship between fields in each frequent item set of the at least one frequent item set;
the generating module is used for generating a Json template based on the determined dependency relationship;
and the checking module is used for checking the Json data to be checked based on the Json template.
9. An electronic device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, characterized in that the processor implements the data processing method according to any one of claims 1 to 7 when executing the computer program.
10. A computer-readable storage medium, characterized in that it stores a computer program comprising program instructions which, when executed by a processor, cause the processor to carry out the data processing method of any one of claims 1 to 7.
CN202110836606.9A 2021-07-23 2021-07-23 Data processing method and device, electronic equipment and storage medium Pending CN115686597A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110836606.9A CN115686597A (en) 2021-07-23 2021-07-23 Data processing method and device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN115686597A true CN115686597A (en) 2023-02-03

Family

ID=85044419

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110836606.9A Pending CN115686597A (en) 2021-07-23 2021-07-23 Data processing method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN115686597A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116455753A (en) * 2023-06-14 2023-07-18 新华三技术有限公司 Data smoothing method and device
CN116455753B (en) * 2023-06-14 2023-08-18 新华三技术有限公司 Data smoothing method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination