CN118035220A - Structured data repair method, apparatus, electronic device and computer readable medium - Google Patents

Info

Publication number
CN118035220A
Authority
CN
China
Prior art keywords
data
structured data
structured
data set
integrated
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202410182674.1A
Other languages
Chinese (zh)
Other versions
CN118035220B (en
Inventor
葛殿永
王艺星
刘旭
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Park Road Credit Information Co ltd
Original Assignee
Park Road Credit Information Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Park Road Credit Information Co ltd
Priority to CN202410182674.1A
Publication of CN118035220A
Application granted
Publication of CN118035220B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/21 Design, administration or maintenance of databases
    • G06F16/215 Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/21 Design, administration or maintenance of databases
    • G06F16/219 Managing data history or versioning
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02P CLIMATE CHANGE MITIGATION TECHNOLOGIES IN THE PRODUCTION OR PROCESSING OF GOODS
    • Y02P90/00 Enabling technologies with a potential contribution to greenhouse gas [GHG] emissions mitigation
    • Y02P90/30 Computing systems specially adapted for manufacturing

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Quality & Reliability (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

Embodiments of the present disclosure disclose a structured data repair method, apparatus, electronic device, and computer readable medium. One embodiment of the method comprises: collecting log data corresponding to each piece of product information to obtain a product log data set; generating a structured data set according to at least one piece of unstructured data corresponding to the product log data set; determining user behavior for the structured data corresponding to each data level in a data level list to obtain a user behavior list; in response to determining that an inspection result indicates an inspection anomaly, performing data repair on the integrated structured data set to obtain a repaired structured data set; and storing the repaired structured data set to a log data analysis platform. This embodiment reduces the complexity among the data, reduces data loss during processing, reduces the likelihood of anomalies, and improves data security.

Description

Structured data repair method, apparatus, electronic device and computer readable medium
Technical Field
Embodiments of the present disclosure relate to the field of computer technology, and in particular, to a method, an apparatus, an electronic device, and a computer readable medium for repairing structured data.
Background
Structured data repair is a technique for repairing structured data. It is typically carried out with a data processing tool (e.g., the Python programming language) when semi-structured data is converted into structured data. When the repaired structured data set is subsequently stored, the security of the repair can be improved by adjusting the abnormal information that arises among the data. In addition, when different kinds of data are involved, data integration can improve the efficiency of structured data repair and shorten the repair cycle. Currently, structured data repair is generally performed as follows: when structured data is missing or redundant, it is repaired through manual intervention.
However, this approach often has the following technical problems:
First, as product information generates more and more kinds of data, the complexity among the data increases. Manual intervention cannot process the data comprehensively, so the associations among the data are not fully considered. As a result, data is easily lost during processing, data security decreases, and anomalies become more likely.
Second, as more and more repaired structured data sets are stored, the log data analysis platform occupies more memory of the user operating system. The load on the user operating system becomes excessive, and the stored structured data is prone to anomalies. Because the abnormal data cannot be adjusted dynamically, the performance of the user operating system degrades, and the system may fail.
Third, during data repair, the strong associations among complex data lower the efficiency of data integration, so data repair requires more computing resources and takes longer. A gap then arises between the resources the repair requires and those of the computer carrying it out, which lowers efficiency and lengthens the repair cycle.
The above information disclosed in this background section is only for enhancement of understanding of the background of the inventive concept and, therefore, may contain information that does not form the prior art that is already known to those of ordinary skill in the art in this country.
Disclosure of Invention
This section is intended to introduce, in a simplified form, concepts that are further described in the Detailed Description below. It is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used to limit the scope of the claimed subject matter.
Some embodiments of the present disclosure propose structured data repair methods, apparatus, electronic devices, and computer readable media to address one or more of the technical problems mentioned in the background section above.
In a first aspect, some embodiments of the present disclosure provide a structured data repair method, the method comprising: collecting log data corresponding to each piece of product information to obtain a product log data set; generating a structured data set according to at least one piece of unstructured data corresponding to the product log data set; performing format parsing on at least one piece of semi-structured data corresponding to the product log data set to obtain a parsed structured data set; performing data integration on the structured data set, at least one piece of structured data corresponding to the product log data set, and the parsed structured data set to obtain an integrated structured data set; determining the data level of each piece of integrated structured data in the integrated structured data set to obtain a data level list; determining user behavior for the structured data corresponding to each data level in the data level list to obtain a user behavior list; performing a security inspection on the integrated structured data set according to the user behavior list to obtain an inspection result; in response to determining that the inspection result indicates an inspection anomaly, performing data repair on the integrated structured data set to obtain a repaired structured data set; and storing the repaired structured data set to a log data analysis platform.
In a second aspect, some embodiments of the present disclosure provide a structured data repair apparatus, the apparatus comprising: an acquisition unit configured to collect log data corresponding to each piece of product information to obtain a product log data set; a generation unit configured to generate a structured data set according to at least one piece of unstructured data corresponding to the product log data set; a format parsing unit configured to perform format parsing on at least one piece of semi-structured data corresponding to the product log data set to obtain a parsed structured data set; a data integration unit configured to integrate the structured data set, at least one piece of structured data corresponding to the product log data set, and the parsed structured data set to obtain an integrated structured data set; a determining unit configured to determine the data level of each piece of integrated structured data in the integrated structured data set to obtain a data level list; an analysis unit configured to determine user behavior for the structured data corresponding to each data level in the data level list to obtain a user behavior list; a security audit unit configured to perform a security inspection on the integrated structured data set according to the user behavior list to obtain an inspection result; a data repair unit configured to, in response to determining that the inspection result indicates an inspection anomaly, perform data repair on the integrated structured data set to obtain a repaired structured data set; and a storage unit configured to store the repaired structured data set to a log data analysis platform.
In a third aspect, some embodiments of the present disclosure provide an electronic device comprising: one or more processors; a storage device having one or more programs stored thereon, which when executed by one or more processors causes the one or more processors to implement the method described in any of the implementations of the first aspect above.
In a fourth aspect, some embodiments of the present disclosure provide a computer readable medium having a computer program stored thereon, wherein the program, when executed by a processor, implements the method described in any of the implementations of the first aspect above.
The above embodiments of the present disclosure have the following advantages: the structured data repair method of some embodiments of the present disclosure reduces the complexity among the data, reduces data loss during processing, reduces the likelihood of anomalies, and improves data security. Specifically, increased complexity among the data easily leads to data loss during processing, lower data security, and a higher likelihood of anomalies for the following reason: as product information generates more and more kinds of data, the complexity among the data increases; manual intervention cannot process the data comprehensively, so the associations among the data are not fully considered; data is therefore easily lost during processing, data security decreases, and anomalies become more likely. Based on this, in the structured data repair method of some embodiments of the present disclosure, first, log data corresponding to each piece of product information is collected to obtain a product log data set. The product log data set is thus available for subsequent operations. Second, a structured data set is generated according to at least one piece of unstructured data corresponding to the product log data set. Unstructured data can thus be converted into structured data, which reduces the complexity among the data. Third, format parsing is performed on at least one piece of semi-structured data corresponding to the product log data set to obtain a parsed structured data set. Format parsing thus improves the readability of the data. Fourth, the structured data set, at least one piece of structured data corresponding to the product log data set, and the parsed structured data set are integrated to obtain an integrated structured data set. Multiple data sets can thus be integrated into one, which reduces the complexity among the data sets and the loss incurred in subsequent processing. Fifth, the data level of each piece of integrated structured data in the integrated structured data set is determined to obtain a data level list. The integrated structured data set can thus be divided into levels, so that the associations among the data are fully considered and data loss during processing is avoided. Sixth, user behavior is determined for the structured data corresponding to each data level in the data level list to obtain a user behavior list. Data security can thus be improved. Seventh, a security inspection is performed on the integrated structured data set according to the user behavior list to obtain an inspection result. The likelihood of anomalies can thus be reduced. Eighth, in response to determining that the inspection result indicates an inspection anomaly, data repair is performed on the integrated structured data set to obtain a repaired structured data set. Data security can thus be improved. Finally, the repaired structured data set is stored to a log data analysis platform, which further improves data security. In this way, the complexity among the data and the data loss during processing are both reduced, so that anomalies become less likely and data security is improved.
Drawings
The above and other features, advantages, and aspects of embodiments of the present disclosure will become more apparent by reference to the following detailed description when taken in conjunction with the accompanying drawings. The same or similar reference numbers will be used throughout the drawings to refer to the same or like elements. It should be understood that the figures are schematic and that elements and components are not necessarily drawn to scale.
FIG. 1 is a flow chart of some embodiments of a structured data repair method according to the present disclosure;
FIG. 2 is a schematic structural diagram of some embodiments of a structured data repair device according to the present disclosure;
FIG. 3 is a schematic structural diagram of an electronic device suitable for use in implementing some embodiments of the present disclosure.
Detailed Description
Embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While certain embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be construed as limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete. It should be understood that the drawings and embodiments of the present disclosure are for illustration purposes only and are not intended to limit the scope of the present disclosure.
It should be noted that, for convenience of description, only the portions related to the present invention are shown in the drawings. Embodiments of the present disclosure and features of embodiments may be combined with each other without conflict.
It should be noted that the terms "first," "second," and the like in this disclosure are merely used to distinguish between different devices, modules, or units and are not used to define an order or interdependence of functions performed by the devices, modules, or units.
It should be noted that the modifiers "one" and "a plurality" mentioned in this disclosure are illustrative rather than limiting, and those of ordinary skill in the art will appreciate that they should be understood as "one or more" unless the context clearly indicates otherwise.
The names of messages or information interacted between the various devices in the embodiments of the present disclosure are for illustrative purposes only and are not intended to limit the scope of such messages or information.
The present disclosure will be described in detail below with reference to the accompanying drawings in conjunction with embodiments.
FIG. 1 shows a flow 100 of some embodiments of the structured data repair method of the present disclosure. The structured data repair method comprises the following steps:
Step 101, collecting log data corresponding to each piece of product information to obtain a product log data set.
In some embodiments, an execution body (e.g., a computing device) of the structured data repair method may collect the log data corresponding to each piece of product information through an associated business data acquisition device (e.g., a data collector) to obtain a product log data set.
Here, the product information in the above-described respective product information may refer to information including product version information, production date information, product price, and product name. Here, the product log data in the product log data set may refer to data describing log activities of product information in the respective product information. The logging activity may include, but is not limited to, at least one of: product access records, product security prompts, user feedback notifications and false alarms.
Step 102, generating a structured data set according to at least one unstructured data corresponding to the product log data set.
In some embodiments, the execution body may generate the structured data set according to at least one unstructured data corresponding to the product log data set.
Here, the unstructured data in the at least one unstructured data may be product log data with irregular data structure, inconsistent presentation and free format. The structured data in the structured data set described above may refer to product log data organized according to a predefined structure and an explicit format.
As an example, the execution body may screen the at least one piece of unstructured data corresponding to the product log data set to obtain at least one piece of screened structured data. Then, data extraction is performed on each piece of screened structured data to generate extracted structured data, yielding an extracted structured data set. Here, the extracted structured data in the extracted structured data set may refer to data of a predefined field. For example, the predefined field may be a product access record. Next, each piece of extracted structured data in the extracted structured data set is marked to generate marked structured data, yielding a marked structured data set. Finally, the marked structured data set is determined as the structured data set.
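The screen, extract, and mark sequence in this example can be illustrated with a minimal Python sketch. The predefined field name, the log entry layout ("field: value"), and every function name below are assumptions made for illustration; the disclosure does not specify them.

```python
import re

# Hypothetical predefined field; the text only gives "product access record" as an example.
PREDEFINED_FIELD = "product access record"


def screen(entries):
    """Keep only non-empty entries that mention the predefined field."""
    return [e.strip() for e in entries if e and PREDEFINED_FIELD in e.lower()]


def extract(entry):
    """Pull the value that follows 'product access record:' in a log entry."""
    match = re.search(r"product access record:\s*(.+)", entry, flags=re.IGNORECASE)
    return match.group(1).strip() if match else None


def mark(value):
    """Attach a label so later steps know which field the value belongs to."""
    return {"field": PREDEFINED_FIELD, "value": value}


def build_structured_group(entries):
    extracted = [extract(e) for e in screen(entries)]
    return [mark(v) for v in extracted if v is not None]


raw_logs = [
    "  Product access record: user 17 opened item A  ",
    "free-form note with no known field",
    "",
    "product access record: user 42 opened item B",
]
structured_group = build_structured_group(raw_logs)
```

Entries that do not mention the predefined field are dropped during screening, so only two marked records remain for the sample input.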
Optionally, the execution body may generate the structured data set according to the at least one piece of unstructured data corresponding to the product log data set through the following steps:
First step, performing text recognition on each piece of unstructured data in the at least one piece of unstructured data to generate recognized text data, obtaining a recognized text data set.
As an example, the execution body may perform text recognition on each piece of unstructured data in the at least one piece of unstructured data through a text recognition technique to generate recognized text data, obtaining a recognized text data set. The text recognition technique may be optical character recognition (OCR).
Second step, for each piece of recognized text data in the recognized text data set, performing the following processing steps:
First sub-step, performing text segmentation on the recognized text data to generate a segmented text data set.
As an example, the execution body may perform semantic analysis on the recognized text data through a semantic analysis method (the semantic differential method) to generate analyzed text data. Then, the analyzed text data is segmented at punctuation marks to generate a segmented text data set.
Second sub-step, performing keyword extraction on the segmented text data set to generate an extracted text keyword group.
As an example, the execution body may perform keyword extraction on the segmented text data set using the term frequency-inverse document frequency (TF-IDF) algorithm to generate an extracted text keyword group.
Third sub-step, performing part-of-speech tagging on the extracted text keyword group to obtain a part-of-speech-tagged text keyword group.
As an example, the execution body may input the extracted text keyword group into a Seq2Seq model to obtain the part-of-speech-tagged text keyword group. Here, the Seq2Seq model is a model that determines the part of speech corresponding to each extracted text keyword in the extracted text keyword group so as to generate part-of-speech-tagged text keywords.
Third step, selecting, from the obtained set of part-of-speech-tagged text keyword groups, the part-of-speech-tagged text keywords whose part-of-speech information is the target part-of-speech information, to obtain a target part-of-speech text keyword set.
Here, the target part-of-speech information may refer to part-of-speech information indicating that the part of speech is a verb.
Fourth step, performing data extraction on at least one target part-of-speech text keyword in the target part-of-speech text keyword set to obtain an extracted keyword group.
As an example, the execution body may perform data extraction on at least one target part-of-speech text keyword in the target part-of-speech text keyword set through a regular expression to obtain an extracted keyword group.
Fifth step, converting each extracted keyword in the extracted keyword group according to a preset structure to obtain the structured data set. Here, the preset structure may refer to a preset dictionary structure. For example, the preset structure may be { "access exception": "string", "error warning": "string" }, where "access exception" and "error warning" may both be extracted keywords, and "string" may refer to the string type.
As an example, the execution body may convert each extracted keyword in the extracted keyword group into data with a preset structure, to obtain a structured data set.
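The steps above (segmentation, TF-IDF keyword extraction, part-of-speech filtering, regular-expression extraction, and conversion into a dictionary structure) can be sketched in plain Python. This is a simplified illustration under several assumptions: OCR and semantic analysis are skipped, a small hand-written verb lexicon stands in for the Seq2Seq tagger, each segment is treated as one TF-IDF document, and each keyword is mapped to the segment it came from as its string value. None of these choices come from the disclosure.

```python
import math
import re
from collections import Counter

# Assumption: a small hand-written verb lexicon stands in for the Seq2Seq tagger.
VERB_LEXICON = {"failed", "denied", "warned"}


def segment(text):
    """Split recognized text into segments at punctuation marks."""
    return [part.strip() for part in re.split(r"[.;!?]", text) if part.strip()]


def tokenize(segment_text):
    return re.findall(r"[a-z]+", segment_text.lower())


def rank_keywords(segments):
    """Rank terms by TF-IDF, treating each segment as one document."""
    docs = [tokenize(s) for s in segments]
    doc_freq = Counter(term for doc in docs for term in set(doc))
    scores = {}
    for doc in docs:
        for term, count in Counter(doc).items():
            tf = count / len(doc)
            idf = math.log(len(docs) / doc_freq[term]) + 1.0  # shifted so idf > 0
            scores[term] = max(scores.get(term, 0.0), tf * idf)
    # Highest score first; ties are broken alphabetically.
    return sorted(scores, key=lambda term: (-scores[term], term))


def tag_part_of_speech(keywords):
    return [(kw, "verb" if kw in VERB_LEXICON else "other") for kw in keywords]


def to_structured(text):
    segments = segment(text)
    tagged = tag_part_of_speech(rank_keywords(segments))
    verbs = [kw for kw, pos in tagged if pos == "verb"]  # target part of speech
    extracted = [kw for kw in verbs if re.fullmatch(r"[a-z]{3,}", kw)]
    # Preset structure: keyword -> string (here, the segment containing the keyword).
    return {kw: next(s for s in segments if kw in tokenize(s)) for kw in extracted}


recognized = "Login failed for user 17. Access denied by policy; the gateway warned the operator."
structured = to_structured(recognized)
```

In a real pipeline the lexicon lookup would be replaced by the trained tagger, and the regular expression would encode whatever keyword format the deployment expects.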
Step 103, performing format parsing on at least one piece of semi-structured data corresponding to the product log data set to obtain a parsed structured data set.
In some embodiments, the execution body may perform format parsing on at least one semi-structured data corresponding to the product log dataset to obtain a parsed structured data set.
As an example, the execution body may perform format analysis on at least one semi-structured data corresponding to the product log data set by using a JSON analysis library, to obtain an analyzed structured data set.
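A minimal sketch of this parsing step, using Python's standard `json` module as the JSON parsing library. The sample entries and the choice to set aside unparseable or non-object entries are assumptions for illustration.

```python
import json


def parse_semi_structured(entries):
    """Parse JSON log entries; collect entries that cannot be used separately."""
    parsed, rejected = [], []
    for entry in entries:
        try:
            record = json.loads(entry)
        except json.JSONDecodeError:
            rejected.append(entry)
            continue
        # Only JSON objects map naturally onto structured records.
        if isinstance(record, dict):
            parsed.append(record)
        else:
            rejected.append(entry)
    return parsed, rejected


semi_structured = [
    '{"product": "A", "event": "access", "count": 3}',
    '{"product": "B", "event": "warning"',  # truncated entry
    '[1, 2, 3]',
]
parsed_group, rejected_entries = parse_semi_structured(semi_structured)
```

Keeping the rejected entries rather than discarding them leaves room for a later repair or manual review step.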
Step 104, integrating the structured data set, at least one piece of structured data corresponding to the product log data set, and the parsed structured data set to obtain an integrated structured data set.
In some embodiments, the execution body may integrate the structured data set, the at least one piece of structured data corresponding to the product log data set, and the parsed structured data set to obtain an integrated structured data set.
As an example, the execution body may perform keyword extraction on the structured data set, the at least one piece of structured data corresponding to the product log data set, and the parsed structured data set to obtain an extracted structured data set. Then, the extracted structured data in the extracted structured data set is merged to obtain a merged structured data set, which serves as the integrated structured data set.
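One way to picture this extract-then-merge example is the following sketch. The key fields, the use of a "product" field as the merge key, and the sample records are all hypothetical; the disclosure does not say which keywords are extracted or how records are matched.

```python
# Hypothetical key fields kept from every record; "product" doubles as the merge key.
KEY_FIELDS = ("product", "event", "user")


def extract_fields(record):
    """Keep only the key fields that are present in the record."""
    return {key: record[key] for key in KEY_FIELDS if key in record}


def integrate(*sources):
    """Merge records from all sources that share the same product value."""
    merged = {}
    for source in sources:
        for record in source:
            slim = extract_fields(record)
            if "product" not in slim:
                continue  # nothing to merge on
            merged.setdefault(slim["product"], {}).update(slim)
    return list(merged.values())


structured_group = [{"product": "A", "event": "access"}]
log_structured = [{"product": "A", "user": "u17", "debug": "x"}, {"event": "orphan"}]
parsed_group = [{"product": "B", "event": "warning"}]
integrated = integrate(structured_group, log_structured, parsed_group)
```

Later sources overwrite earlier values for the same field, which is a design choice of this sketch rather than something the text prescribes.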
Optionally, the execution body may integrate the structured data set, the at least one piece of structured data corresponding to the product log data set, and the parsed structured data set to obtain the integrated structured data set through the following steps:
First step, performing data preprocessing on the at least one piece of structured data to obtain a preprocessed structured data set.
Here, the data preprocessing may be data normalization.
Second step, creating an empty pivot table.
Here, the empty pivot table may be a pivot table that contains no data.
As an example, the execution body may create the pivot table using Excel.
Third step, filling each piece of preprocessed structured data in the preprocessed structured data set and each piece of structured data in the structured data set into the empty pivot table to obtain a filled pivot table, where the filled pivot table takes each piece of preprocessed structured data in the preprocessed structured data set as column data and each piece of structured data in the structured data set as row data.
Fourth step, performing field matching between each row of data in the filled pivot table and each corresponding column of data to obtain a set of matched structured data groups, where each matched structured data group in the set includes one piece of row data and one piece of column data.
Here, a matched structured data group may represent a group in which one piece of row data and the corresponding piece of column data are associated. The association may mean that the row data can contain the corresponding column data. For example, a piece of row data may represent a user name, and the corresponding column data may represent "Zhang San".
As an example, the execution body may traverse each row of data and each corresponding column of data in the filled pivot table to obtain a traversed row data set and a traversed column data set. Then, each piece of traversed row data in the traversed row data set is combined with traversed column data in the traversed column data set to generate combined data groups, and the resulting set of combined data groups serves as the set of matched structured data groups. Here, row-column data combination may refer to combining one piece of row data with one piece of column data.
Fifth step, for each matched structured data group in the set of matched structured data groups, connecting the pieces of matched structured data in the group to obtain connected structured data.
As an example, the execution body may fill each piece of matched structured data in the matched structured data group into a two-dimensional array to obtain filled structured data, which serves as the connected structured data. For example, the filled structured data may be [user name: Zhang San].
Sixth step, determining the obtained pieces of connected structured data as the integrated structured data set.
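The pivot-table steps above can be approximated without a spreadsheet. The sketch below assumes that row data are field names, that column data are records carrying a field name and a value, and that normalization means trimming whitespace and lowercasing field names; these are illustrative assumptions, and a plain dictionary stands in for the Excel pivot table mentioned in the text.

```python
def normalize(record):
    """Data normalization: trim whitespace and lowercase the field name."""
    return {"field": record["field"].strip().lower(), "value": record["value"].strip()}


def build_pivot(row_fields, column_records):
    """Fill an empty pivot table: rows are field names, columns are records."""
    table = {row: [] for row in row_fields}  # the empty pivot table
    for record in column_records:
        if record["field"] in table:  # field matching between row and column
            table[record["field"]].append(record["value"])
    return table


def connect(table):
    """Join each matched (row, column) pair into one entry of a 2-D array."""
    return [[row, value] for row, values in table.items() for value in values]


rows = ["user name", "product name"]
columns = [
    {"field": " User Name ", "value": "Zhang San"},
    {"field": "product name", "value": "Widget"},
    {"field": "unknown", "value": "x"},
]
pivot = build_pivot(rows, [normalize(c) for c in columns])
connected = connect(pivot)
```

The column record with the unmatched field is left out of the pivot table, so it never reaches the connected result.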
Optionally, the executing body may integrate the structured data set with the parsed structured data set by the following steps, where at least one structured data corresponding to the product log dataset is integrated with the parsed structured data set to obtain an integrated structured dataset:
The first step, carrying out knowledge graph conversion on at least one structured data corresponding to the structured data set in the product log data set and the parsed structured data set to obtain a structured data knowledge graph, wherein node information in the structured data knowledge graph is each structured data in the structured data set, each structured data in the at least one structured data and each parsed structured data in the parsed structured data set, and edges in the knowledge graph are relations between the node information and the node information, and the node information comprises: node attribute information and identifier information.
Here, the structured data knowledge graph may refer to a knowledge graph in which node information in a graph corresponding to structured data and a corresponding edge are connected. Here, the node attribute information may refer to attribute information of a map node. For example, the node attribute information may refer to name information of the node. The identifier information may be identity number information of the node. For example, the identifier information may refer to identity ID information. The above-mentioned identity ID information may be referred to as "number 001".
And secondly, carrying out node pruning on the structured data knowledge graph to obtain a knowledge graph after node pruning.
Here, the node pruning may delete and preserve the nodes by the importance degree of the nodes. The knowledge graph after pruning the nodes can refer to the knowledge graph after deleting the invalid nodes.
And thirdly, carrying out graph network analysis on the knowledge graph after pruning of the nodes to obtain the analyzed knowledge graph.
As an example, the executing body may analyze the node pruned knowledge graph by using a graph network analysis algorithm to obtain an analyzed knowledge graph. The graph network analysis algorithm described above may be referred to as the PageRank algorithm.
Fourth step, performing entity linking between each entity information in the analyzed knowledge graph and each entity information in a preset data source to obtain a linked entity information set.
Here, the preset data source may refer to Wikipedia. The preset data source can be used to link its entity information to the analyzed knowledge graph. Each entity information may include, but is not limited to, at least one of: entity attribute information, context information. The entity information in the analyzed knowledge graph refers to the entity information corresponding to the structured data in the analyzed knowledge graph.
As an example, the executing body may link each entity information in the analyzed knowledge graph with the entity information in the preset data source that has the same entity name, so as to obtain a linked entity information set. The entity name may refer to a symbolic name. For example, the entity name may refer to a Uniform Resource Locator (URL).
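As a non-limiting illustration, linking by identical entity name may be sketched as a dictionary lookup. The record layout (a `name` key plus arbitrary attributes) and the function name `link_entities` are assumptions made for this sketch.

```python
def link_entities(graph_entities, source_entities):
    """Pair each graph entity with the source entity that has the same name."""
    source_by_name = {entity["name"]: entity for entity in source_entities}
    linked = []
    for entity in graph_entities:
        match = source_by_name.get(entity["name"])
        if match is not None:
            linked.append({"name": entity["name"], "graph": entity, "source": match})
    return linked

# Hypothetical entity information; the names stand in for symbolic names such as URLs.
graph_entities = [{"name": "product_a", "price": 10}, {"name": "product_b", "price": 7}]
source_entities = [{"name": "product_a", "context": "listed in the preset data source"}]
linked = link_entities(graph_entities, source_entities)
```

Entities without a counterpart in the preset data source are simply left out of the linked set in this sketch.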
Fifth step, performing a data consistency test on each linked entity information in the linked entity information set to obtain a test result.
As an example, the execution body may compare each linked entity information in the linked entity information set against predefined attribute rules and use the comparison result as the test result. The predefined attribute rules may refer to constraint rules on predefined attributes. For example, a predefined attribute rule may be that "the data types corresponding to the attribute information are the same, and each entity information in the preset data source is associated with each entity information in the analyzed knowledge graph". The data type of the attribute value corresponding to the attribute information may be integer. The degree of association between entity information in the preset data source and entity information in the analyzed knowledge graph may be represented by a similarity, and the entity information is regarded as associated when the similarity is greater than 80%.
Sixth step, in response to determining that the test result represents an inconsistency, performing data alignment on at least one linked entity information in the linked entity information set corresponding to the test result to obtain an aligned entity information set.
Here, the data types of the respective aligned entity information in the aligned entity information set are identical.
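As a non-limiting illustration, the consistency test and the data alignment may be sketched as follows. The sketch assumes that similarity is measured on entity names with `difflib.SequenceMatcher` from the Python standard library and that alignment means casting attribute values to integer; both choices, and the record layout, are assumptions made for illustration.

```python
from difflib import SequenceMatcher

SIMILARITY_THRESHOLD = 0.8  # "greater than 80%" in the description

def is_consistent(graph_entity, source_entity):
    """Check that value types match and that the names are similar enough."""
    same_type = type(graph_entity["value"]) is type(source_entity["value"])
    similarity = SequenceMatcher(None, graph_entity["name"], source_entity["name"]).ratio()
    return same_type and similarity > SIMILARITY_THRESHOLD

def align(entity, target_type=int):
    """Return a copy of the entity whose value is cast to the target type."""
    aligned_entity = dict(entity)
    aligned_entity["value"] = target_type(entity["value"])
    return aligned_entity

graph_entity = {"name": "order_count", "value": 12}
source_entity = {"name": "order_count", "value": "12"}   # same fact, stored as text
consistent = is_consistent(graph_entity, source_entity)
aligned = [align(graph_entity), align(source_entity)]
```

Here the two records fail the test only because their data types differ; after alignment both values are integers and the pair passes the same test.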
Seventh step, determining a scatter diagram of each aligned entity information in the aligned entity information set to obtain an entity information scatter diagram.
As an example, the execution body may perform hierarchical clustering on each aligned entity information in the aligned entity information set, using a preset number of clusters, to obtain a hierarchical clustering result, and then visualize the hierarchical clustering result as a scatter diagram to obtain the entity information scatter diagram. Here, the specific setting of the preset number of clusters is not limited. For example, the preset number of clusters may be 100.
Eighth step, determining a fit line of the entity information scatter diagram.
Here, the fit line may refer to a straight line or a curve that approximately represents the distribution trend in the entity information scatter diagram.
As an example, the execution subject may determine the fit line in the entity information scatter plot by a regression algorithm.
Ninth step, determining the distance between the fit line and each aligned entity information in the aligned entity information set to obtain an entity information fitting distance set.
Tenth step, eliminating from the entity information fitting distance set each entity information fitting distance that does not meet a preset range condition, to obtain a post-elimination entity information fitting distance set.
Here, the preset range condition may be that "the entity information fitting distance lies in the range (0.2, 0.9)".
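As a non-limiting illustration, the eighth to tenth steps may be sketched with an ordinary least-squares straight line. The sketch assumes that each aligned entity information has already been reduced to a two-dimensional point and that the fitting distance is the perpendicular distance to the line; the disclosure only states that a regression algorithm may be used, so these details are assumptions.

```python
import math

def fit_line(points):
    """Ordinary least-squares fit y = slope * x + intercept."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def distance_to_line(point, slope, intercept):
    """Perpendicular distance from a point to the fitted line."""
    x, y = point
    return abs(slope * x - y + intercept) / math.sqrt(slope ** 2 + 1)

def keep_in_range(points, low=0.2, high=0.9):
    """Keep only points whose fitting distance lies in the open range (low, high)."""
    slope, intercept = fit_line(points)
    return [p for p in points if low < distance_to_line(p, slope, intercept) < high]

points = [(0.0, 0.5), (0.0, -0.5), (2.0, 0.5), (2.0, -0.5),
          (1.0, 0.1), (1.0, -0.1), (1.0, 2.0), (1.0, -2.0)]
kept = keep_in_range(points)
```

The sample points are symmetric about the x-axis, so the fitted line is y = 0 and each distance equals the absolute y value: the points at distance 0.1 and 2.0 are eliminated and the four points at distance 0.5 remain.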
Eleventh step, encrypting the entity information corresponding to each entity information fitting distance in the post-elimination entity information fitting distance set to obtain an encrypted entity information set, which is used as the integrated structured data set.
As an example, the executing body may encrypt, by using an encryption algorithm, the entity information corresponding to each entity information fitting distance in the post-elimination entity information fitting distance set to obtain the encrypted entity information set, which is used as the integrated structured data set. The encryption algorithm may be the Digital Signature Algorithm (DSA).
The content of the first to eleventh steps above serves as an inventive point of the present disclosure and solves the third technical problem mentioned in the background art: complex data is highly correlated and data integration efficiency is low, so the computing resources required for data repair increase and the repair takes longer; the resulting gap between the resources of the computer carrying out the structured data repair and the resources required lowers efficiency and lengthens the structured data repair period. If these factors are addressed, the computing resources required for data repair can be reduced and the structured data repair period shortened. To achieve this effect, in the first step, knowledge graph conversion is performed on the structured data set, at least one structured data corresponding to the product log data set, and the parsed structured data set to obtain a structured data knowledge graph, where the node information in the structured data knowledge graph is each structured data in the structured data set, each structured data in the at least one structured data, and each parsed structured data in the parsed structured data set, and the node information includes node attribute information and identifier information.
Thus, convenience can be provided for subsequent processing. And secondly, carrying out node pruning on the structured data knowledge graph to obtain a knowledge graph after node pruning. Therefore, redundant information can be reduced, and the efficiency of subsequent data processing is improved. And thirdly, carrying out graph network analysis on the knowledge graph after pruning of the nodes to obtain the analyzed knowledge graph. Therefore, the relation in the knowledge graph can be determined, and complex data can be processed conveniently. And data confusion is avoided. And fourthly, carrying out entity linking on each entity information in the analyzed knowledge graph and each entity information in the preset data source to obtain a linked entity information set. Thus, knowledge-graph information can be enriched. And fifthly, carrying out data consistency test on each piece of linked entity information in the linked entity information set to obtain a test result. Thus, the accuracy of the data can be ensured. And sixthly, in response to determining that the test results represent inconsistent tests, carrying out data alignment on at least one piece of linked entity information in the linked entity information set corresponding to the test results to obtain an aligned entity information set. Thus, data confusion due to too large data size can be avoided, thereby reducing the computational resources required for data repair thereafter. Seventhly, determining a scatter diagram of each aligned entity information in the aligned entity information set, and obtaining an entity information scatter diagram. And eighth step, determining fitting lines of the entity information scatter diagram. Thus, the relationship and the trend of change between the data can be better described. 
And a ninth step of determining the distances between the fitting lines and the respective aligned entity information in the aligned entity information set as an entity information fitting distance set. And tenth, eliminating the entity information fitting distance which does not meet the preset range condition in the entity information fitting distance set, and obtaining an entity information fitting distance set after elimination. Therefore, the efficiency can be improved, and the period of repairing the structured data can be shortened. And eleventh step, encrypting the entity information corresponding to each rejected entity information fitting distance in the rejected entity information fitting distance set to obtain an encrypted entity information set which is used as an integrated structured data set. Thus, the security of the structured data repair can be improved. Therefore, the computing resources required by data repair are reduced, and the period of structured data repair is shortened.
Step 105, determining the data hierarchy of each integrated structured data in the integrated structured data set, and obtaining a data hierarchy list.
In some embodiments, the executing entity may determine a data hierarchy of each integrated structured data in the integrated structured data set, and obtain a data hierarchy list.
Here, the data hierarchy list may characterize the data lists of the m-th layer to the n-th layer, where m and n are positive integers and m is less than or equal to n. For example, the data hierarchy list may characterize the data lists of layer 1 to layer 3.
Optionally, the executing entity may determine a data hierarchy of each integrated structured data in the integrated structured data set by the following steps to obtain a data hierarchy list:
First, determining the tree data structure of the integrated structured data set.
Here, the above tree data structure may refer to a data structure of a tree type.
Second step, traversing each subtree of the tree data structure to obtain a post-traversal structured data result.
Here, the post-traversal structured data result may characterize the hierarchical relationship of the individual subtrees in the tree data structure.
Third step, performing hierarchy division on the data hierarchy corresponding to the post-traversal structured data result to obtain at least one divided data hierarchy.
As an example, the executing body may divide the data hierarchy corresponding to the post-traversal structured data result according to a preset condition to obtain at least one divided data hierarchy. The preset condition may refer to "structured data having the same data hierarchy". For example, the preset condition may refer to "structured data whose data hierarchy is the first layer".
Fourth step, sorting the divided data hierarchies in the at least one divided data hierarchy to obtain the data hierarchy list.
As an example, the executing body may sort the divided data hierarchies in the at least one divided data hierarchy in ascending order of hierarchy number to obtain the data hierarchy list.
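As a non-limiting illustration, the four steps above may be sketched as a depth-first traversal that records the layer of every node, groups nodes of the same layer, and sorts the layers in ascending order. The tree is assumed to be given as a mapping from a node to its children; the node names and the function name `level_list` are hypothetical.

```python
from collections import defaultdict

def level_list(tree, root):
    """Return [(layer, [nodes...]), ...] sorted by ascending layer number."""
    levels = defaultdict(list)
    stack = [(root, 1)]                      # the root is treated as layer 1
    while stack:
        node, depth = stack.pop()
        levels[depth].append(node)
        # Push children in reverse so they are visited in their listed order.
        for child in reversed(tree.get(node, [])):
            stack.append((child, depth + 1))
    return [(depth, levels[depth]) for depth in sorted(levels)]

tree = {"product": ["orders", "users"], "orders": ["order_items"]}
hierarchy = level_list(tree, "product")
# hierarchy == [(1, ["product"]), (2, ["orders", "users"]), (3, ["order_items"])]
```

The result corresponds to a data hierarchy list covering layer 1 to layer 3.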
Step 106, performing user behavior determination on the structured data corresponding to each data hierarchy in the data hierarchy list to obtain a user behavior list.
In some embodiments, the executing body may perform user behavior determination on the structured data corresponding to each data hierarchy in the data hierarchy list to obtain a user behavior list.
Here, the above-described user behavior may refer to an operation behavior of the user. For example, the user behavior may refer to a user browsing behavior. The above-described user behavior list may refer to a list including operation behaviors of a plurality of users.
Optionally, the executing body may perform user behavior determination on the structured data corresponding to each data hierarchy in the data hierarchy list through the following steps to obtain a user behavior list:
First step, determining the operation time of the structured data corresponding to each data hierarchy in the data hierarchy list to obtain the structured data operation time.
Here, the operation time may be, for example, 10:30:15 on December 1, 2023.
Second step, in response to determining that the structured data operation time meets a preset time condition, determining operation frequency information of the user to obtain the operation frequency information.
Here, the above-mentioned preset time condition may mean "the preset time range is between 9 am and 5 pm".
Third step, determining user login information of the structured data corresponding to each data hierarchy in the data hierarchy list to obtain the user login information.
Fourth step, determining abnormal information of the structured data corresponding to each data hierarchy in the data hierarchy list to obtain the abnormal information.
Here, the abnormal information may refer to system failure information.
Fifth step, combining the operation frequency information, the user login information and the abnormal information into the user behavior list.
Here, the above-described user behavior list may characterize an unordered list arranged by user behavior information.
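As a non-limiting illustration, the five steps above may be sketched as follows. The sketch assumes that each operation is a record with a user name and a timestamp, that the preset time condition is the 9 am to 5 pm window mentioned above, and that operation frequency is a per-user count; the field names and the function name `build_user_behavior_list` are hypothetical.

```python
from datetime import datetime, time

WINDOW_START, WINDOW_END = time(9, 0), time(17, 0)   # preset time condition

def build_user_behavior_list(operations, login_info, abnormal_info):
    """Count in-window operations per user and combine with login and abnormal info."""
    frequency = {}
    for op in operations:
        if WINDOW_START <= op["at"].time() <= WINDOW_END:
            frequency[op["user"]] = frequency.get(op["user"], 0) + 1
    return [frequency, login_info, abnormal_info]

operations = [
    {"user": "a", "at": datetime(2023, 12, 1, 10, 30, 15)},
    {"user": "a", "at": datetime(2023, 12, 1, 20, 0, 0)},    # outside the window
    {"user": "b", "at": datetime(2023, 12, 1, 11, 0, 0)},
]
behavior_list = build_user_behavior_list(
    operations,
    login_info={"a": "logged in", "b": "logged in"},
    abnormal_info=["system failure"],
)
```

Only operations inside the time window contribute to the frequency information, so user "a" is counted once.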
Step 107, performing security inspection on the integrated structured data set according to the user behavior list to obtain an inspection result.
In some embodiments, the executing body may perform security verification on the integrated structured dataset according to the user behavior list, to obtain a verification result.
Here, the above test result may characterize a test abnormality or a test normal.
As an example, the executing body may perform access permission verification on each user behavior in the user behavior list, to obtain a permission verification result. And in response to determining that the authority verification result represents verification abnormality, performing significance verification on the integrated structured data set to obtain a verification result. And in response to determining that the authority verification result characterizes verification as normal, determining the authority verification result as a verification result of the integrated structured dataset.
Step 108, in response to determining that the inspection result represents an inspection abnormality, performing data repair on the integrated structured data set to obtain a repaired structured data set.
In some embodiments, the execution body may perform data repair on the integrated structured data set to obtain a repaired structured data set in response to determining that the inspection result characterizes an inspection anomaly.
As an example, the execution body may determine the inspection result, and in response to determining that the inspection result is a format error result, perform data format conversion on the integrated structured dataset corresponding to the inspection result, to obtain a converted structured dataset, which is used as a repaired structured dataset. And then, in response to determining that the test result is a data repetition error, performing deduplication processing on the integrated structured data set corresponding to the test result to obtain a deduplicated structured data set, and using the deduplicated structured data set as a repaired structured data set.
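As a non-limiting illustration, the repair in this example may be sketched as a dispatch on the kind of inspection result. The result labels `"format_error"` and `"duplicate_error"`, the choice of stripping whitespace and casting values to text as the format conversion, and the function name `repair` are assumptions made for this sketch; the disclosure does not specify the conversion.

```python
def repair(records, inspection_result):
    """Repair a list of flat records according to the inspection result."""
    if inspection_result == "format_error":
        # Assumed conversion: store every value as trimmed text.
        return [{key: str(value).strip() for key, value in r.items()} for r in records]
    if inspection_result == "duplicate_error":
        seen, unique = set(), []
        for r in records:
            fingerprint = tuple(sorted(r.items()))   # order-independent record key
            if fingerprint not in seen:
                seen.add(fingerprint)
                unique.append(r)
        return unique
    return records   # nothing to repair for other results

records = [{"id": "001", "name": " a "}, {"id": "001", "name": " a "}, {"id": "002", "name": "b"}]
converted = repair(records, "format_error")
deduplicated = repair(records, "duplicate_error")
```

The deduplication keeps the first occurrence of each record and preserves the original order.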
Optionally, the executing body may perform data repair on the integrated structured dataset in response to determining that the inspection result characterizes an inspection anomaly by:
First step, in response to determining that the inspection result represents an inspection abnormality, performing identity permission setting on the integrated structured data set to obtain a set structured data set.
Second step, performing security vulnerability detection on the set structured data set to obtain a security vulnerability detection result.
Here, the security vulnerabilities may include, but are not limited to, at least one of: user behavior is abnormal and code logic is wrong.
As an example, the execution body may use the vulnerability scanning tool to perform security vulnerability detection on the set structured data set, so as to obtain a security vulnerability detection result.
Third step, in response to determining that the security vulnerability detection result represents abnormal user behavior, performing abnormal data isolation on the set structured data set corresponding to the security vulnerability detection result to obtain an isolated structured data set.
As an example, the executing body may, in response to determining that the security vulnerability detection result represents abnormal user behavior, determine the abnormal data of the set structured data set corresponding to the security vulnerability detection result to obtain structured data abnormal data, and then remove the structured data abnormal data from the set structured data set to obtain a post-removal structured data set, which is used as the isolated structured data set.
Fourth step, performing security verification on the isolated structured data set to obtain a verified structured data set, which is used as the repaired structured data set.
Step 109, storing the repaired structured data set to a log data analysis platform.
In some embodiments, the execution body may store the post-repair structured data set to a log data analysis platform.
Here, the above-described log data analysis platform may characterize a platform for analyzing a large amount of log data. For example, the log data analysis platform may be Splunk.
Optionally, after the step 109, the method further includes:
First step, using the real-time detection function in the log data analysis platform to detect the repaired structured data set in real time to obtain a real-time detection result.
Here, the real-time detection function may characterize a function capable of detecting the repaired structured data set in real time. The real-time detection result may represent either a real-time detection abnormality or a real-time detection normality.
Second step, in response to determining that the real-time detection result represents an abnormality, performing probability density determination on each repaired structured data in the repaired structured data set by using a Gaussian distribution to obtain a probability density set.
Here, the gaussian distribution may also refer to a normal distribution (Normal distribution). The probability density in the probability density set may characterize how dense the post-repair structured data is in the gaussian distribution.
Third step, determining an abnormal data threshold range corresponding to the Gaussian distribution.
Here, the abnormal data threshold range may refer to the data range lying outside (-3, +3) standard deviations of the Gaussian distribution.
Fourth step, determining at least one probability density in the probability density set that does not satisfy the abnormal data threshold range as an abnormal probability density set.
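As a non-limiting illustration, the second to fourth steps may be sketched with `statistics.NormalDist` from the Python standard library. The sketch assumes that each repaired structured data has been reduced to a single numeric value, that the Gaussian distribution is fitted to those values, and that a value is treated as abnormal when it lies three or more standard deviations from the mean; this reading of the threshold range, and the function name `abnormal_densities`, are assumptions.

```python
from statistics import NormalDist, mean, pstdev

def abnormal_densities(values, sigma_limit=3.0):
    """Return (value, probability density) pairs for values outside the sigma limit."""
    dist = NormalDist(mean(values), pstdev(values))
    flagged = []
    for value in values:
        z_score = (value - dist.mean) / dist.stdev
        if abs(z_score) >= sigma_limit:
            flagged.append((value, dist.pdf(value)))
    return flagged

# Nineteen ordinary readings and one extreme reading (hypothetical numbers).
values = [10.0] * 19 + [100.0]
flagged = abnormal_densities(values)
densities = [NormalDist(mean(values), pstdev(values)).pdf(v) for v in values]
```

In this sample the extreme reading lies roughly 4.4 standard deviations from the mean and is the only value flagged; its probability density is also the smallest in the set.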
Fifth step, performing abnormal data marking on the structured data set corresponding to each abnormal probability density in the abnormal probability density set to obtain an abnormal mark structured data set.
As an example, the execution body may first parse the structured data set corresponding to each abnormal probability density in the abnormal probability density sets to generate a parsed structured data set, to obtain a parsed structured data set. And then, extracting keywords from each parsed structured data set in the parsed structured data set groups to generate keywords corresponding to the structured data, and obtaining a keyword set corresponding to the structured data. And finally, keyword marking is carried out on the analyzed structured data by utilizing the keywords corresponding to each structured data in the keyword set corresponding to the structured data so as to generate marked structured data, and the marked structured data set is obtained and is used as an abnormal marked structured data set. The keywords may be keywords that characterize the anomaly data.
Sixth, displaying the abnormal mark structured data set on an interface of a user operating system to obtain abnormal data display interface information, wherein the abnormal data display interface information comprises: real-time parameter value information.
Here, the real-time parameter value information may refer to real-time data information displayed on an interface of the user operating system. For example, the real-time parameter value information may refer to abnormality degree value information of the abnormality flag structured data.
Seventh step, monitoring the real-time parameter value information of the abnormal data display interface information.
Here, the monitoring may refer to listening for changes in the real-time parameter value information.
Eighth step, in response to determining that a real-time parameter value in the real-time parameter value information is greater than or equal to a preset parameter threshold, sending abnormal information to the user operating system.
Here, the preset parameter threshold may refer to a preset maximum value of the parameter. For example, the preset parameter threshold may be 0.2.
Ninth step, dynamically adjusting the abnormal mark structured data in the abnormal mark structured data set corresponding to the abnormal information to obtain an adjusted structured data set.
Here, the dynamic adjustment may refer to adjusting the abnormal mark structured data of the user operating system according to the real-time situation.
Tenth step, in response to determining that a real-time parameter value in the real-time parameter value information is smaller than the preset parameter threshold, determining at least one abnormal mark structured data whose value is smaller than the preset parameter threshold as a structured data set to be adjusted.
Eleventh step, performing parameter calibration on the structured data set to be adjusted to obtain a calibrated structured data set.
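As a non-limiting illustration, the threshold comparison in the eighth and tenth steps may be sketched as a simple split of the marked records. The field name `anomaly_degree` (standing for the abnormality degree value mentioned above) and the function name `route_by_threshold` are hypothetical; the dynamic adjustment and parameter calibration themselves are not specified by the disclosure and are not modelled here.

```python
PARAMETER_THRESHOLD = 0.2   # example threshold from the description

def route_by_threshold(marked_records, threshold=PARAMETER_THRESHOLD):
    """Split records into those that trigger abnormal information and those to calibrate."""
    to_alert, to_adjust = [], []
    for record in marked_records:
        if record["anomaly_degree"] >= threshold:
            to_alert.append(record)      # eighth step: send abnormal information
        else:
            to_adjust.append(record)     # tenth step: structured data set to be adjusted
    return to_alert, to_adjust

marked_records = [
    {"id": "001", "anomaly_degree": 0.35},
    {"id": "002", "anomaly_degree": 0.20},
    {"id": "003", "anomaly_degree": 0.05},
]
to_alert, to_adjust = route_by_threshold(marked_records)
```

A value exactly equal to the threshold is routed to the alert branch, matching the "greater than or equal to" condition of the eighth step.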
The content of the first to eleventh steps above serves as an inventive point of the present disclosure and solves the second technical problem mentioned in the background art: as more and more repaired structured data sets are stored, the log data analysis platform occupies a large amount of the user operating system's memory, so the load on the user operating system becomes excessive and the stored structured data is prone to abnormality; and because abnormal data cannot be dynamically adjusted, the performance of the user operating system is poor, which may cause the user operating system to fail. If these factors are addressed, the likelihood of abnormality in the stored structured data can be reduced, the performance of the user operating system improved, and failures of the user operating system avoided. To achieve this effect, in the first step, the real-time detection function in the log data analysis platform is used to detect the repaired structured data set in real time to obtain a real-time detection result. Therefore, abnormal conditions of the data can be found in time, and potential problems caused to the log data analysis platform are avoided.
And secondly, responding to the fact that the real-time monitoring result represents that the detection result is abnormal, and carrying out probability density determination on each repaired structured data in the repaired structured data set by utilizing Gaussian distribution to obtain a probability density set. Thus, the individual post-repair structured data can be quantified. And thirdly, determining an abnormal data threshold range corresponding to the Gaussian distribution. And a fourth step of determining at least one probability density of the probability density set that does not satisfy the abnormal data threshold range as an abnormal probability density set. Therefore, abnormal data can be dynamically adjusted, the performance of the user operating system is improved, and faults of the user operating system are avoided. And fifthly, carrying out abnormal data marking on the structured data set corresponding to each abnormal probability density in the abnormal probability density sets to obtain an abnormal marking structured data set. Thus, confusion of data can be avoided. Sixth, displaying the abnormal mark structured data set on an interface of a user operating system to obtain abnormal data display interface information, wherein the abnormal data display interface information comprises: real-time parameter value information. Thus, the subsequent processing can be facilitated. And seventh, monitoring the real-time parameter value information of the abnormal data display interface information. And eighth step, in response to determining that the real-time parameter value information in the real-time parameter value information is greater than or equal to a preset parameter threshold, abnormal information is sent to the user operation system. Thus, the data generating the abnormality can be dynamically adjusted. 
And ninth, dynamically adjusting the abnormal mark structured data in the abnormal mark structured data set corresponding to the abnormal information to obtain an adjusted structured data set. Thus, the user operating system can be prevented from generating faults. And tenth, in response to determining that the real-time parameter value information in the real-time parameter value information is smaller than a preset parameter threshold, determining at least one abnormal mark structured data smaller than the preset parameter threshold as a structured data set to be adjusted. Therefore, the exception of the stored structured data caused by the fact that the log data analysis platform occupies a large memory of the user operating system can be avoided. And eleventh step, carrying out parameter calibration on the structured data set to be adjusted to obtain a calibrated structured data set. Thereby, the accuracy and the comprehensiveness of the structured dataset to be adjusted are improved. Therefore, the possibility of causing the abnormality of the stored structured data is reduced, the performance of the user operating system is improved, and the user operating system is prevented from generating faults.
The above embodiments of the present disclosure have the following advantages: by the aid of the structured data restoration method, complexity among data is reduced, loss of the data in the processing process is reduced, possibility of occurrence of abnormality is reduced, and safety of the data is improved. Specifically, the complexity between data is increased, which easily causes loss of data in the processing process, so that the security of the data is reduced, and the possibility of occurrence of abnormality is increased because: along with the increasing number of different types of data generated by product information, the complexity between the data is improved, the data cannot be comprehensively processed by manual intervention, the association relation between the data cannot be fully considered, the data is easy to lose in the processing process, the safety of the data is reduced, and the possibility of abnormality is improved. Based on this, in the structured data restoration method of some embodiments of the present disclosure, first, log data corresponding to each product information is collected, and a product log data set is obtained. Thus, the product log data set can be obtained to facilitate subsequent operations. And then, generating a structured data group according to at least one unstructured data corresponding to the product log data set. Thus, unstructured data can be converted into structured data, thereby reducing complexity between the data. And then, carrying out format analysis on at least one semi-structured data corresponding to the product log data set to obtain an analyzed structured data set. Thereby, the readability of the data can be improved by format parsing. And secondly, integrating at least one structured data corresponding to the structured data set in the product log data set with the parsed structured data set to obtain an integrated structured data set. 
Therefore, a plurality of data sets can be integrated into one data set, so that the complexity between the data sets is reduced, and the loss generated in the subsequent processing process can be reduced. And then, determining the data level of each integrated structured data in the integrated structured data set to obtain a data level list. Therefore, the integrated structured data set can be hierarchically divided, the association relation among the data can be fully considered, and the loss of the data in the processing process is avoided. And then, carrying out user behavior determination on the structured data corresponding to each data hierarchy in the data hierarchy list to obtain a user behavior list. Thus, the security of data can be improved. And then, carrying out security inspection on the integrated structured data set according to the user behavior list to obtain an inspection result. Thereby, the possibility of occurrence of abnormality can be reduced. And then, responding to the determination that the inspection result represents the inspection abnormality, and carrying out data repair on the integrated structured data set to obtain a repaired structured data set. Thus, the security of data can be improved. And finally, storing the repaired structured data set to a log data analysis platform, so that the safety of the data can be improved. Therefore, the complexity among the data is reduced, and the loss of the data in the processing process is reduced, so that the possibility of abnormality is reduced, and the safety of the data is improved.
With further reference to fig. 2, as an implementation of the method illustrated in the above figures, the present disclosure provides some embodiments of a structured data repair apparatus. These apparatus embodiments correspond to the method embodiments illustrated in fig. 1, and the apparatus is particularly applicable in a variety of electronic devices.
As shown in fig. 2, the structured data repair apparatus 200 of some embodiments includes: a collecting unit 201, a generating unit 202, a format parsing unit 203, a data integration unit 204, a first determining unit 205, a second determining unit 206, a security check unit 207, a data repairing unit 208 and a storage unit 209. The collecting unit 201 is configured to collect log data corresponding to each piece of product information to obtain a product log data set; the generating unit 202 is configured to generate a structured data group according to at least one unstructured data corresponding to the product log data set; the format parsing unit 203 is configured to perform format parsing on at least one semi-structured data corresponding to the product log data set to obtain a parsed structured data set; the data integration unit 204 is configured to integrate the structured data group, at least one structured data corresponding to the product log data set, and the parsed structured data set to obtain an integrated structured data set; the first determining unit 205 is configured to determine a data hierarchy of each integrated structured data in the integrated structured data set to obtain a data hierarchy list; the second determining unit 206 is configured to perform user behavior determination on the structured data corresponding to each data hierarchy in the data hierarchy list to obtain a user behavior list; the security check unit 207 is configured to perform security inspection on the integrated structured data set according to the user behavior list to obtain an inspection result; the data repairing unit 208 is configured to perform data repair on the integrated structured data set to obtain a repaired structured data set in response to determining that the inspection result represents an inspection abnormality; and the storage unit 209 is configured to store the repaired structured data set to a log data analysis platform.
It will be appreciated that the elements described in the apparatus 200 correspond to the various steps in the method described with reference to fig. 1. Thus, the operations, features and resulting benefits described above for the method are equally applicable to the apparatus 200 and the units contained therein, and are not described in detail herein.
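As a rough illustration of how the units of the apparatus 200 could be chained in software, the following minimal sketch maps several units to plain Python functions. All function names, the record layout (`user`, `action`), the regular expression and the allow-list check are assumptions made for illustration and are not taken from the disclosure; the hierarchy and user behavior units 205 and 206 are collapsed into a simple allow-list check, and storing the unrepaired set when no anomaly is found is likewise an assumption.

```python
import json
import re


def collect_logs(products):
    """Acquisition unit 201: gather log entries per product (hypothetical in-memory source)."""
    return [entry for product in products for entry in product["logs"]]


def generate_structured(unstructured):
    """Generation unit 202: toy regex extraction standing in for the text-processing steps."""
    records = []
    for text in unstructured:
        match = re.search(r"user=(\w+) action=(\w+)", text)
        if match:
            records.append({"user": match.group(1), "action": match.group(2)})
    return records


def parse_semi_structured(semi):
    """Format parsing unit 203: parse JSON strings into records."""
    return [json.loads(item) for item in semi]


def integrate(structured, generated, parsed):
    """Data integration unit 204: simplified here to concatenation."""
    return structured + generated + parsed


def security_check(records, allowed_users):
    """Security check unit 207 (simplified): return records whose user is not allowed."""
    return [r for r in records if r.get("user") not in allowed_users]


def repair(records, anomalies):
    """Data repair unit 208 (simplified): isolate the anomalous records."""
    return [r for r in records if r not in anomalies]


def run_pipeline(products, allowed_users, platform):
    logs = collect_logs(products)
    # Assumed classification: dicts are structured, JSON strings semi-structured, other text unstructured.
    structured = [e for e in logs if isinstance(e, dict)]
    semi = [e for e in logs if isinstance(e, str) and e.lstrip().startswith("{")]
    unstructured = [e for e in logs if isinstance(e, str) and not e.lstrip().startswith("{")]
    integrated = integrate(structured, generate_structured(unstructured), parse_semi_structured(semi))
    anomalies = security_check(integrated, allowed_users)
    # Repair only when the check reports an anomaly.
    result = repair(integrated, anomalies) if anomalies else integrated
    platform.extend(result)  # storage unit 209: the "platform" is just a list here
    return result
```

Calling `run_pipeline` with one product whose logs mix a dict, a JSON string and a free-text line exercises every step; the free-text record for a user outside the allow-list is isolated before storage.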
Referring now to FIG. 3, a schematic diagram of an electronic device (e.g., computing device) 300 suitable for use in implementing some embodiments of the present disclosure is shown. The electronic device shown in fig. 3 is merely an example and should not impose any limitations on the functionality and scope of use of embodiments of the present disclosure.
As shown in fig. 3, the electronic device 300 may include a processing means (e.g., a central processing unit, a graphics processor, etc.) 301 that may perform various suitable actions and processes in accordance with programs stored in a Read Only Memory (ROM) 302 or loaded from a storage means 308 into a Random Access Memory (RAM) 303. The RAM 303 also stores various programs and data required for the operation of the electronic device 300. The processing means 301, the ROM 302, and the RAM 303 are connected to each other through a bus 304. An input/output (I/O) interface 305 is also connected to the bus 304.
In general, the following devices may be connected to the I/O interface 305: input devices 306 including, for example, a touch screen, touchpad, keyboard, mouse, camera, microphone, accelerometer, gyroscope, etc.; an output device 307 including, for example, a Liquid Crystal Display (LCD), a speaker, a vibrator, and the like; storage 308 including, for example, magnetic tape, hard disk, etc.; and communication means 309. The communication means 309 may allow the electronic device 300 to communicate with other devices wirelessly or by wire to exchange data. While fig. 3 shows an electronic device 300 having various means, it is to be understood that not all of the illustrated means are required to be implemented or provided. More or fewer devices may be implemented or provided instead. Each block shown in fig. 3 may represent one device or a plurality of devices as needed.
In particular, according to some embodiments of the present disclosure, the processes described above with reference to flowcharts may be implemented as computer software programs. For example, some embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method shown in the flow chart. In such embodiments, the computer program may be downloaded and installed from a network via communications device 309, or from storage device 308, or from ROM 302. The computer program, when executed by the processing means 301, performs the functions defined in the methods of some embodiments of the present disclosure.
It should be noted that, the computer readable medium described in some embodiments of the present disclosure may be a computer readable signal medium or a computer readable storage medium or any combination of the two. The computer readable storage medium can be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples of the computer-readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination. In some embodiments of the present disclosure, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In some embodiments of the present disclosure, however, the computer-readable signal medium may comprise a data signal propagated in baseband or as part of a carrier wave, with the computer-readable program code embodied therein. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. 
Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: electrical wires, fiber optic cables, RF (radio frequency), and the like, or any suitable combination.
In some embodiments, clients and servers may communicate using any currently known or future developed network protocol, such as HTTP (HyperText Transfer Protocol), and may be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include a local area network ("LAN"), a wide area network ("WAN"), an internetwork (e.g., the Internet), and peer-to-peer networks (e.g., ad hoc peer-to-peer networks), as well as any currently known or future developed networks.
The computer readable medium may be contained in the electronic device, or may exist alone without being incorporated into the electronic device. The computer readable medium carries one or more programs which, when executed by the electronic device, cause the electronic device to: collect log data corresponding to each piece of product information to obtain a product log data set; generate a structured data set from at least one piece of unstructured data in the product log data set; perform format parsing on at least one piece of semi-structured data in the product log data set to obtain a parsed structured data set; integrate at least one piece of structured data in the product log data set with the structured data set and the parsed structured data set to obtain an integrated structured data set; determine the data hierarchy of each integrated structured data in the integrated structured data set to obtain a data hierarchy list; determine the user behavior of the structured data corresponding to each data hierarchy in the data hierarchy list to obtain a user behavior list; perform a security check on the integrated structured data set according to the user behavior list to obtain a check result; in response to determining that the check result represents a check anomaly, perform data repair on the integrated structured data set to obtain a repaired structured data set; and store the repaired structured data set to a log data analysis platform.
Computer program code for carrying out operations of some embodiments of the present disclosure may be written in one or more programming languages, including object oriented programming languages such as Java, Smalltalk and C++, and conventional procedural programming languages such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or may be connected to an external computer (for example, through the Internet using an Internet service provider).
The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
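The remark that two blocks shown in succession may be executed substantially concurrently can be illustrated with the generation step and the format parsing step, which both read from the product log data set and do not appear to consume each other's output. The sketch below is a minimal illustration under that assumption; the function names and the record layout are hypothetical, and a thread pool is only one of several ways to run the two blocks side by side.

```python
import json
from concurrent.futures import ThreadPoolExecutor


def generate_from_unstructured(texts):
    # Placeholder for the generation step: wrap each trimmed text in a record.
    return [{"source": "unstructured", "text": t.strip()} for t in texts]


def parse_semi_structured(docs):
    # Placeholder for the format parsing step: parse JSON objects and tag their origin.
    return [dict(json.loads(d), source="semi-structured") for d in docs]


def run_independent_blocks(texts, docs):
    # Submit both blocks to a small thread pool; neither waits for the other.
    with ThreadPoolExecutor(max_workers=2) as pool:
        generated = pool.submit(generate_from_unstructured, texts)
        parsed = pool.submit(parse_semi_structured, docs)
        # result() blocks until each block has finished, so the integration
        # step that follows would still see both outputs.
        return generated.result(), parsed.result()
```

The later integration step depends on both results, so it would still have to run after the two futures complete.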
The units described in some embodiments of the present disclosure may be implemented in software or in hardware. The described units may also be provided in a processor, which may be described, for example, as: a processor comprising an acquisition unit, a generation unit, a format parsing unit, a data integration unit, a first determining unit, a second determining unit, a security check unit, a data repair unit and a storage unit. In some cases the names of these units do not limit the units themselves; for example, the acquisition unit may also be described as "a unit that collects log data corresponding to each piece of product information to obtain a product log data set".
The functions described above herein may be performed, at least in part, by one or more hardware logic components. For example, without limitation, exemplary types of hardware logic components that may be used include: a Field Programmable Gate Array (FPGA), an Application Specific Integrated Circuit (ASIC), an Application Specific Standard Product (ASSP), a system on a chip (SOC), a Complex Programmable Logic Device (CPLD), and the like.
The above description is only illustrative of some preferred embodiments of the present disclosure and of the technical principles employed. It will be appreciated by those skilled in the art that the scope of the invention involved in the embodiments of the present disclosure is not limited to technical solutions formed by the specific combination of the above features, but also covers other technical solutions formed by any combination of the above features or their equivalents without departing from the inventive concept, for example, technical solutions formed by substituting the above features with (but not limited to) features having similar functions disclosed in the embodiments of the present disclosure.

Claims (9)

1. A method of structured data repair, comprising:
Collecting log data corresponding to each piece of product information to obtain a product log data set;
Generating a structured data set according to at least one unstructured data corresponding to the product log data set;
carrying out format analysis on at least one semi-structured data corresponding to the product log data set to obtain an analyzed structured data set;
integrating at least one structured data in the product log data set with the structured data set and the parsed structured data set to obtain an integrated structured data set;
Determining the data level of each integrated structured data in the integrated structured data set to obtain a data level list;
user behavior determination is carried out on the structured data corresponding to each data hierarchy in the data hierarchy list, and a user behavior list is obtained;
Performing security inspection on the integrated structured data set according to the user behavior list to obtain an inspection result;
Responding to the determination that the inspection result represents inspection abnormality, carrying out data repair on the integrated structured data set to obtain a repaired structured data set;
And storing the repaired structured data set to a log data analysis platform.
2. The method of claim 1, wherein the generating a structured dataset from the corresponding at least one unstructured data in the product log dataset comprises:
Performing text recognition on each unstructured data in the at least one unstructured data to generate recognized text data, and obtaining a recognized text data set;
for each identified text data in the identified text data set, performing the following processing steps:
performing text segmentation on the identified text data to generate a segmented text data set;
extracting keywords from the segmented text data set to generate an extracted text keyword group;
performing part-of-speech tagging on the extracted text keyword group to obtain a text keyword group with the part-of-speech tagged;
Screening part-of-speech tagged text keywords with part-of-speech information being target part-of-speech information from the obtained part-of-speech tagged text keyword group set to obtain a target part-of-speech text keyword set;
Data extraction is carried out on at least one target part-of-speech text keyword in the target part-of-speech text keyword set, and extracted keyword groups are obtained;
and converting each extracted keyword in the extracted keyword groups according to a preset structure to obtain a structured data set.
3. The method of claim 1, wherein the determining the data hierarchy for each integrated structured data in the integrated structured data set results in a data hierarchy list comprising:
determining a tree data structure corresponding to the integrated structured dataset;
traversing each subtree corresponding to the tree data structure to obtain a structured data result after traversing;
Performing hierarchy division on the data hierarchy corresponding to the structured data result after traversing to obtain at least one divided data hierarchy;
And sequencing each divided data level in the at least one divided data level to obtain a data level list.
4. The method of claim 1, wherein the determining the user behavior of the structured data corresponding to each data hierarchy in the data hierarchy list to obtain the user behavior list includes:
Determining the operation time of the structured data corresponding to each data level in the data level list to obtain the operation time of the structured data;
Determining the operation frequency information of the user to obtain the operation frequency information in response to determining that the operation time of the structured data meets a preset time condition;
determining user login information of structured data corresponding to each data level in the data level list to obtain user login information;
Carrying out abnormal message determination on the structured data corresponding to each data level in the data level list to obtain abnormal information;
and combining the operation frequency information, the user login information and the abnormal information into a user behavior list.
5. The method of claim 1, wherein the performing data repair on the integrated structured dataset in response to determining that the inspection result characterizes an inspection anomaly, results in a repaired structured dataset, comprising:
In response to determining that the inspection result represents inspection abnormality, performing identity authority setting on the integrated structured dataset to obtain a set structured dataset;
performing security hole detection on the structured data set after setting to obtain a security hole detection result;
In response to determining that the security hole detection result represents abnormal user behavior, performing abnormal data isolation on the structured data set after the setting corresponding to the security hole detection result to obtain an isolated structured data set;
And carrying out security verification on the isolated structured data set to obtain a verified structured data set which is used as a repaired structured data set.
6. The method of claim 1, wherein the integrating of the at least one structured data in the product log data set with the structured data set and the parsed structured data set to obtain the integrated structured data set comprises:
performing data preprocessing on the at least one structured data to obtain a preprocessed structured data set;
creating an empty pivot table;
filling each preprocessed structured data in the preprocessed structured data set and each structured data in the structured data set into the empty pivot table to obtain a filled pivot table, wherein the filled pivot table takes each preprocessed structured data in the preprocessed structured data set as column data and each structured data in the structured data set as row data;
performing field matching on each row of data in the filled pivot table and each column of data corresponding to that row to obtain a set of matched structured data groups, wherein each matched structured data group comprises one row data and one column data;
for each matched structured data group in the set of matched structured data groups, performing data connection on the matched structured data in the group to obtain connected structured data;
and determining each obtained structured data after connection as an integrated data information set.
7. A structured data repair device comprising:
the acquisition unit is configured to acquire log data corresponding to each piece of product information to obtain a product log data set;
a generation unit configured to generate a structured data set from at least one unstructured data corresponding to the product log dataset;
The format analysis unit is configured to perform format analysis on at least one semi-structured data corresponding to the product log data set to obtain an analyzed structured data set;
The data integration unit is configured to integrate at least one structured data in the product log data set with the structured data set and the parsed structured data set to obtain an integrated structured data set;
A first determining unit configured to determine a data hierarchy of each integrated structured data in the integrated structured data set, resulting in a data hierarchy list;
The second determining unit is configured to determine user behaviors of the structured data corresponding to each data hierarchy in the data hierarchy list to obtain a user behavior list;
The security checking unit is configured to perform security checking on the integrated structured data set according to the user behavior list to obtain a checking result;
a data repair unit configured to perform data repair on the integrated structured dataset in response to determining that the inspection result characterizes an inspection anomaly, resulting in a repaired structured dataset;
And a storage unit configured to store the post-repair structured data set to a log data analysis platform.
8. An electronic device, comprising:
One or more processors;
A storage device having one or more programs stored thereon;
wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method of any of claims 1 to 6.
9. A computer readable medium having stored thereon a computer program, wherein the program when executed by a processor implements the method of any of claims 1 to 6.
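The hierarchy determination recited in claim 3 (determine a tree structure, traverse its subtrees, divide the result into levels, and sort the levels) can be sketched as a breadth-first traversal. This is only an illustrative reading: representing the integrated structured data as a nested dictionary, numbering levels from 1, and the function name `build_level_list` are all assumptions not found in the claims.

```python
from collections import deque


def build_level_list(tree):
    """Traverse a nested dict breadth-first and return [(level, [keys at that level]), ...]
    sorted by level, with the top-level keys at level 1."""
    levels = {}
    queue = deque([(tree, 1)])
    while queue:
        node, depth = queue.popleft()
        for key, child in node.items():
            # Record the field at its depth (level division).
            levels.setdefault(depth, []).append(key)
            # Nested dicts are treated as subtrees and visited one level deeper.
            if isinstance(child, dict):
                queue.append((child, depth + 1))
    # Sort the divided levels to obtain the data hierarchy list.
    return [(depth, levels[depth]) for depth in sorted(levels)]
```

For a record such as `{"order": {"user": {"id": 1, "login": "a"}, "total": 9}, "meta": "x"}`, the function groups `order` and `meta` at level 1, `user` and `total` at level 2, and `id` and `login` at level 3.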
CN202410182674.1A 2024-02-19 2024-02-19 Structured data repair method, apparatus, electronic device and computer readable medium Active CN118035220B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410182674.1A CN118035220B (en) 2024-02-19 2024-02-19 Structured data repair method, apparatus, electronic device and computer readable medium

Publications (2)

Publication Number Publication Date
CN118035220A (en) 2024-05-14
CN118035220B (en) 2024-07-30

Family

ID=91001795

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410182674.1A Active CN118035220B (en) 2024-02-19 2024-02-19 Structured data repair method, apparatus, electronic device and computer readable medium

Country Status (1)

Country Link
CN (1) CN118035220B (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20050066222A1 (en) * 2003-09-23 2005-03-24 Revivio, Inc. Systems and methods for time dependent data storage and recovery
CN114661686A (en) * 2022-04-13 2022-06-24 中国工商银行股份有限公司 Message extraction method, device, equipment, medium and program product of log file
CN117472874A (en) * 2023-10-08 2024-01-30 联通数字科技有限公司 Government affair data resource integrated management system and method based on big data analysis

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
HORITA H ET AL.: "Process Mining Approach Based on Partial Structures of Event Logs and Decision Tree Learning", 《PROCEEDINGS 2016 5TH IIAI INTERNATIONAL CONGRESS ON ADVANCED APPLIED INFORMATICS IIAI-AAI 2016》, 4 January 2017 (2017-01-04), pages 113 - 118 *
吉锋 等: "智能运维技术在电信大视频业务中的应用研究", 《信息通信技术》, no. 01, 15 February 2018 (2018-02-15), pages 30 - 36 *

Also Published As

Publication number Publication date
CN118035220B (en) 2024-07-30

Legal Events

Code Title
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant