CN112131291B - Structured analysis method, device and equipment based on JSON data and storage medium

Info

Publication number: CN112131291B
Application number: CN202010956101.1A
Authority: CN
Inventors: 刘德彬; 黄远江; 孙世通; 邓雪荣; 罗杰; 严絜
Original assignee: Chongqing Socialcredits Big Data Technology Co ltd
Current assignee: Chongqing Yucun Technology Co ltd
Priority date: 2020-09-11
Filing date: 2020-09-11
Publication date: 2023-12-15
Anticipated expiration: 2040-09-11
Also published as: CN112131291A

Abstract

The application provides a structural analysis method, a device, equipment and a storage medium based on JSON data. The data standardization of the JSON data is realized, the problem of diversity of the data is solved, and a more flexible, concise and clear configuration mode is realized, so that the comprehensiveness and stability of data extraction are ensured.

Description

Structured analysis method, device and equipment based on JSON data and storage medium

Technical Field

The present application relates to the field of system flow design, and in particular, to a JSON data-based structured analysis method, apparatus, device, and storage medium.

Background

JSON (JavaScript Object Notation) is a lightweight data-interchange format. JSON data is semi-structured: its structure is loose and it can hold fairly complex data types, so JSON is widely used in databases such as MongoDB (a database based on distributed file storage) in place of relational databases to store data under high concurrency. When targeted mining and analysis is performed on massive data, the JSON data sometimes has to be parsed into structured data first so that it can be mined and analyzed.

At present, JSON data is structured mainly by parsing its key-value pairs and integrating each JSON record into a data table. Because existing methods extract the key-value pairs directly, they are error-prone when the volume of JSON data is large, when the keys are not fixed, or when the paths are deeply nested, and the efficiency of structured parsing is therefore low.

JSON structured data mapping systems are widely used in data processing; their value is that they make data handling more concise and clear, which benefits data analysis. Data standardization, however, is a complicated step: different data sources mean different data shapes, and this diversity is a persistent difficulty when standardizing data. A more flexible and concise configuration approach is therefore urgently needed to ensure comprehensive and stable data extraction.

Disclosure of Invention

In view of the foregoing, it is desirable to provide a method, apparatus, device and storage medium for structured parsing based on JSON data.

A structured parsing method based on JSON data, the method comprising: acquiring target JSON data, identifying the data attribute of the target JSON data, and dividing the target JSON data into a plurality of data modules according to the data attribute, wherein each data module corresponds to one data attribute; acquiring a source field composition of the data module, carrying out path configuration on the source field to obtain a target field, and establishing a relation between the source field and the target field to obtain a configuration file; extracting data of the data module to obtain original data, and processing the original data according to the relation between the source field and the target field in the configuration file to obtain data to be processed; and merging and cleaning the data to be processed to obtain target data.

In one embodiment, the obtaining the source field of the data module, performing path configuration on the source field to obtain a target field, and after establishing a relationship between the source field and the target field to obtain a configuration file, further includes: according to the relation between the source field and the target field, the data module is divided into three types of simple type, advanced type and complex type, wherein the simple type is that the source field and the target field are in one-to-one correspondence, the advanced type is that one source field corresponds to a plurality of target fields, and the complex type is that a plurality of source fields correspond to a plurality of target fields.

In one embodiment, the extracting the data of the data module obtains original data, and processes the original data according to the relationship between the source field and the target field in the configuration file to obtain data to be processed, specifically: extracting data of the data module to obtain original data; when the data module belongs to a simple class, the source field and the target field are in one-to-one correspondence, and the data to be processed is directly obtained according to the original data.

In one embodiment, the extracting the data of the data module obtains original data, and processes the original data according to the relationship between the source field and the target field in the configuration file to obtain data to be processed, specifically: extracting data of the data module to obtain original data; when the data module belongs to the advanced class, the source field exists in a multi-level and has a one-to-many relation with the target field, and flattening processing is performed on the original data according to the relation between the source field and the target field in the configuration file to obtain data to be processed.

In one embodiment, the extracting the data of the data module obtains original data, and processes the original data according to the relationship between the source field and the target field in the configuration file to obtain data to be processed, specifically: extracting data of the data module to obtain original data; when the data module belongs to complex class, the source field exists in multiple levels, the target field also exists in multiple levels, and the original data is processed according to the relation between the source field and the target field in the configuration file, so as to obtain data to be processed.

In one embodiment, after the merging and cleaning of the data to be processed to obtain the target data, the method further includes: and establishing a database table, and filling the target data into the database table to obtain a standardized database table.

The structured analysis device based on JSON data comprises a data dividing module, a path configuration module, a data extraction module and a data processing module, wherein: the data dividing module is used for acquiring target JSON data, identifying the data attribute of the target JSON data, dividing the target JSON data into a plurality of data modules according to the data attribute, wherein each data module corresponds to one data attribute; the path configuration module is used for acquiring the source field composition of the data module, carrying out path configuration on the source field to obtain a target field, and establishing the relation between the source field and the target field to obtain a configuration file; the data extraction module is used for extracting the data of the data module to obtain original data, and processing the original data according to the relation between the source field and the target field in the configuration file to obtain data to be processed; the data processing module is used for merging and cleaning the data to be processed to obtain target data.

In one embodiment, the apparatus further comprises a table establishment module, wherein: the table establishing module is used for establishing a database table, and filling the target data into the database table to obtain a standardized database table.

An apparatus comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the steps of the JSON data-based structured parsing method described in the various embodiments above when the program is executed.

A storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the JSON data-based structured parsing method described in the various embodiments above.

According to the structural analysis method, the device, the equipment and the storage medium based on the JSON data, the JSON data is subjected to module division according to the data attribute to obtain a plurality of data modules, a configuration file of the mapping relation between the source field and the target field is generated for each data module, then the data modules are subjected to data extraction to obtain original data, the original data is flattened according to the configuration file to obtain data to be processed, finally the data to be processed is combined and cleaned to obtain target data, and then the target data is filled into a preset database table to obtain a standardized database table. The data standardization of the JSON data is realized, the problem of diversity of the data is solved, and a more flexible, concise and clear configuration mode is realized, so that the comprehensiveness and stability of data extraction are ensured.

Drawings

FIG. 1 is a flow diagram of a method of structured parsing based on JSON data in one embodiment;

FIG. 2 is a flow diagram of a method of structured parsing based on JSON data in another embodiment;

FIG. 3 is a mapping relationship diagram of source fields and target fields of a simple class in one embodiment;

FIG. 4 is a mapping relationship diagram of source field and target field of an advanced class in one embodiment;

FIG. 5 is a mapping relationship diagram of source fields and target fields of complex classes in one embodiment;

FIG. 6 is a block diagram of a structured parsing apparatus based on JSON data in one embodiment;

FIG. 7 is an internal structural diagram of the device in one embodiment.

Detailed Description

In order to make the objects, technical solutions and advantages of the present application more apparent, the present application will be further described in detail by the following detailed description with reference to the accompanying drawings. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the application.

In one embodiment, as shown in FIG. 2, a structured parsing method based on JSON data is provided, including the following steps:

S110, acquiring target JSON data, identifying the data attribute of the target JSON data, and dividing the target JSON data into a plurality of data modules according to the data attribute, wherein each data module corresponds to one data attribute.

Specifically, the target JSON data is first acquired and its data attributes are identified. The target JSON data contains multiple data attributes, so it is divided into multiple data modules according to those attributes, with each data module corresponding to one data attribute.
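As a rough illustration of step S110, the following Python sketch splits a JSON document into data modules. It assumes, purely for illustration, that each top-level key of the document names one data attribute; the function name split_into_modules and the sample keys are hypothetical and are not taken from the application.

```python
import json


def split_into_modules(target_json: str) -> dict:
    """Split a JSON document into data modules, one per data attribute.

    Assumption (not from the application): every top-level key of the
    document is treated as one data attribute.
    """
    document = json.loads(target_json)
    if not isinstance(document, dict):
        raise ValueError("expected a JSON object at the top level")
    # One module per attribute; the module content is left untouched here.
    return {attribute: content for attribute, content in document.items()}


# Hypothetical sample input with two data attributes.
raw = '{"basic_info": {"name": "ACME"}, "declarations": [{"year": 2019}]}'
modules = split_into_modules(raw)
print(sorted(modules))
```

How data attributes are actually identified depends on the data source; a real deployment might use a lookup table or a schema rather than top-level keys.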

S120, acquiring a source field composition of the data module, carrying out path configuration on the source field to obtain a target field, and establishing a relation between the source field and the target field to obtain a configuration file.

Specifically, the data sources of the modules differ by region, and the modules also differ in hierarchy, so within one module the same dest_key (target field) may correspond to several source_keys (source fields). This is particularly evident in declaration information. Declarations fall broadly into four types according to the type of taxpayer and the collection item. Even within one declaration type there are subtle differences, and the declaration form for the same type is not consistent from year to year, so a dest_key of a module may correspond to several source_keys across regions. For clarity, a dedicated path configuration file needs to be generated for each declaration type.

In one embodiment, step S120 further includes: according to the relation between the source field and the target field, the data module is divided into three classes, namely the simple class, the advanced class and the complex class, wherein in the simple class the source fields correspond to the target fields one by one, in the advanced class one source field corresponds to a plurality of target fields, and in the complex class a plurality of source fields correspond to a plurality of target fields. Specifically, the modules are divided into three major classes according to the complexity of path configuration. As shown in FIG. 3, in the simple class the paths of the source fields and the target fields are in one-to-one correspondence, and the configuration file is written directly. As shown in FIG. 4, in the advanced class one source field corresponds to a plurality of target fields, that is, a required source field sits at the same level as the parent node of the required detail data fields. As shown in FIG. 5, in the complex class a plurality of source fields correspond to a plurality of target fields, that is, the module contains both a one-to-many relationship and a many-to-one relationship.
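The three classes can be sketched in Python as follows. The configuration layout (the keys classify, detail_key and fields) and the function classify_module are hypothetical; the application does not specify a file format. The sketch also assumes that any module with a many-to-one target field is treated as complex, even without a classification field.

```python
def classify_module(config: dict) -> str:
    """Return 'simple', 'advanced' or 'complex' for a module configuration.

    Assumed layout (hypothetical):
      config["classify"]: optional mapping for a classification field that
                          sits beside the parent node of the detail records
      config["fields"]:   target field -> list of source field paths
    """
    has_classification = bool(config.get("classify"))
    has_many_to_one = any(len(sources) > 1 for sources in config["fields"].values())
    if has_many_to_one:
        return "complex"
    if has_classification:
        return "advanced"
    return "simple"


simple_cfg = {"fields": {"dest_a": ["ori_a"], "dest_b": ["ori_b"]}}
advanced_cfg = {
    "classify": {"source": "ori_a", "dest": "dest_a"},
    "detail_key": "items",
    "fields": {"dest_b": ["ori_b"], "dest_c": ["ori_c"]},
}
complex_cfg = {
    "classify": {"source": "ori_a", "dest": "dest_a"},
    "detail_key": "items",
    "fields": {"dest_b": ["ori_d", "ori_e"], "dest_c": ["ori_c"]},
}

for cfg in (simple_cfg, advanced_cfg, complex_cfg):
    print(classify_module(cfg))
```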

S130, extracting data of the data module to obtain original data, and processing the original data according to the relation between the source field and the target field in the configuration file to obtain data to be processed.

Specifically, multi-level nesting in the extracted data hinders data cleaning, so the original data needs to be flattened: each piece of original data is brought to the same level by reading the configuration file generated in step S120. When processing advanced-class or complex-class data, the configuration file is read twice: the first pass reads the classification information, and the second pass loads the detail-data configuration on the basis of the classification data.

In one embodiment, step S130 is specifically: extracting data of the data module to obtain original data; when the data module belongs to a simple class, the source field and the target field are in one-to-one correspondence, and the data to be processed is directly obtained according to the original data correspondence. Specifically, as shown in fig. 3, when the data module belongs to a simple class, the original data directly corresponds to the data to be processed because the source field and the target field are in a one-to-one correspondence.
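For the simple class, a minimal Python sketch of the one-to-one mapping might look like the following. It assumes that source paths are written as dot-separated keys, which is an illustrative convention and not something the application prescribes; get_by_path and extract_simple are hypothetical helper names.

```python
def get_by_path(record: dict, path: str):
    """Follow a dot-separated path through nested dicts; None if missing."""
    current = record
    for key in path.split("."):
        if not isinstance(current, dict) or key not in current:
            return None
        current = current[key]
    return current


def extract_simple(record: dict, field_map: dict) -> dict:
    """Map each target field to the value found at its single source path."""
    return {dest: get_by_path(record, src) for dest, src in field_map.items()}


# Hypothetical record and one-to-one configuration.
record = {"ori_a": 1, "nested": {"ori_b": "x"}}
field_map = {"dest_a": "ori_a", "dest_b": "nested.ori_b"}
row = extract_simple(record, field_map)
print(row)  # {'dest_a': 1, 'dest_b': 'x'}
```

Returning None for a missing path keeps every output row at the same width, which simplifies later merging and cleaning.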

In one embodiment, step S130 is specifically: extracting data of the data module to obtain original data; when the data module belongs to the advanced class, the source field spans multiple levels and has a one-to-many relation with the target fields, and the original data is flattened according to the relation between the source field and the target field in the configuration file to obtain the data to be processed. Specifically, as shown in FIG. 4, when the data module belongs to the advanced class, the configuration file is given the same hierarchical structure as the source fields. The first layer is a configuration that selects all records containing the ori_a field, or those in which ori_a has a specified value; the second layer is the configuration for the remaining detail fields, written in the same way as for the simple class. In essence, the original data is flattened according to the configuration file to obtain the data to be processed.
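The two-layer reading for the advanced class can be sketched as below. The configuration keys (classify, equals, detail_key, fields) and the function flatten_advanced are hypothetical. The sketch keeps the detail rows as a list next to the classification value, on the assumption that the classification value is distributed to each detail record later, in the merging step S140.

```python
def flatten_advanced(records: list, config: dict) -> list:
    """Two-layer extraction for an advanced-class module.

    Layer 1: keep records that contain the classification field (or whose
             classification field equals a configured value).
    Layer 2: map the detail fields one-to-one, as in the simple class.
    """
    src = config["classify"]["source"]
    dest = config["classify"]["dest"]
    wanted = config["classify"].get("equals")  # None: any record with the field
    rows = []
    for record in records:
        if src not in record:
            continue
        if wanted is not None and record[src] != wanted:
            continue
        details = [
            {d: item.get(s) for d, s in config["fields"].items()}
            for item in record.get(config["detail_key"], [])
        ]
        rows.append({dest: record[src], "details": details})
    return rows


# Hypothetical input: the second record lacks ori_a and is filtered out.
records = [
    {"ori_a": "VAT", "items": [{"ori_b": 1, "ori_c": "Q1"}, {"ori_b": 2, "ori_c": "Q2"}]},
    {"other": "no classification", "items": [{"ori_b": 9, "ori_c": "Q9"}]},
]
config = {
    "classify": {"source": "ori_a", "dest": "dest_a"},
    "detail_key": "items",
    "fields": {"dest_b": "ori_b", "dest_c": "ori_c"},
}
pending = flatten_advanced(records, config)
print(pending)
```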

In one embodiment, step S130 is specifically: extracting data of the data module to obtain original data; when the data module belongs to the complex class, the source fields span multiple levels and the target fields also span multiple levels, and the original data is processed according to the relation between the source field and the target field in the configuration file to obtain the data to be processed. Specifically, as shown in FIG. 5, a data module of the complex class contains both a one-to-many relationship and a many-to-one relationship. The one-to-many relationship is processed first, using the advanced-class method: a classification configuration is generated from the ori_a field. The many-to-one relationship in the detail part is then flattened by encoding the target field. For example, if the dest_b field can be formed from the two fields ori_d and ori_e, the value at the ori_d path is written to dest_b_0 and the value at the ori_e path to dest_b_1; if there are more source fields, they are numbered in the same way up to dest_b_n, where n is the index of the corresponding source field. The remaining fields are processed in the same way as for the simple class. In essence, the original data is flattened according to the configuration file to obtain the data to be processed.
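The target field coding described for the complex class can be illustrated with a short Python sketch. The function encode_target_fields and the configuration shape (target field mapped to a list of source fields) are hypothetical; only the dest_b_0, dest_b_1 naming follows the text.

```python
def encode_target_fields(fields: dict) -> dict:
    """Expand many-to-one target fields into numbered one-to-one fields.

    A target with a single source keeps its name; a target with several
    sources becomes dest_0, dest_1, ... in source order.
    """
    encoded = {}
    for dest, sources in fields.items():
        if len(sources) == 1:
            encoded[dest] = sources[0]
        else:
            for n, src in enumerate(sources):
                encoded[f"{dest}_{n}"] = src
    return encoded


fields = {"dest_b": ["ori_d", "ori_e"], "dest_c": ["ori_c"]}
encoded = encode_target_fields(fields)
print(encoded)  # {'dest_b_0': 'ori_d', 'dest_b_1': 'ori_e', 'dest_c': 'ori_c'}

# After encoding, each detail record is mapped one-to-one, as in the simple class.
item = {"ori_c": "Q1", "ori_d": "100", "ori_e": None}
row = {dest: item.get(src) for dest, src in encoded.items()}
print(row)  # {'dest_b_0': '100', 'dest_b_1': None, 'dest_c': 'Q1'}
```

Encoding turns every many-to-one mapping into plain one-to-one mappings, so the same extraction code can serve all three classes.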

And S140, merging and cleaning the data to be processed to obtain target data.

Specifically, an advanced-class data module has classification data, and the classification data dest_a needs to be assigned to each corresponding detail record, which can be implemented with the explode method of pandas. For a complex-class data module, after the advanced-class processing is completed, field logic processing is also required: as shown in FIG. 5, the new field dest_b is generated from dest_b_0 and dest_b_1 according to the field logic. Finally, all the data is de-duplicated according to the business logic, that is, cleaned, to obtain the target data. The target data is standardized data, which completes the structured parsing of the target JSON data.
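A minimal pandas sketch of step S140 follows. The sample values are invented, and the field logic used here (take dest_b_0 when present, otherwise dest_b_1) is only an assumed example; the actual rule depends on the business logic.

```python
import pandas as pd

# Hypothetical data to be processed: one classification value per row,
# with the detail records still held as a list.
pending = pd.DataFrame({
    "dest_a": ["VAT", "income"],
    "details": [
        [
            {"dest_b_0": "100", "dest_b_1": None, "dest_c": "Q1"},
            {"dest_b_0": None, "dest_b_1": "250", "dest_c": "Q2"},
            {"dest_b_0": None, "dest_b_1": "250", "dest_c": "Q2"},  # duplicate
        ],
        [{"dest_b_0": "75", "dest_b_1": None, "dest_c": "Q1"}],
    ],
})

# explode repeats dest_a once for every detail record in its list.
exploded = pending.explode("details", ignore_index=True)

# Turn each detail dict into columns and put them next to dest_a.
detail_columns = pd.DataFrame(exploded["details"].tolist())
merged = pd.concat([exploded[["dest_a"]], detail_columns], axis=1)

# Assumed field logic: the first non-missing coded value becomes dest_b.
merged["dest_b"] = merged["dest_b_0"].combine_first(merged["dest_b_1"])

# Cleaning: drop the coded helper columns and remove duplicate rows.
target = (
    merged.drop(columns=["dest_b_0", "dest_b_1"])
    .drop_duplicates()
    .reset_index(drop=True)
)
print(target)
```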

In one embodiment, as shown in fig. 2, after step S140, step S150 is further included: and establishing a database table, filling the target data into the database table, and obtaining a standardized database table. Specifically, the target data are stored in the corresponding database tables according to the data modules respectively, so that standardized database tables are obtained. The database table may be preset or built after the target data is obtained.
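Step S150 could be sketched with the sqlite3 module of the Python standard library as shown below. The choice of SQLite, the table name, the TEXT column type, and the helper fill_table are all assumptions made for illustration; the application does not name a database. Table and column names are interpolated into the SQL, so the sketch assumes they come from a trusted configuration.

```python
import sqlite3


def fill_table(conn: sqlite3.Connection, table: str, rows: list) -> None:
    """Create the table if needed and insert the target data rows.

    Assumes all rows share the keys of the first row and that the table
    and column names come from a trusted configuration.
    """
    if not rows:
        return
    columns = list(rows[0].keys())
    column_sql = ", ".join(f'"{name}" TEXT' for name in columns)
    conn.execute(f'CREATE TABLE IF NOT EXISTS "{table}" ({column_sql})')
    placeholders = ", ".join("?" for _ in columns)
    conn.executemany(
        f'INSERT INTO "{table}" VALUES ({placeholders})',
        [tuple(row[name] for name in columns) for row in rows],
    )
    conn.commit()


# Hypothetical target data for one data module.
target_rows = [
    {"dest_a": "VAT", "dest_b": "100", "dest_c": "Q1"},
    {"dest_a": "VAT", "dest_b": "250", "dest_c": "Q2"},
]
conn = sqlite3.connect(":memory:")
fill_table(conn, "declarations", target_rows)
stored = conn.execute('SELECT dest_a, dest_b, dest_c FROM "declarations"').fetchall()
print(stored)
```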

In the above embodiment, the JSON data is divided into the modules according to the data attribute to obtain a plurality of data modules, a configuration file of the mapping relationship between the source field and the target field is generated for each data module, then the data modules are subjected to data extraction to obtain the original data, the original data is flattened according to the configuration file to obtain the data to be processed, finally the data to be processed is combined and cleaned to obtain the target data, and then the target data is filled into a preset database table to obtain the standardized database table. The data standardization of the JSON data is realized, the problem of diversity of the data is solved, and a more flexible, concise and clear configuration mode is realized, so that the comprehensiveness and stability of data extraction are ensured.

In one embodiment, as shown in fig. 6, there is provided a JSON data-based structured parsing apparatus 200, which includes a data partitioning module 210, a path configuration module 220, a data extraction module 230, and a data processing module 240, wherein:

the data dividing module 210 is configured to obtain target JSON data, identify a data attribute of the target JSON data, and divide the target JSON data into a plurality of data modules according to the data attribute, where each data module corresponds to one data attribute;

the path configuration module 220 is configured to obtain a source field configuration of the data module, perform path configuration on the source field to obtain a target field, and establish a relationship between the source field and the target field to obtain a configuration file;

the data extraction module 230 is configured to extract data of the data module to obtain original data, and process the original data according to a relationship between a source field and a target field in the configuration file to obtain data to be processed;

the data processing module 240 is configured to combine and clean the data to be processed to obtain target data.

In one embodiment, the apparatus further comprises a data classification module, wherein: the data classification module is used for classifying the data module into three types of simple, advanced and complex according to the relation between the source field and the target field, wherein the simple type is that the source field and the target field are in one-to-one correspondence, the advanced type is that one source field corresponds to a plurality of target fields, and the complex type is that a plurality of source fields correspond to a plurality of target fields.

In one embodiment, the data extraction module 230 further comprises an extraction unit and a processing unit, wherein: the extraction unit is used for extracting the data of the data module to obtain original data; and the processing unit is used for directly obtaining the data to be processed according to the one-to-one correspondence of the source field and the target field when the data module belongs to the simple class.

In one embodiment, the data extraction module 230 further comprises an extraction unit and a processing unit, wherein: the extraction unit is used for extracting the data of the data module to obtain original data; and the processing unit is used for, when the data module belongs to the advanced class, flattening the original data according to the relation between the source field and the target field in the configuration file to obtain the data to be processed.

In one embodiment, the data extraction module 230 further comprises an extraction unit and a processing unit, wherein: the extraction unit is used for extracting the data of the data module to obtain original data; the processing unit is used for processing the original data according to the relation between the source field and the target field in the configuration file to obtain the data to be processed when the data module belongs to the complex class and the source field exists in the multiple layers and the target field also exists in the multiple layers.

In one embodiment, the apparatus further comprises a table establishment module, wherein: the table establishing module is used for establishing a database table, filling target data into the database table, and obtaining a standardized database table.

In one embodiment, an apparatus is provided, which may be a server, the internal structure of which may be as shown in fig. 7. The device includes a processor, a memory, a network interface, and a database connected by a system bus. Wherein the processor of the device is configured to provide computing and control capabilities. The memory of the device includes a non-volatile storage medium, an internal memory. The non-volatile storage medium stores an operating system, computer programs, and a database. The internal memory provides an environment for the operation of the operating system and computer programs in the non-volatile storage media. The database of the device is used for storing configuration templates and can also be used for storing target webpage data. The network interface of the device is used for communicating with an external terminal through a network connection. The computer program, when executed by a processor, implements a structured parsing method based on JSON data.

It will be appreciated by persons skilled in the art that the structure shown in fig. 6 is merely a block diagram of a portion of the structure associated with the present inventive arrangements and does not constitute a limitation of the apparatus to which the present inventive arrangements are applied, and that a particular apparatus may include more or less components than those shown in the drawings, or may combine certain components, or have a different arrangement of components.

In one embodiment, there is also provided a storage medium storing a computer program comprising program instructions which, when executed by a computer, cause the computer to perform a method as described in the previous embodiments, the computer being part of a JSON data-based structured parsing apparatus as referred to above.

Those skilled in the art will appreciate that implementing all or part of the above-described methods in accordance with the embodiments may be accomplished by way of a computer program stored on a computer readable storage medium, which when executed may comprise the steps of the embodiments of the methods described above. The storage medium may be a magnetic disk, an optical disk, a Read-Only Memory (ROM), a random access Memory (Random Access Memory, RAM), or the like.

It will be appreciated by those skilled in the art that the modules or steps of the application described above may be implemented in a general purpose computing device, they may be centralized on a single computing device, or distributed across a network of computing devices, or they may alternatively be implemented in program code executable by computing devices, such that they may be stored on a computer storage medium (ROM/RAM, magnetic or optical disk) for execution by computing devices, and in some cases, the steps shown or described may be performed in a different order than what is shown or described herein, or they may be individually manufactured as individual integrated circuit modules, or a plurality of modules or steps in them may be manufactured as a single integrated circuit module. Therefore, the present application is not limited to any specific combination of hardware and software.

The foregoing is a further detailed description of the application in connection with specific embodiments, and is not intended to limit the practice of the application to such descriptions. It will be apparent to those skilled in the art that several simple deductions or substitutions may be made without departing from the spirit of the application, and these should be considered to be within the scope of the application.

Claims

1. The structural analysis method based on JSON data is characterized by comprising the following steps of:

acquiring target JSON data, identifying the data attribute of the target JSON data, and dividing the target JSON data into a plurality of data modules according to the data attribute, wherein each data module corresponds to one data attribute;

acquiring a source field composition of the data module, carrying out path configuration on the source field to obtain a target field, and establishing a relation between the source field and the target field to obtain a configuration file; according to the relation between the source field and the target field, the data module is divided into three types of simple types, advanced types and complex types, wherein the simple types are that the source field and the target field are in one-to-one correspondence, the advanced types are that one source field corresponds to a plurality of target fields, and the complex types are that a plurality of source fields correspond to a plurality of target fields;

extracting data of the data module to obtain original data, and processing the original data according to the relation between the source field and the target field in the configuration file to obtain data to be processed; the method comprises the following steps:

when the data module belongs to a simple class, the source field and the target field are in one-to-one correspondence, and the data to be processed is directly obtained according to the original data correspondence;

when the data module belongs to the advanced class, the source field exists in a multi-level and has a one-to-many relation with the target field, and flattening processing is carried out on the original data according to the relation between the source field and the target field in the configuration file to obtain data to be processed;

when the data module belongs to the complex class, the source field exists in multiple levels, the target field also exists in multiple levels, and the original data is processed according to the relation between the source field and the target field in the configuration file to obtain the data to be processed; specifically: firstly, processing the one-to-many relationship according to the processing method of the advanced class; for the many-to-one relationship of the detail part, flattening the relationship by target field coding, wherein the other fields are processed in the same way as for the simple class;

merging and cleaning the data to be processed to obtain target data; the data module of the advanced type has classification data, and the classification data is required to be distributed to each piece of corresponding detail record; the complex type data module also needs field logic processing after finishing advanced type data processing.

2. The method of claim 1, wherein after the merging and cleaning of the data to be processed to obtain the target data, further comprises:

and establishing a database table, and filling the target data into the database table to obtain a standardized database table.

3. The structured parsing device based on JSON data is characterized by comprising a data dividing module, a path configuration module, a data extraction module and a data processing module, wherein:

the data dividing module is used for acquiring target JSON data, identifying the data attribute of the target JSON data, dividing the target JSON data into a plurality of data modules according to the data attribute, wherein each data module corresponds to one data attribute;

the path configuration module is used for acquiring the source field composition of the data module, carrying out path configuration on the source field to obtain a target field, and establishing the relation between the source field and the target field to obtain a configuration file; according to the relation between the source field and the target field, the data module is divided into three types of simple types, advanced types and complex types, wherein the simple types are that the source field and the target field are in one-to-one correspondence, the advanced types are that one source field corresponds to a plurality of target fields, and the complex types are that a plurality of source fields correspond to a plurality of target fields;

the data extraction module is used for extracting the data of the data module to obtain original data, and processing the original data according to the relation between the source field and the target field in the configuration file to obtain data to be processed; the method comprises the following steps:

the data processing module is used for merging and cleaning the data to be processed to obtain target data; the data module of the advanced type has classification data, and the classification data is required to be distributed to each piece of corresponding detail record; the complex type data module also needs field logic processing after finishing advanced type data processing.

4. The apparatus of claim 3, further comprising a table creation module, wherein:

the table establishing module is used for establishing a database table, and filling the target data into the database table to obtain a standardized database table.

5. An apparatus comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor implements the steps of the method of any one of claims 1 to 2 when the computer program is executed.

6. A storage medium having stored thereon a computer program, which when executed by a processor, implements the steps of the method of any of claims 1 to 2.