Disclosure of Invention
In view of the above, it is necessary to provide a method, an apparatus, a device and a storage medium for structured parsing based on JSON data.
A structured parsing method based on JSON data, the method comprising: acquiring target JSON data, identifying the data attribute of the target JSON data, and dividing the target JSON data into a plurality of data modules according to the data attribute, wherein each data module corresponds to one data attribute; acquiring a source field structure of the data module, performing path configuration on the source field to obtain a target field, and establishing a relationship between the source field and the target field to obtain a configuration file; extracting data of the data module to obtain original data, and processing the original data according to the relation between the source field and the target field in the configuration file to obtain data to be processed; and merging and cleaning the data to be processed to obtain target data.
In one embodiment, the obtaining a source field of the data module, performing path configuration on the source field to obtain a target field, establishing a relationship between the source field and the target field, and obtaining a configuration file further includes: and dividing the data module into a simple type, an advanced type and a complex type according to the relation between the source field and the target field, wherein the simple type is that the source field and the target field are in one-to-one correspondence, the advanced type is that one source field exists and corresponds to a plurality of target fields, and the complex type is that a plurality of source fields exist and correspond to a plurality of target fields.
In one embodiment, the extracting data of the data module to obtain original data, and processing the original data according to a relationship between the source field and the target field in the configuration file to obtain to-be-processed data specifically includes: extracting data of the data module to obtain original data; and when the data module belongs to a simple class, the source field and the target field are in one-to-one correspondence, and the data to be processed is directly obtained according to the original data.
In one embodiment, the extracting data of the data module to obtain original data, and processing the original data according to a relationship between the source field and the target field in the configuration file to obtain to-be-processed data specifically includes: extracting data of the data module to obtain original data; and when the data module belongs to an advanced class, the source field exists in multiple levels, a one-to-many relationship exists between the source field and the target field, and the original data is subjected to flattening processing according to the relationship between the source field and the target field in the configuration file to obtain the data to be processed.
In one embodiment, the extracting data of the data module to obtain original data, and processing the original data according to a relationship between the source field and the target field in the configuration file to obtain to-be-processed data specifically includes: extracting data of the data module to obtain original data; and when the data module belongs to a complex class, the source field exists in multiple levels, the target field also exists in multiple levels, and the original data is processed according to the relation between the source field and the target field in the configuration file to obtain the data to be processed.
In one embodiment, after the merging and cleaning the data to be processed to obtain the target data, the method further includes: and establishing a database table, and filling the target data into the database table to obtain a standardized database table.
A JSON data-based structured analysis device comprises a data dividing module, a path configuration module, a data extraction module and a data processing module, wherein: the data dividing module is used for acquiring target JSON data, identifying the data attribute of the target JSON data, and dividing the target JSON data into a plurality of data modules according to the data attribute, wherein each data module corresponds to one data attribute; the path configuration module is used for acquiring the source field structure of the data module, performing path configuration on the source field to obtain a target field, and establishing the relationship between the source field and the target field to obtain a configuration file; the data extraction module is used for extracting the data of the data module to obtain original data, and processing the original data according to the relation between the source field and the target field in the configuration file to obtain data to be processed; and the data processing module is used for merging and cleaning the data to be processed to obtain target data.
In one embodiment, the apparatus further comprises a table building module, wherein: the table establishing module is used for establishing a database table, and filling the target data into the database table to obtain a standardized database table.
An apparatus comprising a memory, a processor and a computer program stored in the memory and executable on the processor, wherein the processor executes the program to implement the steps of the JSON data-based structured parsing method described in the above embodiments.
A storage medium on which a computer program is stored, the program, when executed by a processor, implementing the steps of the JSON data-based structured parsing method described in the various embodiments above.
According to the method, the device, the equipment and the storage medium for structured analysis based on JSON data, the JSON data are subjected to module division according to data attributes to obtain a plurality of data modules, a configuration file of a source field and target field mapping relation is generated for each data module, then data extraction is carried out on the data modules to obtain original data, flattening processing is carried out on the original data according to the configuration file to obtain data to be processed, finally merging and cleaning processing are carried out on the data to be processed to obtain target data, and then the target data are filled into a preset database table to obtain a standardized database table. The data standardization of JSON data is realized, the problem of data diversity is solved, a more flexible, concise and clear configuration mode is realized, and the comprehensiveness and stability of data extraction are ensured.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings by way of specific embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
In one embodiment, as shown in fig. 2, there is provided a structured parsing method based on JSON data, including the following steps:
S110, obtaining target JSON data, identifying the data attributes of the target JSON data, and dividing the target JSON data into a plurality of data modules according to the data attributes, wherein each data module corresponds to one data attribute.
Specifically, the target JSON data is first obtained and its data attributes are identified. The target JSON data contains multiple data attributes, so it is divided into a plurality of data modules according to these attributes, each data module corresponding to one data attribute.
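By way of illustration only, step S110 can be sketched as follows. The sketch assumes that each top-level key of the JSON object is one data attribute; the description does not fix how attributes are identified, and the function name and sample attribute names are hypothetical.

```python
import json


def split_into_modules(raw_json: str) -> dict:
    """Divide a JSON document into data modules, one per data attribute.

    Assumption: each top-level key is treated as one data attribute.
    """
    document = json.loads(raw_json)
    if not isinstance(document, dict):
        raise ValueError("expected a JSON object at the top level")
    return {attribute: content for attribute, content in document.items()}


# Hypothetical input; the attribute names are illustrative only.
sample = '{"basic_info": {"name": "ACME"}, "declaration_info": [{"year": 2020}]}'
modules = split_into_modules(sample)
```

Each entry of `modules` can then be configured and extracted independently in the following steps.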
S120, obtaining the source field structure of the data module, performing path configuration on the source field to obtain a target field, and establishing the relationship between the source field and the target field to obtain a configuration file.
Specifically, the data sources of the modules differ by region, the modules also differ in hierarchy, and the same dest_key (target) in the same module may correspond to multiple source_keys (sources). This is particularly evident in the declaration information. Declaration information can be roughly divided into four declaration types according to the taxpayer and the collection item. Slight differences exist within the same declaration type, and the same declaration type may be presented differently in different years, so a dest_key of a module may correspond to a plurality of source_keys across regions. For clarity, a dedicated path configuration file therefore needs to be generated for each declaration type.
In one embodiment, step S120 is followed by: dividing the data modules into a simple type, an advanced type and a complex type according to the relationship between the source field and the target field, wherein the simple type is that the source field and the target field are in one-to-one correspondence, the advanced type is that one source field corresponds to a plurality of target fields, and the complex type is that a plurality of source fields correspond to a plurality of target fields. Specifically, according to the complexity of the path configuration, the modules are divided into three classes: a simple class, an advanced class and a complex class. As shown in fig. 3, in the simple class the paths of the source fields in the configuration file are in one-to-one correspondence with the target fields, and the paths only need to be written directly into the configuration file; as shown in fig. 4, in the advanced class one source field corresponds to multiple target fields, that is, a required source field exists at the same level as the parent of the required detail data fields; as shown in fig. 5, in the complex class a plurality of source fields correspond to a plurality of target fields, that is, the data has a one-to-many relationship while a plurality of one-to-one relationships also exist.
S130, extracting data of the data module to obtain original data, and processing the original data according to the relation between the source field and the target field in the configuration file to obtain data to be processed.
Specifically, structured data is multi-level, which hinders data cleaning, so the original data needs to be flattened: each item of original data is extracted to the same level by reading the configuration file generated in step S120. When processing advanced-class or complex-class data, the configuration file needs to be read twice: the classification information is read first, and then the configuration of the detail data is loaded on the basis of the classification data.
In one embodiment, step S130 specifically includes: extracting data of the data module to obtain original data; and when the data module belongs to the simple class, the source field and the target field are in one-to-one correspondence, and the data to be processed is directly obtained according to the original data. Specifically, as shown in fig. 3, when the data module belongs to a simple class, the original data directly corresponds to the data to be processed because the source field and the target field are in a one-to-one correspondence relationship.
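As a minimal sketch of the simple class, assume the configuration file is a mapping from each target field to a dot-separated source path. This layout, the helper names and the sample field values are assumptions made for illustration and are not prescribed by the description.

```python
def get_by_path(data, path):
    """Follow a dot-separated path through nested dicts; return None if absent."""
    current = data
    for key in path.split("."):
        if not isinstance(current, dict) or key not in current:
            return None
        current = current[key]
    return current


def extract_simple(module_data, config):
    """Build one flat record: each target field reads exactly one source path."""
    return {
        target: get_by_path(module_data, source_path)
        for target, source_path in config.items()
    }


# Hypothetical one-to-one configuration: target field -> source path.
simple_config = {"Dest_A": "Ori_A", "Dest_B": "detail.Ori_B"}
module_data = {"Ori_A": "x", "detail": {"Ori_B": 7}}
record = extract_simple(module_data, simple_config)
```

Because the mapping is one-to-one, the record produced here is already the data to be processed; no further flattening is needed.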
In one embodiment, step S130 specifically includes: extracting data of the data module to obtain original data; and when the data module belongs to the advanced class, the source field exists in multiple levels and has a one-to-many relationship with the target field, and the original data is flattened according to the relationship between the source field and the target field in the configuration file to obtain the data to be processed. Specifically, as shown in fig. 4, when the data module belongs to the advanced class, the configuration file is given the same hierarchical structure as the source fields: the first layer is a configuration that can screen out all records containing the Ori_a field, or all records in which Ori_a has a specified value; the second layer is the configuration of the remaining detail fields, generated in the same way as for the simple class. In essence, this operation flattens the original data according to the configuration file to obtain the data to be processed.
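The two-layer configuration of the advanced class might look like the following sketch. The configuration keys (`classification`, `detail_key`, `detail`), the `items` list and the Dest_C and Ori_C fields are hypothetical. For brevity the sketch copies the classification value onto each detail row immediately, whereas the description assigns it in step S140.

```python
def flatten_advanced(module_data, config):
    """Flatten groups whose classification field sits beside a list of details.

    Assumed layout (hypothetical): module_data is a list of groups; each group
    holds the classification field and a list of detail rows under one key.
    """
    class_source = config["classification"]["source"]
    class_target = config["classification"]["target"]
    wanted = config["classification"].get("value")  # None means "any value"
    rows = []
    for group in module_data:
        # First layer: keep groups that contain the classification field,
        # optionally restricted to one specific value.
        if class_source not in group:
            continue
        if wanted is not None and group[class_source] != wanted:
            continue
        # Second layer: map detail fields one-to-one, as in the simple class.
        for detail in group.get(config["detail_key"], []):
            row = {class_target: group[class_source]}
            for target, source in config["detail"].items():
                row[target] = detail.get(source)
            rows.append(row)
    return rows


advanced_config = {
    "classification": {"source": "Ori_a", "target": "Dest_a", "value": None},
    "detail_key": "items",
    "detail": {"Dest_C": "Ori_C"},
}
data = [
    {"Ori_a": "type1", "items": [{"Ori_C": 1}, {"Ori_C": 2}]},
    {"other": "skipped", "items": [{"Ori_C": 3}]},
]
flat = flatten_advanced(data, advanced_config)
```

The second group is dropped because it has no Ori_a field, which corresponds to the screening performed by the first layer of the configuration.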
In one embodiment, step S130 specifically includes: extracting data of the data module to obtain original data; and when the data module belongs to the complex class, the source field exists in multiple levels, the target field also exists in multiple levels, and the original data is processed according to the relationship between the source field and the target field in the configuration file to obtain the data to be processed. Specifically, as shown in fig. 5, when the data module belongs to the complex class, that is, both a one-to-many relationship and a many-to-one relationship exist, the one-to-many relationship is first processed by the advanced-class method, with a classification configuration generated according to the Ori_a field. The many-to-one relationship of the detail part is flattened by target field coding. For example, the Dest_B field may be composed of two fields, Ori_D and Ori_E; the configuration file records that the value of the Ori_D path is Dest_B_0 and the value of the Ori_E path is Dest_B_1, and any further source fields are labeled Dest_B_n in turn, where n is the number corresponding to the source field. The remaining fields are handled in the same way as in the simple class. In essence, this operation flattens the original data according to the configuration file to obtain the data to be processed.
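The target field coding for the many-to-one part can be illustrated as follows. The input layout (a target mapped to a list of source fields), the function name and the Dest_C and Ori_C fields are assumptions for the example; only the Dest_B_0, Dest_B_1 naming follows the description.

```python
def encode_targets(config):
    """Expand many-to-one entries into numbered single-source targets.

    Returns a mapping from source field to coded target field. A target built
    from several sources becomes target_0, target_1, ...; a target with a
    single source keeps its name.
    """
    encoded = {}
    for target, sources in config.items():
        if isinstance(sources, list):
            for n, source in enumerate(sources):
                encoded[source] = f"{target}_{n}"
        else:
            encoded[sources] = target
    return encoded


# Hypothetical detail configuration: Dest_B is composed of Ori_D and Ori_E.
detail_config = {"Dest_B": ["Ori_D", "Ori_E"], "Dest_C": "Ori_C"}
encoded_config = encode_targets(detail_config)

# Flatten one detail row with the coded configuration.
detail_row = {"Ori_C": 1, "Ori_D": "left", "Ori_E": "right"}
flat_row = {target: detail_row.get(source) for source, target in encoded_config.items()}
```

After coding, every entry is one-to-one again, so the detail part can be extracted exactly as in the simple class; recombining Dest_B_0 and Dest_B_1 into Dest_B is left to step S140.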
S140, merging and cleaning the data to be processed to obtain target data.
Specifically, the advanced-class data module contains classification data, and the classification data Dest_a needs to be assigned to each corresponding detail record, which can be implemented with the explode method of pandas. For a complex-class data module, after the advanced-class processing is complete, field logic processing is also required: as shown in fig. 5, a new field Dest_B is generated from Dest_B_0 and Dest_B_1 according to the field logic. Finally, all the data is deduplicated, that is, cleaned, according to the service logic to obtain the target data. The target data is standardized data, which completes the structured parsing of the target JSON data.
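A possible pandas sketch of step S140 is given below. The sample data, the `details` column and the field logic (Dest_B takes the first non-null of Dest_B_0 and Dest_B_1) are assumptions for illustration; the actual field logic and deduplication keys depend on the service logic.

```python
import pandas as pd

# Hypothetical data to be processed: one row per classification value,
# with the flattened detail records held in a list.
pending = pd.DataFrame({
    "Dest_a": ["type1", "type2"],
    "details": [
        [
            {"Dest_B_0": "x", "Dest_B_1": None},
            {"Dest_B_0": None, "Dest_B_1": "y"},
            {"Dest_B_0": "x", "Dest_B_1": None},
        ],
        [{"Dest_B_0": "x", "Dest_B_1": None}],
    ],
})

# Give every detail record its own row; Dest_a is repeated alongside each one.
exploded = pending.explode("details", ignore_index=True)
details = pd.DataFrame(exploded["details"].tolist())
merged = pd.concat([exploded[["Dest_a"]], details], axis=1)

# Assumed field logic: Dest_B takes the first non-null coded value.
merged["Dest_B"] = merged["Dest_B_0"].combine_first(merged["Dest_B_1"])

# Cleaning: drop the coded helper columns and remove duplicate records.
target = (
    merged.drop(columns=["Dest_B_0", "Dest_B_1"])
    .drop_duplicates()
    .reset_index(drop=True)
)
```

Here `explode` distributes each classification value over its detail records, and `drop_duplicates` stands in for the deduplication step.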
In one embodiment, as shown in fig. 2, after step S140, step S150 is further included: and establishing a database table, and filling the target data into the database table to obtain a standardized database table. Specifically, the target data is stored in the corresponding database tables according to each data module, so that the standardized database tables are obtained. The database table may be preset, or may be established after the target data is obtained.
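Step S150 could be sketched with Python's built-in sqlite3 module. The description does not name a database, so SQLite, the table name and the helper function are assumptions chosen only to keep the example self-contained.

```python
import sqlite3


def store_module(conn, table, records):
    """Create a table for one data module and insert its target records.

    Table and column names are assumed to come from trusted configuration,
    because identifiers cannot be passed as SQL parameters.
    """
    columns = list(records[0].keys())
    column_list = ", ".join(f'"{c}"' for c in columns)
    placeholders = ", ".join("?" for _ in columns)
    conn.execute(f'CREATE TABLE IF NOT EXISTS "{table}" ({column_list})')
    conn.executemany(
        f'INSERT INTO "{table}" ({column_list}) VALUES ({placeholders})',
        [tuple(r[c] for c in columns) for r in records],
    )
    conn.commit()


conn = sqlite3.connect(":memory:")
store_module(conn, "declaration_info", [
    {"Dest_a": "type1", "Dest_B": "x"},
    {"Dest_a": "type1", "Dest_B": "y"},
])
stored = conn.execute(
    'SELECT "Dest_a", "Dest_B" FROM "declaration_info" ORDER BY "Dest_B"'
).fetchall()
```

`CREATE TABLE IF NOT EXISTS` covers both cases mentioned above: the table may already exist, or it may be created once the target data is available.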
In the embodiment, the JSON data is divided into modules according to data attributes to obtain a plurality of data modules, a configuration file of a source field and target field mapping relationship is generated for each data module, data extraction is performed on the data modules to obtain original data, flattening processing is performed on the original data according to the configuration file to obtain data to be processed, merging and cleaning processing are performed on the data to be processed to obtain target data, and the target data is filled into a preset database table to obtain a standardized database table. The data standardization of JSON data is realized, the problem of data diversity is solved, a more flexible, concise and clear configuration mode is realized, and the comprehensiveness and stability of data extraction are ensured.
In one embodiment, as shown in fig. 6, there is provided a JSON data-based structured parsing apparatus 200, which includes a data dividing module 210, a path configuration module 220, a data extraction module 230, and a data processing module 240, wherein:
the data dividing module 210 is configured to obtain target JSON data, identify a data attribute of the target JSON data, and divide the target JSON data into a plurality of data modules according to the data attribute, where each data module corresponds to one data attribute;
the path configuration module 220 is configured to obtain a source field structure of the data module, perform path configuration on the source field to obtain a target field, and establish a relationship between the source field and the target field to obtain a configuration file;
the data extraction module 230 is configured to extract data of the data module to obtain original data, and process the original data according to a relationship between a source field and a target field in a configuration file to obtain data to be processed;
the data processing module 240 is configured to merge and clean the data to be processed to obtain target data.
In one embodiment, the apparatus further comprises a data classification module, wherein: the data classification module is used for dividing the data module into a simple type, an advanced type and a complex type according to the relation between the source field and the target field, wherein the simple type is that the source field and the target field are in one-to-one correspondence, the advanced type is that one source field corresponds to a plurality of target fields, and the complex type is that a plurality of source fields correspond to a plurality of target fields.
In one embodiment, the data extraction module 230 further comprises an extraction unit and a processing unit, wherein: the extraction unit is used for extracting the data of the data module to obtain original data; and the processing unit is used for, when the data module belongs to the simple class and the source field and the target field are in one-to-one correspondence, obtaining the data to be processed directly from the original data.
In one embodiment, the data extraction module 230 further comprises an extraction unit and a processing unit, wherein: the extraction unit is used for extracting the data of the data module to obtain original data; and the processing unit is used for, when the data module belongs to the advanced class and the source field exists in multiple levels and has a one-to-many relationship with the target field, flattening the original data according to the relationship between the source field and the target field in the configuration file to obtain the data to be processed.
In one embodiment, the data extraction module 230 further comprises an extraction unit and a processing unit, wherein: the extraction unit is used for extracting the data of the data module to obtain original data; and the processing unit is used for, when the data module belongs to the complex class and both the source field and the target field exist in multiple levels, processing the original data according to the relationship between the source field and the target field in the configuration file to obtain the data to be processed.
In one embodiment, the apparatus further comprises a table building module, wherein: the table establishing module is used for establishing a database table, and filling the target data into the database table to obtain a standardized database table.
In one embodiment, a device is provided, which may be a server, and its internal structure diagram may be as shown in fig. 7. The device includes a processor, a memory, a network interface, and a database connected by a system bus. Wherein the processor of the device is configured to provide computing and control capabilities. The memory of the device comprises a nonvolatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, a computer program, and a database. The internal memory provides an environment for the operation of an operating system and computer programs in the non-volatile storage medium. The database of the device is used for storing configuration templates and also can be used for storing target webpage data. The network interface of the device is used for communicating with an external terminal through a network connection. The computer program is executed by a processor to implement a structured parsing method based on JSON data.
Those skilled in the art will appreciate that the configuration shown in fig. 7 is a block diagram of only a portion of the configuration associated with the present application and does not constitute a limitation on the devices to which the present application may be applied, and that a particular device may include more or fewer components than those shown, or may combine certain components, or have a different arrangement of components.
In one embodiment, a storage medium is further provided, which stores a computer program comprising program instructions, which when executed by a computer, which may be part of the above-mentioned JSON data-based structured parsing apparatus, cause the computer to perform the method according to the preceding embodiments.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program, which can be stored in a computer-readable storage medium, and when executed, can include the processes of the embodiments of the methods described above. The storage medium may be a magnetic disk, an optical disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), or the like.
It will be apparent to those skilled in the art that the modules or steps of the invention described above may be implemented in a general purpose computing device, they may be centralized on a single computing device or distributed across a network of computing devices, and optionally they may be implemented in program code executable by a computing device, such that they may be stored on a computer storage medium (ROM/RAM, magnetic disks, optical disks) and executed by a computing device, and in some cases, the steps shown or described may be performed in an order different than that described herein, or they may be separately fabricated into individual integrated circuit modules, or multiple ones of them may be fabricated into a single integrated circuit module. Thus, the present invention is not limited to any specific combination of hardware and software.
The foregoing is a more detailed description of the present invention that is presented in conjunction with specific embodiments, and the practice of the invention is not to be considered limited to those descriptions. For those skilled in the art to which the invention pertains, several simple deductions or substitutions can be made without departing from the spirit of the invention, and all shall be considered as belonging to the protection scope of the invention.