CN112131337A

CN112131337A - Method, device and system for processing unstructured data and storage medium

Info

Publication number: CN112131337A
Application number: CN201910550927.5A
Authority: CN
Inventors: 何强
Original assignee: Beijing Jingdong Century Trading Co Ltd; Beijing Jingdong Shangke Information Technology Co Ltd
Current assignee: Beijing Jingdong Century Trading Co Ltd; Beijing Jingdong Shangke Information Technology Co Ltd
Priority date: 2019-06-24
Filing date: 2019-06-24
Publication date: 2020-12-25

Abstract

The invention provides a method, a device, a system and a storage medium for processing unstructured data, wherein the method comprises the following steps: reading unstructured data and paragraph configuration information to be processed; carrying out segmentation processing on the unstructured data to obtain paragraph data; according to the paragraph configuration information, calling a corresponding verification rule to verify the paragraph data to obtain analyzed target data; and storing the target data into a database. Therefore, the method can perform segmented processing on the unstructured data, simplify the analysis process of the unstructured data, form a simple and unified data structure, and facilitate the editing of the unstructured data, thereby improving the reusability and expansibility of codes.

Description

Method, device and system for processing unstructured data and storage medium

Technical Field

The present invention relates to the field of data processing technologies, and in particular, to a method, an apparatus, a system, and a storage medium for processing unstructured data.

Background

With the development of Internet technology, Web applications are used in more and more fields. Web projects often need to interact with a front-end page and then store the unstructured data of HTML pages in a relational database.

At present, a common processing scheme either uses an editing plug-in for unstructured data to store the original HTML code in a database, or parses the unstructured data of HTML pages in a complex way and stores it in structured form in a table containing many fields.

However, in these schemes the parsing process is complicated, and it is difficult to abstract HTML code into an integral component, which makes the unstructured data hard to edit.

Disclosure of Invention

The invention provides a method, a device, a system and a storage medium for processing unstructured data, which can be used for processing the unstructured data in a segmented manner, simplifying the analysis process of the unstructured data to form a simple and uniform data structure, and facilitating the editing of the unstructured data, thereby improving the reusability and the expansibility of codes.

In a first aspect, an embodiment of the present invention provides a method for processing unstructured data, including:

reading unstructured data and paragraph configuration information to be processed;

carrying out segmentation processing on the unstructured data to obtain paragraph data;

according to the paragraph configuration information, calling a corresponding verification rule to verify the paragraph data to obtain analyzed target data;

and storing the target data into a database.

In one possible design, the paragraph configuration information includes: the paragraph type, the field name, the field name used for storage, the check rule, and the correspondence between paragraph types and components.

In one possible design, segmenting the unstructured data to obtain paragraph data includes:

sequentially reading each component in the unstructured data;

if the current component is a single component, converting the component content into paragraph data;

if the current component is an integral component, splitting the integral component according to the parent-child relationship of the integral component to obtain a split component;

and converting the content of each split component into paragraph data, wherein the single component and the content of the split component only contain one paragraph type.

In one possible design, according to paragraph configuration information, invoking a corresponding verification rule to verify the paragraph data to obtain analyzed target data, including:

generating additional information corresponding to the paragraph type according to the verification rule;

and supplementing the additional information in the paragraph data to obtain the target data.

In one possible design, further comprising:

and carrying out any one or more editing operations of deletion, modification and addition on the paragraph configuration information to obtain updated paragraph configuration information.

In one possible design, further comprising:

retrieving the target data from the database;

creating a single component or an integral component according to the paragraph type of the target data;

taking the target data as the component content of the single component or the whole component;

displaying the single component or the whole component on the client.

In a second aspect, an embodiment of the present invention provides an apparatus for processing unstructured data, including:

the reading module is used for reading unstructured data to be processed and paragraph configuration information;

the segmentation module is used for carrying out segmentation processing on the unstructured data to obtain paragraph data;

the processing module is used for calling a corresponding verification rule to verify the paragraph data according to the paragraph configuration information to obtain analyzed target data;

and the storage module is used for storing the target data into a database.

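In one possible design, the segmentation module is specifically configured to:

sequentially reading each component in the unstructured data;

if the current component is a single component, converting the component content into paragraph data;

if the current component is an integral component, splitting the integral component according to the parent-child relationship of the integral component to obtain split components;

and converting the content of each split component into paragraph data, wherein the content of the single component and of each split component contains only one paragraph type.

In one possible design, the processing module is specifically configured to:

generating additional information corresponding to the paragraph type according to the verification rule;

and supplementing the additional information in the paragraph data to obtain the target data.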

In one possible design, further comprising:

and the editing module is used for carrying out any one or more editing operations of deletion, modification and addition on the paragraph configuration information to obtain the updated paragraph configuration information.

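In one possible design, the apparatus further comprises a display module, specifically configured to:

retrieving the target data from the database;

creating a single component or an integral component according to the paragraph type of the target data;

taking the target data as the component content of the single component or the integral component;

displaying the single component or the integral component on the client.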

In a third aspect, an embodiment of the present invention provides a system for processing unstructured data, including a memory and a processor, wherein the memory stores instructions executable by the processor, and the processor is configured to perform the method of processing unstructured data of any one of the first aspect by executing the executable instructions.

In a fourth aspect, an embodiment of the present invention provides a computer-readable storage medium, on which a computer program is stored, where the computer program, when executed by a processor, implements the processing method of unstructured data described in any one of the first aspects.

In a fifth aspect, an embodiment of the present invention provides a program product, where the program product includes: a computer program stored in a readable storage medium, from which at least one processor of a server can read the computer program, the at least one processor executing the computer program causing the server to perform the method of processing unstructured data described in any one of the first aspects.

The invention provides a processing method, a device and a system of unstructured data and a storage medium, which are characterized in that the unstructured data to be processed and paragraph configuration information are read; carrying out segmentation processing on the unstructured data to obtain paragraph data; according to the paragraph configuration information, calling a corresponding verification rule to verify the paragraph data to obtain analyzed target data; and storing the target data into a database. Therefore, the method can perform segmented processing on the unstructured data, simplify the analysis process of the unstructured data, form a simple and unified data structure, and facilitate the editing of the unstructured data, thereby improving the reusability and expansibility of codes.

Drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings needed to be used in the description of the embodiments or the prior art will be briefly introduced below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to these drawings without creative efforts.

FIG. 1 is a schematic diagram of an application scenario of the present invention;

FIG. 2 is a flowchart of a method for processing unstructured data according to an embodiment of the present invention;

FIG. 3 is a diagram illustrating paragraph configuration information according to an embodiment of the present invention;

FIG. 4 is a diagram illustrating paragraph data provided by an embodiment of the present invention;

FIG. 5 is a schematic diagram of target data provided by an embodiment of the present invention;

FIG. 6(a) is a schematic diagram of paragraph data corresponding to a video type according to an embodiment of the present invention;

FIG. 6(b) is a schematic diagram of target data corresponding to a video type according to an embodiment of the present invention;

FIG. 7 is a flowchart of a method for processing unstructured data according to a second embodiment of the present invention;

FIG. 8 is a schematic structural diagram of an apparatus for processing unstructured data according to a third embodiment of the present invention;

FIG. 9 is a schematic structural diagram of an apparatus for processing unstructured data according to a fourth embodiment of the present invention;

FIG. 10 is a schematic structural diagram of a system for processing unstructured data according to a fifth embodiment of the present invention.

The above drawings show specific embodiments of the disclosure, which are described in more detail below. The drawings and the written description are not intended to limit the scope of the disclosed concepts in any way, but rather to illustrate the concepts of the disclosure to those skilled in the art by reference to specific embodiments.

Detailed Description

In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

The terms "first," "second," "third," "fourth," and the like in the description and in the claims, as well as in the drawings, if any, are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the invention described herein are, for example, capable of operation in sequences other than those illustrated or otherwise described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.

The technical solution of the present invention will be described in detail below with specific examples. The following several specific embodiments may be combined with each other, and details of the same or similar concepts or processes may not be repeated in some embodiments.

With the development of Internet technology, Web applications are used in more and more fields. Web projects often need to interact with a front-end page and then store the unstructured data of HTML pages in a relational database. At present, a common processing scheme either uses an editing plug-in for unstructured data to store the original HTML code in a database, or parses the unstructured data of HTML pages in a complex way and stores it in structured form in a table containing many fields. However, in these schemes the parsing process is complicated, and it is difficult to abstract HTML code into an integral component, which makes the unstructured data hard to edit.

Taking rich text in HTML as an example, directly storing rich text has the following main disadvantages: 1. Rich text is relatively difficult to parse. HTML controls come in many types, and a single comprehensive parsing scheme is hard to devise. 2. The HTML code in rich text is not easy to abstract into an integral component. Even when a passage of text, pictures, and links are related to one another, the relationship cannot be clearly identified from the code, because users operate in different ways. 3. The main advantages of rich text editing are the mixed layout of images and text and the personalized handling of formats such as fonts and blank lines. For some special data, however, such as Jingdong (JD.com) commodity cards and coupons, leaving the formatting to users' personalized editing leads to inconsistent formats and disordered data.

The purely structured scheme also has obvious drawbacks. First, as with rich text, the parsing process is complex. Second, even after the data has been parsed, the scheme is hard to extend: new database fields must be added whenever a new data type is added, and over time the table accumulates too many fields. If different data types share one table, some data types have many fields and others have few, which wastes fields. The purely structured scheme is suited to scenarios in which fields rarely change and the data types are uniform.

In view of the above technical problems, the present invention provides a method that processes unstructured data in segments, simplifies the parsing of the unstructured data, forms a simple and unified data structure, and facilitates editing of the unstructured data, thereby improving the reusability and extensibility of code. FIG. 1 is a schematic diagram of an application scenario of the present invention. As shown in FIG. 1, the server 10 reads the unstructured data to be processed and the paragraph configuration information, and then segments the unstructured data according to the paragraph configuration information to obtain paragraph data. A checker 11 then calls the check rule corresponding to each paragraph type to check the paragraph data and obtain target data, which is finally stored in the database 12. Further, when the unstructured data needs to be presented at the client 20, the target data can be read from the server 10 and reassembled into unstructured data by the client 20 for display on the display interface of the client 20.

The following describes the technical solutions of the present invention and how to solve the above technical problems with specific embodiments. The following several specific embodiments may be combined with each other, and details of the same or similar concepts or processes may not be repeated in some embodiments. Embodiments of the present invention will be described below with reference to the accompanying drawings.

Fig. 2 is a flowchart of a processing method of unstructured data according to an embodiment of the present invention, as shown in fig. 2, the method in this embodiment may include:

S101, reading unstructured data to be processed and paragraph configuration information.

In this embodiment, the unstructured data to be processed is the data that the present invention needs to process. The data may include basic components such as plain text, pictures, commodity SKUs, links, and videos, or combinations of these basic components, for example text with pictures or links with text.

Optionally, the paragraph configuration information includes: the paragraph type, the field name, the field name used for storage, the check rule, and the correspondence between paragraph types and components.

Specifically, the unstructured data to be processed may be regarded as a sequence of individual paragraphs, which can be parsed segment by segment and concurrently according to the paragraph configuration information. The paragraph configuration information defines the different paragraph types and their corresponding check rules, components, database fields, and the like. FIG. 3 is a schematic diagram of paragraph configuration information according to an embodiment of the present invention. As shown in FIG. 3, the first column is the paragraph type; each paragraph type may correspond to a different component, and the client can use it to determine the code format generated when a paragraph is added. The second column is the field name, which represents the text displayed on the page, namely the data content. The third column is the field name used for storage in the database. The fourth column is the check rule, which can be used to judge input validity, check the data content, and supplement additional information. In the paragraph configuration information shown in FIG. 3, data of the text type and data of the picture type correspond to different basic components and check rules.
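As an illustration only, the four columns described for FIG. 3, plus the type-to-component correspondence, could be held in a small lookup table. The following is a minimal sketch, not the patent's implementation: the type codes 1, 2, and 5 and the handler names follow the example given for FIG. 4, while the class name, the display names, the storage field names, and the component names are assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ParagraphConfig:
    paragraph_type: int  # numeric paragraph type, e.g. 1 = plain text
    display_name: str    # field name shown on the page (hypothetical values)
    db_field: str        # field name used for storage (hypothetical values)
    check_rule: str      # name of the check rule to call for this type
    component: str       # client component for this type (hypothetical values)


# One row per paragraph type, keyed by type code for direct lookup.
PARAGRAPH_CONFIG = {
    cfg.paragraph_type: cfg
    for cfg in [
        ParagraphConfig(1, "Text", "text", "TextHandler", "TextComponent"),
        ParagraphConfig(2, "Picture", "image_url", "ImageHandler", "ImageComponent"),
        ParagraphConfig(5, "Commodity SKU", "sku_id", "SkuHandler", "SkuCardComponent"),
    ]
}


def lookup(paragraph_type: int) -> ParagraphConfig:
    """Return the configuration row for a paragraph type (KeyError if unknown)."""
    return PARAGRAPH_CONFIG[paragraph_type]
```

Keying the table by paragraph type means that both the segmentation step and the checking step can find the applicable rule with a single dictionary lookup.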

Optionally, the method further comprises: and carrying out any one or more editing operations of deletion, modification and addition on the paragraph configuration information to obtain the updated paragraph configuration information.

Specifically, the paragraph configuration information is editable and configurable. Any one or more of the editing operations of deletion, modification, and addition can be applied flexibly, according to the needs of the unstructured data to be processed. For example, if a new live-broadcast-plus-text component is added to the unstructured data to be processed, only a new paragraph type needs to be added, together with its corresponding live broadcast basic component and check rule. If a timed on-shelf/off-shelf function is added to the SKU, the fields startTime and endTime can be added to the original commodity card type.
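The two edits mentioned above (adding a live broadcast paragraph type, and extending the commodity card type with startTime and endTime) can be sketched as pure functions over a configuration dictionary. This is a hedged illustration: the dictionary layout, the type id 30, the field names skuId and liveId, and the name LiveHandler are all hypothetical; only startTime and endTime come from the text.

```python
import copy

# Hypothetical starting configuration, keyed by paragraph type.
config = {
    1: {"name": "text", "fields": ["content"], "check_rule": "TextHandler"},
    5: {"name": "sku", "fields": ["skuId"], "check_rule": "SkuHandler"},
}


def add_paragraph_type(cfg, type_id, name, fields, check_rule):
    """Return a copy of cfg with one new paragraph type added."""
    if type_id in cfg:
        raise ValueError(f"paragraph type {type_id} already exists")
    updated = copy.deepcopy(cfg)
    updated[type_id] = {"name": name, "fields": list(fields), "check_rule": check_rule}
    return updated


def extend_fields(cfg, type_id, new_fields):
    """Return a copy of cfg in which an existing type gains extra fields."""
    updated = copy.deepcopy(cfg)
    for field in new_fields:
        if field not in updated[type_id]["fields"]:
            updated[type_id]["fields"].append(field)
    return updated


# New live broadcast type: id 30 and LiveHandler are invented for illustration.
config_v2 = add_paragraph_type(config, 30, "live", ["liveId"], "LiveHandler")
# Timed listing for the SKU card: startTime and endTime are named in the text.
config_v2 = extend_fields(config_v2, 5, ["startTime", "endTime"])
```

Returning an updated copy rather than mutating in place keeps the previous configuration available, which is one simple way to obtain "updated paragraph configuration information" without affecting data that was parsed under the old configuration.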

S102, carrying out segmentation processing on the unstructured data to obtain paragraph data.

In this embodiment, each component in the unstructured data is read in sequence; if the current component is a single component, converting the component content into paragraph data; if the current component is an integral component, splitting the integral component according to the parent-child relationship of the integral component to obtain a split component; and converting the content of each split component into paragraph data, wherein the content of the single component and the content of the split component only contain one paragraph type.

Specifically, the unstructured data to be processed may include basic components such as plain text, pictures, commodity SKUs, links, and videos, or combinations of these basic components, such as text with pictures or links with text. Such an integral component can be viewed as a permutation and combination of basic components, and parent-child relationships can be established between the components. If a single basic component is encountered, its content is converted directly into paragraph data. If an integral component is encountered, it is split into several basic components according to the parent-child relationship, and the content of each split component is converted into paragraph data. Paragraph data has a unified format with three fields: type, content, and children, where type represents the paragraph type, content represents the paragraph data, and children represents the configuration of the child nodes. FIG. 4 is a schematic diagram of paragraph data according to an embodiment of the present invention. As shown in FIG. 4, the first component is an integral component whose paragraph type is 20, which indicates the aggregation type and corresponds to the aggregation-type check rule. This integral component comprises three basic components, namely a commodity title, an atmosphere image, and a commodity SKU, whose paragraph types are 1 (plain text), 2 (picture), and 5 (SKU) and whose check rules are TextHandler, ImageHandler, and SkuHandler, which process the commodity title, the atmosphere image, and the SKU respectively. The second component is a basic component of paragraph type 1, plain text, corresponding to TextHandler. Based on the requirements defined in the field configuration, each Handler is responsible for checking whether the content of its paragraph meets the requirements and for filling in additional data.
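The segmentation step can be sketched as a short recursive conversion into the type/content/children format described above. This is a minimal sketch under stated assumptions: the input shape (a dict with "type", "content", and an optional "components" list for an integral component) is hypothetical, and the sample values are invented; only the three output field names and the type codes 20, 1, 2, and 5 come from the description.

```python
from typing import Any


def to_paragraph(component: dict[str, Any]) -> dict[str, Any]:
    """Convert one component into paragraph data (type, content, children)."""
    children = component.get("components", [])
    return {
        "type": component["type"],
        "content": component.get("content"),
        # A single component has no children. An integral component is split
        # along its parent-child relationship, one paragraph per child.
        "children": [to_paragraph(child) for child in children],
    }


def segment(components: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """Read the components in order and convert each one into paragraph data."""
    return [to_paragraph(component) for component in components]


# Sample input mirroring the FIG. 4 example: one aggregation-type component
# with three children, followed by one plain-text component.
page = [
    {"type": 20, "components": [
        {"type": 1, "content": "Commodity title"},
        {"type": 2, "content": "https://example.com/banner.png"},
        {"type": 5, "content": "sku-123"},
    ]},
    {"type": 1, "content": "A plain text paragraph"},
]
paragraphs = segment(page)
```

Because every node, parent or leaf, ends up in the same three-field shape, later steps can treat all paragraph types uniformly and branch only on the type field.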

S103, according to the paragraph configuration information, calling a corresponding verification rule to verify the paragraph data to obtain analyzed target data.

In this embodiment, additional information corresponding to the paragraph type is generated according to the verification rule; and supplementing additional information in the paragraph data to obtain target data.

Specifically, the paragraph configuration information defines the different paragraph types and their corresponding check rules, components, database fields, and the like. The check rule corresponding to each paragraph type is called to check the paragraph data concurrently and to assemble the data. FIG. 5 is a schematic diagram of target data provided in an embodiment of the present invention; processing the paragraph data shown in FIG. 4 yields the target data shown in FIG. 5. The difference between the two is that the target data contains additional supplementary data, which is derived from the metadata in the paragraph data. Taking video-type data as an example, FIG. 6(a) is a schematic diagram of paragraph data corresponding to a video type provided in an embodiment of the present invention, and FIG. 6(b) is a schematic diagram of the corresponding target data. The paragraph data shown in FIG. 6(a) records only the video ID, i.e., the videoId field, whereas the target data shown in FIG. 6(b) is supplemented with information such as the image and the video size.
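The checking step can be illustrated with handlers that validate a paragraph and return a supplemented copy, dispatched by paragraph type and run concurrently. This is a sketch, not the patent's checker: the type code 4 for video, the in-memory FAKE_VIDEO_STORE lookup, and its values are assumptions standing in for whatever service supplies the video metadata; only the videoId field and the idea of supplementing image and size information come from the description.

```python
from concurrent.futures import ThreadPoolExecutor

# Stand-in for a video metadata service; the values are invented.
FAKE_VIDEO_STORE = {"v-1": {"image": "https://example.com/cover.png", "size": 1024}}


def text_handler(paragraph):
    """Check that a text paragraph holds non-empty text; nothing is added."""
    content = paragraph["content"]
    if not isinstance(content, str) or not content.strip():
        raise ValueError("text paragraph must contain non-empty text")
    return dict(paragraph)


def video_handler(paragraph):
    """Check the videoId and supplement the paragraph with video metadata."""
    video_id = paragraph["content"]["videoId"]
    meta = FAKE_VIDEO_STORE.get(video_id)
    if meta is None:
        raise ValueError(f"unknown videoId: {video_id}")
    return {**paragraph, "content": {**paragraph["content"], **meta}}


# Type code 1 follows the description; 4 for video is an assumption.
HANDLERS = {1: text_handler, 4: video_handler}


def check(paragraph):
    """Dispatch one paragraph to the handler registered for its type."""
    return HANDLERS[paragraph["type"]](paragraph)


def check_all(paragraphs):
    """Check paragraphs concurrently; map() keeps the results in input order."""
    with ThreadPoolExecutor() as pool:
        return list(pool.map(check, paragraphs))


target = check_all([
    {"type": 1, "content": "hello", "children": []},
    {"type": 4, "content": {"videoId": "v-1"}, "children": []},
])
```

Each handler returns a new dictionary instead of modifying its input, so the original paragraph data stays unchanged while the target data carries the extra fields.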

S104, storing the target data into a database.

In this embodiment, the processed target data is stored in the relational database according to the parent-child relationship and the field correspondence relationship.
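One way to keep the parent-child relationship in a relational database is a self-referencing table in which each child row points at its parent row. The sketch below uses an in-memory SQLite database purely for illustration; the table name, the column names, and the choice to store content as a JSON string are assumptions, and the per-type field correspondence from the paragraph configuration is not modeled here.

```python
import json
import sqlite3


def store(conn, paragraphs, parent_id=None):
    """Insert paragraphs recursively, linking each child to its parent row."""
    for position, paragraph in enumerate(paragraphs):
        cur = conn.execute(
            "INSERT INTO paragraph (parent_id, position, type, content) "
            "VALUES (?, ?, ?, ?)",
            (parent_id, position, paragraph["type"],
             json.dumps(paragraph.get("content"))),
        )
        # Children are stored after their parent so they can reference its id.
        store(conn, paragraph.get("children", []), parent_id=cur.lastrowid)


conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE paragraph (
           id INTEGER PRIMARY KEY,
           parent_id INTEGER REFERENCES paragraph(id),
           position INTEGER NOT NULL,
           type INTEGER NOT NULL,
           content TEXT
       )"""
)
store(conn, [
    {"type": 20, "content": None, "children": [
        {"type": 1, "content": "Commodity title", "children": []},
        {"type": 5, "content": {"skuId": "sku-123"}, "children": []},
    ]},
    {"type": 1, "content": "A plain text paragraph", "children": []},
])
conn.commit()
rows = conn.execute("SELECT id, parent_id, type FROM paragraph ORDER BY id").fetchall()
```

With this layout, adding a new paragraph type does not require a new column, which matches the extensibility goal stated in the description.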

In this embodiment, the unstructured data and paragraph configuration information to be processed are read; carrying out segmentation processing on the unstructured data to obtain paragraph data; according to the paragraph configuration information, calling a corresponding verification rule to verify the paragraph data to obtain analyzed target data; storing the target data in a database. Therefore, the method can perform segmented processing on the unstructured data, simplify the analysis process of the unstructured data, form a simple and unified data structure, and facilitate the editing of the unstructured data, thereby improving the reusability and expansibility of codes.

Fig. 7 is a flowchart of a processing method of unstructured data according to a second embodiment of the present invention, and as shown in fig. 7, the method in this embodiment may include:

S201, reading unstructured data to be processed and paragraph configuration information.

S202, carrying out segmentation processing on the unstructured data to obtain paragraph data.

S203, calling a corresponding verification rule to verify the paragraph data according to the paragraph configuration information to obtain the analyzed target data.

S204, storing the target data into a database.

In this embodiment, for the specific implementation process and technical principle of steps S201 to S204, refer to the description of steps S101 to S104 in the method shown in FIG. 2; details are not repeated here.

S205, retrieving the target data from the database.

S206, creating a single component or an integral component according to the paragraph type of the target data.

S207, taking the target data as the component content of the single component or the integral component.

S208, displaying the single component or the integral component on the client.

In this embodiment, steps S201 to S204 process the unstructured data to obtain structured data and store the structured data in the database. Steps S205 to S208 convert the structured data in the database back into unstructured data and display it on the client.

Specifically, first, target data is read from a database. Then, a single component is created according to the paragraph type of the target data. And if the target data contains a parent-child relationship, assembling the corresponding single components to obtain an integral component. And the target data is taken as the component content of a single component or an integral component and is displayed on the client.
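The reassembly described above can be sketched as the inverse of segmentation: a single component is created per leaf paragraph, and paragraphs with children are assembled into an integral component. In this hedged example the components are rendered as HTML strings; the renderer functions, the wrapper markup, and the sample data are assumptions, while the type codes 1, 2, and 20 follow the earlier example.

```python
from html import escape


def render_text(content):
    """Hypothetical single component for plain text (paragraph type 1)."""
    return f"<p>{escape(content)}</p>"


def render_image(content):
    """Hypothetical single component for a picture (paragraph type 2)."""
    return f'<img src="{escape(content, quote=True)}">'


RENDERERS = {1: render_text, 2: render_image}


def build_component(paragraph):
    """Create a single component, or assemble an integral one from children."""
    children = paragraph.get("children", [])
    if children:
        inner = "".join(build_component(child) for child in children)
        return f'<div class="component">{inner}</div>'
    return RENDERERS[paragraph["type"]](paragraph["content"])


html_out = build_component({
    "type": 20, "content": None, "children": [
        {"type": 1, "content": "Title & more", "children": []},
        {"type": 2, "content": "https://example.com/a.png", "children": []},
    ],
})
```

Escaping the stored content at render time means that the database holds plain data rather than markup, so the same target data could be rendered by a different client component without reparsing HTML.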

In addition, the embodiment can also read the structured data in the database, convert the structured data into unstructured data and display the unstructured data on the client.

Fig. 8 is a schematic structural diagram of a processing apparatus for unstructured data according to a third embodiment of the present invention, and as shown in fig. 8, the processing apparatus for unstructured data according to the present embodiment may include:

a reading module 31, configured to read unstructured data and paragraph configuration information to be processed;

the segmentation module 32 is configured to perform segmentation processing on the unstructured data to obtain paragraph data;

the processing module 33 is configured to invoke a corresponding verification rule to verify the paragraph data according to the paragraph configuration information, so as to obtain analyzed target data;

and a storage module 34 for storing the target data in the database.

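In one possible design, the segmentation module 32 is specifically configured to:

reading each component in the unstructured data in sequence;

if the current component is a single component, converting the component content into paragraph data;

if the current component is an integral component, splitting the integral component according to the parent-child relationship of the integral component to obtain split components;

and converting the content of each split component into paragraph data, wherein the content of the single component and of each split component contains only one paragraph type.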

In one possible design, the processing module 33 is specifically configured to:

generating additional information corresponding to the paragraph type according to the check rule;

and supplementing additional information in the paragraph data to obtain target data.

The processing apparatus for unstructured data of this embodiment may execute the technical solution in the method shown in fig. 2, and for specific implementation processes and technical principles, reference is made to the relevant description in the method shown in fig. 2, which is not described herein again.

Fig. 9 is a schematic structural diagram of an unstructured data processing apparatus according to a fourth embodiment of the present invention, and as shown in fig. 9, the unstructured data processing apparatus according to the present embodiment may further include, on the basis of the apparatus shown in fig. 8:

the editing module 35 is configured to perform any one or more of the editing operations of deletion, modification, and addition on the paragraph configuration information to obtain updated paragraph configuration information.

In one possible design, the apparatus further comprises a display module 36, specifically configured to:

retrieving the target data from the database;

creating a single component or an integral component according to the paragraph type of the target data;

taking the target data as the component content of the single component or the integral component;

displaying the single component or the integral component on the client.

The processing apparatus of unstructured data of this embodiment may execute the technical solutions in the methods shown in fig. 2 and fig. 7, and the specific implementation process and technical principle of the technical solutions refer to the relevant descriptions in the methods shown in fig. 2 and fig. 7, which are not described herein again.

Fig. 10 is a schematic structural diagram of a processing system of unstructured data according to a fifth embodiment of the present invention, and as shown in fig. 10, the processing system 40 of unstructured data according to this embodiment may include: a processor 41 and a memory 42.

A memory 42 for storing programs. The memory 42 may include volatile memory, for example random access memory (RAM) such as static random access memory (SRAM) or double data rate synchronous dynamic random access memory (DDR SDRAM); the memory may also include non-volatile memory, such as flash memory. The memory 42 is used to store computer programs (e.g., applications or functional modules that implement the above-described methods), computer instructions, and the like, which may be stored in partitions across one or more memories 42. The computer programs, computer instructions, data, and the like can be called by the processor 41.

A processor 41 for executing the computer program stored in the memory 42 to implement the steps of the method according to the above embodiments.

Reference may be made in particular to the description relating to the preceding method embodiment.

The processor 41 and the memory 42 may be separate structures or may be integrated structures integrated together. When the processor 41 and the memory 42 are separate structures, the memory 42 and the processor 41 may be coupled by a bus 43.

The processing system of unstructured data of this embodiment may execute the technical solutions in the methods shown in fig. 2 and fig. 7, and the specific implementation process and technical principle of the technical solutions refer to the related descriptions in the methods shown in fig. 2 and fig. 7, which are not described herein again.

In addition, embodiments of the present application further provide a computer-readable storage medium, in which computer-executable instructions are stored, and when at least one processor of the user equipment executes the computer-executable instructions, the user equipment performs the above-mentioned various possible methods.

Computer-readable media include both computer storage media and communication media, the latter including any medium that facilitates transfer of a computer program from one place to another. A storage medium may be any available medium that can be accessed by a general purpose or special purpose computer. An exemplary storage medium is coupled to the processor such that the processor can read information from, and write information to, the storage medium. Of course, the storage medium may also be integral to the processor. The processor and the storage medium may reside in an application-specific integrated circuit (ASIC). Additionally, the ASIC may reside in user equipment. Of course, the processor and the storage medium may also reside as discrete components in a communication device.

The present application further provides a program product comprising a computer program stored in a readable storage medium, from which the computer program can be read by at least one processor of a server, the execution of the computer program by the at least one processor causing the server to carry out the method of any of the embodiments of the invention described above.

Those of ordinary skill in the art will understand that all or a portion of the steps of the above-described method embodiments may be implemented by hardware associated with program instructions. The program may be stored in a computer-readable storage medium. When executed, the program performs the steps of the method embodiments described above. The aforementioned storage medium includes various media that can store program code, such as ROM, RAM, and magnetic or optical disks.

Finally, it should be noted that: the above embodiments are only used to illustrate the technical solution of the present invention, and not to limit the same; while the invention has been described in detail and with reference to the foregoing embodiments, it will be understood by those skilled in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some or all of the technical features may be equivalently replaced; and the modifications or the substitutions do not make the essence of the corresponding technical solutions depart from the scope of the technical solutions of the embodiments of the present invention.

Claims

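1. A method for processing unstructured data, comprising:

reading unstructured data to be processed and paragraph configuration information;

segmenting the unstructured data to obtain paragraph data;

according to the paragraph configuration information, invoking a corresponding verification rule to verify the paragraph data to obtain parsed target data;

and storing the target data into a database.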

2. The method of claim 1, wherein the paragraph configuration information comprises: paragraph type, field name, field name, verification rule, and a correspondence between paragraph types and components.

3. The method of claim 1, wherein segmenting the unstructured data into paragraph data comprises:

sequentially reading each component in the unstructured data;

4. The method according to claim 2, wherein according to the paragraph configuration information, invoking a corresponding verification rule to verify the paragraph data to obtain the parsed target data, comprising:

5. The method of claim 1, further comprising:

6. The method according to any one of claims 1-5, further comprising:

retrieving the target data from the database;

displaying the single component or the whole component on the client.

7. An apparatus for processing unstructured data, comprising:

and the storage module is used for storing the target data into a database.

8. The apparatus of claim 7, wherein the paragraph configuration information comprises: paragraph type, field name, field name, verification rule, and a correspondence between paragraph types and components.

9. A system for processing unstructured data, comprising: a memory and a processor, wherein the memory stores instructions executable by the processor; and wherein the processor is configured to perform the method for processing unstructured data of any one of claims 1-6 by executing the executable instructions.

10. A computer-readable storage medium, on which a computer program is stored, wherein the program, when executed by a processor, carries out the method for processing unstructured data according to any one of claims 1 to 6.
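For illustration only, the retrieval and display recited in claim 6 might look like the following minimal sketch. It is not part of the claims and assumes details the text does not specify: that each stored paragraph is a JSON record in a hypothetical `paragraph` table, and that the client displays HTML. The names `SCHEMA`, `RENDERERS`, `load_paragraphs`, and `render` are hypothetical.

```python
import html
import json
import sqlite3

# Hypothetical table layout: one row per paragraph, ordered by seq.
SCHEMA = (
    "CREATE TABLE IF NOT EXISTS paragraph "
    "(doc_id TEXT, seq INTEGER, type TEXT, content TEXT)"
)

# Hypothetical renderers mapping a paragraph type back to an HTML component.
RENDERERS = {
    "title": lambda p: f"<h1>{html.escape(p['text'])}</h1>",
    "text": lambda p: f"<p>{html.escape(p['text'])}</p>",
    "image": lambda p: f'<img src="{html.escape(p["src"], quote=True)}">',
}


def load_paragraphs(conn, doc_id, seq=None):
    """Returns all stored paragraphs of a document, or only the one at position seq."""
    sql = "SELECT content FROM paragraph WHERE doc_id = ?"
    params = [doc_id]
    if seq is not None:
        sql += " AND seq = ?"
        params.append(seq)
    sql += " ORDER BY seq"
    return [json.loads(row[0]) for row in conn.execute(sql, params)]


def render(paragraphs):
    """Builds the markup for a single component or for the whole set of components."""
    return "\n".join(RENDERERS[p["type"]](p) for p in paragraphs)
```

Calling `render(load_paragraphs(conn, doc_id, seq=0))` would produce the markup for a single component, while omitting `seq` produces the markup for all components of the document in stored order.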