
CN112966038A - Method and device for extracting structured data from unstructured data

Info

Publication number: CN112966038A
Application number: CN202110262891.8A
Authority: CN
Inventors: 陈洲; 张志恒; 沈云; 莫钧涛
Original assignee: Guotai Epoint Software Co Ltd
Current assignee: Guotai Epoint Software Co Ltd
Priority date: 2021-03-11
Filing date: 2021-03-11
Publication date: 2021-06-15

Abstract

The application relates to a method and a device for extracting structured data from unstructured data, and belongs to the technical field of computers. The method comprises: acquiring a target document; performing data segmentation on the target document to obtain a plurality of data segments; sequentially inputting the data segments into a pre-trained data classification model to obtain each classification label included in each data segment and the data content corresponding to each classification label; storing each classification label and its corresponding data content in a structured database to obtain structured data; and displaying the structured data through a form. This addresses the problem that, when data are entered in unstructured form, different personnel enter them in different, non-uniform ways, which makes document entry and review inefficient. Because the unstructured data in the target document can be displayed as structured data, document entry and review efficiency can be improved, and the accuracy of extraction from the unstructured data can also be improved.

Description

Method and device for extracting structured data from unstructured data

Technical Field

The application relates to a method and a device for extracting structured data from unstructured data, and belongs to the technical field of computers.

Background

Current government procurement systems require the buyer to enter procurement requirements in a structured way so that subsequent functions, such as intelligent bid evaluation, can use them.

At present, a buyer's procurement requirements are divided into a cargo list, technical requirements and the like, and are usually presented as a Word document.

However, different personnel may enter the procurement requirements in different ways, so the resulting documents do not share a uniform format, which makes document entry and review inefficient.

Disclosure of Invention

The application provides a method and a device for extracting structured data from unstructured data, which can solve the problem that, when data are entered in unstructured form, different personnel enter them in different, non-uniform ways, so that document entry and review are inefficient. The application provides the following technical solutions:

in a first aspect, a method for extracting structured data from unstructured data is provided, the method comprising:

acquiring a target document, wherein the target document comprises unstructured data to be extracted;

performing data segmentation on the target document to obtain a plurality of data fragments in the target document;

sequentially inputting the plurality of data segments into a pre-trained data classification model to obtain each classification label included in each data segment and the data content corresponding to each classification label; the data classification model is trained in advance using a plurality of groups of training data, and each group of training data comprises: a plurality of sample data segments and the classification label annotated for each sample data segment;

storing each classification label and corresponding data content into a structured database to obtain structured data;

and displaying the structured data through a form.

Optionally, the performing data segmentation on the target document to obtain a plurality of data segments in the target document includes:

extracting text content in the target document through a file content extraction tool;

and performing data segmentation on the text content according to preset punctuation marks to obtain the plurality of data segments.

Optionally, the classification label is determined based on data extraction requirements of the unstructured data.

Optionally, before the step of sequentially inputting the plurality of data segments into a pre-trained data classification model to obtain each classification label included in each data segment and the data content corresponding to each classification label, the method further includes:

obtaining a sample document;

performing data cutting on the sample document to obtain a plurality of sample data fragments in the sample document;

labeling each sample data fragment according to the data extraction requirement to obtain a corresponding classification label;

inputting the sample data fragments into a preset neural network model to obtain a model result;

and training the neural network model based on a preset loss function, the model result and the classification label to obtain the data classification model.

Optionally, the sample document includes unstructured data related to the data extraction requirements.

Optionally, the data classification model is built based on the Bidirectional Encoder Representations from Transformers (BERT) model.

Optionally, the displaying the structured data through a form includes:

and displaying the structured data in a form through a webpage.

Optionally, the target document is a Word document, the unstructured data to be extracted is stored in the Word document in a non-fixed format, and historical stock documents (documents that already exist) are available.

In a second aspect, an apparatus for extracting structured data from unstructured data is provided, the apparatus comprising:

the document acquisition module is used for acquiring a target document, and the target document comprises unstructured data to be extracted;

the data cutting module is used for carrying out data cutting on the target document to obtain a plurality of data fragments in the target document;

the data classification module is used for sequentially inputting the data segments into a pre-trained data classification model to obtain each classification label included in each data segment and the data content corresponding to each classification label; the data classification model is trained in advance using a plurality of groups of training data, and each group of training data comprises: a plurality of sample data segments and the classification label annotated for each sample data segment;

the structured storage module is used for storing each classification label and corresponding data content into a structured database to obtain structured data;

and the data display module is used for displaying the structured data through a form.

The beneficial effects of this application are as follows. A target document comprising unstructured data to be extracted is acquired; data segmentation is performed on the target document to obtain a plurality of data segments; the data segments are sequentially input into a pre-trained data classification model to obtain each classification label included in each data segment and the data content corresponding to each classification label, where the data classification model is trained in advance using a plurality of groups of training data, each group comprising a plurality of sample data segments and the classification label annotated for each sample data segment; each classification label and its corresponding data content are stored in a structured database to obtain structured data; and the structured data is displayed through a form. This solves the problem that, when data are entered in unstructured form, different personnel enter them in different, non-uniform ways, so that document entry and review are inefficient. Because the unstructured data in the target document can be displayed as structured data, document entry and review efficiency can be improved.

In addition, after the target document is subjected to data segmentation, each data segment is classified and identified by using the data classification model, so that the accuracy of data classification can be improved, and the accuracy of structured data extraction is improved.

The foregoing is only an overview of the technical solutions of the present application. To make these technical solutions clearer and to allow them to be implemented according to the content of the description, a detailed description follows with reference to the preferred embodiments of the present application and the accompanying drawings.

Drawings

FIG. 1 is a flow diagram of a method for extracting structured data from unstructured data as provided by one embodiment of the present application;

FIG. 2 is a block diagram of an apparatus for extracting structured data from unstructured data as provided by one embodiment of the present application;

FIG. 3 is a block diagram of an apparatus for extracting structured data from unstructured data according to one embodiment of the present application.

Detailed Description

The following detailed description of embodiments of the present application will be described in conjunction with the accompanying drawings and examples. The following examples are intended to illustrate the present application but are not intended to limit the scope of the present application.

First, several terms referred to in the present application will be described.

Bidirectional Encoder Representations from Transformers (BERT): the encoder of a bidirectional Transformer, which learns good feature representations for words by running self-supervised learning on a massive corpus. Self-supervised learning refers to supervised learning that runs on data without manual labeling. The network architecture of BERT consists of multiple stacked Transformer layers. The Transformer is an encoder-decoder structure formed by stacking several encoders and decoders. The encoder converts the input corpus into feature vectors; the decoder takes the encoder output and the results predicted so far as input, and outputs the conditional probability of the final result.

Optionally, the embodiments of the present application are described taking an electronic device with processing capability as the execution subject. The electronic device may be a desktop computer, a notebook computer, a tablet computer, a mobile phone, a server, or the like; this embodiment does not limit the type of the electronic device.

FIG. 1 is a flow chart of a method for extracting structured data from unstructured data according to one embodiment of the present application. The method at least comprises the following steps:

Step 101: acquire a target document, wherein the target document comprises unstructured data to be extracted.

The target document may be input by a user through the current electronic device, or it may be transmitted by another device; the source of the target document is not limited in this embodiment.

In this embodiment, the unstructured data to be extracted is stored in the target document in a non-fixed format, and historical stock documents (documents that already exist) are available, so the entry cost can be reduced. In one example, the target document is a Word document in which the unstructured data to be extracted is stored in a non-fixed format rather than in a table, and for which historical stock documents exist.

For example, the content of the target document includes: "Land parcel listing period: October 30, 2018 to 16:30 on November 12, 2018." Here, "October 30, 2018" and "16:30 on November 12, 2018" are the unstructured data to be extracted.

Step 102: perform data segmentation on the target document to obtain a plurality of data segments in the target document.

In one example, data cutting is performed on a target document to obtain a plurality of data segments in the target document, including: extracting text content in the target document through a file content extraction tool; and performing data segmentation on the text content according to the preset punctuation marks to obtain a plurality of data fragments.

The file content extraction tool is used to extract the text content of the target document. For example, the file content extraction tool is Apache Tika, the content extraction toolkit from the Apache Software Foundation, or a self-developed tool; the implementation of the file content extraction tool is not limited in this embodiment.
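Where a self-developed tool is preferred, text extraction from a Word file can be sketched with the Python standard library alone. This is a minimal sketch, not the tool used in the application: it assumes the target document is in the .docx (Office Open XML) format, where the body text lives in `word/document.xml`, and the function name `extract_docx_text` is hypothetical. Legacy .doc files are not handled.

```python
import zipfile
import xml.etree.ElementTree as ET

# Namespace of WordprocessingML elements inside a .docx package.
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"

def extract_docx_text(path_or_file):
    """Return the text of a .docx file, one line per non-empty paragraph.

    Hypothetical helper: accepts a file path or a binary file-like object.
    """
    with zipfile.ZipFile(path_or_file) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    paragraphs = []
    for paragraph in root.iter(W_NS + "p"):
        # A paragraph's text is spread over one or more <w:t> runs.
        text = "".join(node.text or "" for node in paragraph.iter(W_NS + "t"))
        if text:
            paragraphs.append(text)
    return "\n".join(paragraphs)
```

The returned string can then be passed to the data cutting step.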

The preset punctuation marks include, but are not limited to, at least one of the following: periods, commas and semicolons. Of course, the preset punctuation marks can be adapted to the data cutting requirement; their implementation is not limited in this embodiment.
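The punctuation-based cutting described above can be sketched as follows. This is a minimal sketch that assumes the text content has already been extracted; the function name `split_into_segments` and the default punctuation set (full-width and ASCII periods, commas and semicolons) are illustrative assumptions rather than values given in the application.

```python
import re

# Assumed default set: full-width and ASCII periods, commas and semicolons.
DEFAULT_PUNCTUATION = "。，；.,;"

def split_into_segments(text, punctuation=DEFAULT_PUNCTUATION):
    """Cut extracted text into data segments at the preset punctuation marks."""
    pattern = "[" + re.escape(punctuation) + "]"
    pieces = re.split(pattern, text)
    # Drop empty pieces produced by consecutive or trailing punctuation.
    return [piece.strip() for piece in pieces if piece.strip()]

segments = split_into_segments(
    "Cargo list: desks; chairs. Delivery within 30 days, on site."
)
```

Splitting on an ASCII period or comma also cuts numbers such as "16.30" or "1,000", so the punctuation set should be chosen to suit the documents at hand.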

Step 103: sequentially input the plurality of data segments into a pre-trained data classification model to obtain each classification label included in each data segment and the data content corresponding to each classification label. The data classification model is trained in advance using a plurality of groups of training data, and each group of training data comprises a plurality of sample data segments and the classification label annotated for each sample data segment.

The classification label is determined based on the data extraction requirements of the unstructured data. For example, if the data extraction requirement is to extract procurement information, and the procurement information includes a cargo list and technical requirements, the classification labels include each cargo name in the cargo list and each requirement in the technical requirements. As another example, if the data extraction requirement is to extract listing time information, the classification labels include the listing start time and the listing deadline.

Optionally, before the plurality of data segments are sequentially input into the pre-trained data classification model to obtain each classification label included in each data segment and the data content corresponding to each classification label, the data classification model needs to be obtained through training.

The training process of the data classification model comprises the following steps: obtaining a sample document; performing data cutting on the sample document to obtain a plurality of sample data segments in the sample document; labeling each sample data segment according to the data extraction requirement to obtain the corresponding classification label; inputting the sample data segments into a preset neural network model to obtain a model result; and training the neural network model based on a preset loss function, the model result and the annotated classification labels to obtain the data classification model.

Wherein the sample document includes unstructured data related to data extraction requirements.

Each sample data segment may be labeled with an automatic labeling tool or manually by a user; the manner of annotating classification labels is not limited in this embodiment.

The preset loss function measures the difference between the model result and the annotated classification label, and training minimizes it. The preset loss function includes, but is not limited to, at least one of the following: a negative log loss (negative log-likelihood) function, an L1 loss function, and an L2 loss function. In other implementations, the preset loss function may also be another type of loss function; these are not enumerated here.
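For reference, the three loss functions named above can be written out for a single data segment. This is a minimal sketch in plain Python; the function names, the mean reduction used for the L1 and L2 losses, and the example probabilities are assumptions made for illustration.

```python
import math

def negative_log_loss(probabilities, target_index):
    """Negative log of the probability the model assigns to the true label."""
    return -math.log(probabilities[target_index])

def l1_loss(predicted, target):
    """Mean absolute difference between the predicted and target vectors."""
    return sum(abs(p - t) for p, t in zip(predicted, target)) / len(target)

def l2_loss(predicted, target):
    """Mean squared difference between the predicted and target vectors."""
    return sum((p - t) ** 2 for p, t in zip(predicted, target)) / len(target)

# Example: the model gives 0.7 to the true label (index 0) out of three labels.
model_result = [0.7, 0.2, 0.1]
one_hot_label = [1.0, 0.0, 0.0]
nll = negative_log_loss(model_result, 0)
l1 = l1_loss(model_result, one_hot_label)
l2 = l2_loss(model_result, one_hot_label)
```

All three values shrink toward zero as the model result approaches the annotated label, which is the property training relies on.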

Illustratively, the data classification model is built based on the BERT model. In other words, the preset neural network model includes a BERT model, and certainly, the preset neural network model may also be a combination of the BERT model and other neural network models, and the implementation manner of the data classification model is not limited in this embodiment.
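The training loop described above (sample segments in, model result out, loss against the annotated labels, parameter update) can be illustrated at toy scale. The sketch below deliberately swaps in a bag-of-words softmax classifier for the BERT-based network so that it runs with the standard library only; it is a stand-in for the training procedure, not the model of the application, and every name, the sample data and the hyperparameters are assumptions.

```python
import math

def featurize(segment, vocabulary):
    """Bag-of-words counts; a stand-in for a learned text representation."""
    words = segment.lower().split()
    return [float(words.count(word)) for word in vocabulary]

def softmax(scores):
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(score - peak) for score in scores]
    total = sum(exps)
    return [value / total for value in exps]

def predict_probabilities(weights, features):
    scores = [sum(w * x for w, x in zip(row, features)) for row in weights]
    return softmax(scores)

def train(samples, labels, epochs=200, learning_rate=0.5):
    """samples: (sample segment, annotated label) pairs. Returns (vocabulary, weights)."""
    vocabulary = sorted({w for segment, _ in samples for w in segment.lower().split()})
    weights = [[0.0] * len(vocabulary) for _ in labels]
    for _ in range(epochs):
        for segment, label in samples:
            features = featurize(segment, vocabulary)
            probabilities = predict_probabilities(weights, features)
            target = labels.index(label)
            # Gradient of the negative log loss w.r.t. each score is (p - one_hot).
            for k, row in enumerate(weights):
                error = probabilities[k] - (1.0 if k == target else 0.0)
                for j, x in enumerate(features):
                    row[j] -= learning_rate * error * x
    return vocabulary, weights

def classify(segment, vocabulary, weights, labels):
    probabilities = predict_probabilities(weights, featurize(segment, vocabulary))
    return labels[probabilities.index(max(probabilities))]

LABELS = ["listing start time", "listing deadline"]
SAMPLES = [
    ("listing starts on october 30", "listing start time"),
    ("listing ends on november 12", "listing deadline"),
    ("bidding starts on march 1", "listing start time"),
    ("bidding ends on april 2", "listing deadline"),
]
vocabulary, weights = train(SAMPLES, LABELS)
```

A production system would replace `featurize` and the weight matrix with a fine-tuned BERT encoder and classification head, while the loop structure stays the same.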

Step 104: store each classification label and its corresponding data content in a structured database to obtain structured data.

For example, for the data segment "Land parcel listing period: October 30, 2018 to 16:30 on November 12, 2018", the corresponding classification labels are the listing start time and the listing deadline. The data content corresponding to the listing start time is "October 30, 2018", and the data content corresponding to the listing deadline is "16:30 on November 12, 2018". Accordingly, after storage in the structured database, the structured data shown in the following table is obtained.

Table 1:

Listing start time	October 30, 2018
Listing deadline	16:30 on November 12, 2018
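Storing the label and content pairs can be sketched with SQLite. This is a minimal sketch under stated assumptions: the application does not name a database product or schema, so the table name `structured_data`, its columns, the `document_id` key and both function names are hypothetical.

```python
import sqlite3

def store_structured_data(connection, document_id, labelled_content):
    """Store (classification label, data content) pairs for one document."""
    connection.execute(
        "CREATE TABLE IF NOT EXISTS structured_data ("
        "document_id TEXT NOT NULL, label TEXT NOT NULL, content TEXT NOT NULL)"
    )
    connection.executemany(
        "INSERT INTO structured_data (document_id, label, content) VALUES (?, ?, ?)",
        [(document_id, label, content) for label, content in labelled_content],
    )
    connection.commit()

def load_structured_data(connection, document_id):
    """Return the stored (label, content) rows in insertion order."""
    cursor = connection.execute(
        "SELECT label, content FROM structured_data "
        "WHERE document_id = ? ORDER BY rowid",
        (document_id,),
    )
    return cursor.fetchall()

connection = sqlite3.connect(":memory:")
store_structured_data(
    connection,
    "doc-1",  # hypothetical document identifier
    [
        ("Listing start time", "October 30, 2018"),
        ("Listing deadline", "16:30 on November 12, 2018"),
    ],
)
rows = load_structured_data(connection, "doc-1")
```

Parameterized queries are used so that extracted content is never interpolated into SQL text.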

Step 105: display the structured data through a form.

In one example, structured data is displayed through a form, including: the structured data is displayed in the form of a form through a web page.
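One way to display the stored rows in a web page is to render them as an HTML table. The sketch below is illustrative only: the application does not specify the page markup, and the function name `render_form` and the two-column layout are assumptions.

```python
from html import escape

def render_form(rows):
    """Render (classification label, data content) rows as an HTML table."""
    lines = ["<table>"]
    for label, content in rows:
        # Escape both fields so extracted text cannot inject markup.
        lines.append(
            "  <tr><th>" + escape(label) + "</th><td>" + escape(content) + "</td></tr>"
        )
    lines.append("</table>")
    return "\n".join(lines)

page_fragment = render_form(
    [
        ("Listing start time", "October 30, 2018"),
        ("Listing deadline", "16:30 on November 12, 2018"),
    ]
)
```

The resulting fragment can be embedded in any page template served to the reviewer.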

In summary, in the method for extracting structured data from unstructured data provided in this embodiment, a target document comprising unstructured data to be extracted is acquired; data segmentation is performed on the target document to obtain a plurality of data segments; the data segments are sequentially input into a pre-trained data classification model to obtain each classification label included in each data segment and the data content corresponding to each classification label, where the data classification model is trained in advance using a plurality of groups of training data, each group comprising a plurality of sample data segments and the classification label annotated for each sample data segment; each classification label and its corresponding data content are stored in a structured database to obtain structured data; and the structured data is displayed through a form. This solves the problem that, when data are entered in unstructured form, different personnel enter them in different, non-uniform ways, so that document entry and review are inefficient. Because the unstructured data in the target document can be displayed as structured data, document entry and review efficiency can be improved.

FIG. 2 is a block diagram of an apparatus for extracting structured data from unstructured data according to one embodiment of the present application. The apparatus comprises at least the following modules: a document acquisition module 210, a data cutting module 220, a data classification module 230, a structured storage module 240, and a data display module 250.

A document obtaining module 210, configured to obtain a target document, where the target document includes unstructured data to be extracted;

the data cutting module 220 is configured to perform data cutting on the target document to obtain a plurality of data segments in the target document;

the data classification module 230 is configured to sequentially input the plurality of data segments into a pre-trained data classification model, so as to obtain each classification label included in each data segment and the data content corresponding to each classification label; the data classification model is trained in advance using a plurality of groups of training data, and each group of training data comprises: a plurality of sample data segments and the classification label annotated for each sample data segment;

a structured storage module 240, configured to store each classification tag and corresponding data content in a structured database to obtain structured data;

and a data display module 250, configured to display the structured data through a form.

For relevant details reference is made to the above-described method embodiments.
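The division into five modules can be pictured as a pipeline of interchangeable parts. The following is a rough sketch under assumptions: the class name `ExtractionApparatus`, the callable signatures and the stub modules in the usage example are hypothetical, and only the module roles and reference numerals come from the description.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

LabelledContent = List[Tuple[str, str]]  # (classification label, data content)

@dataclass
class ExtractionApparatus:
    acquire: Callable[[str], str]               # document acquisition module 210
    cut: Callable[[str], List[str]]             # data cutting module 220
    classify: Callable[[str], LabelledContent]  # data classification module 230
    store: Callable[[LabelledContent], None]    # structured storage module 240
    display: Callable[[LabelledContent], str]   # data display module 250

    def run(self, source: str) -> str:
        text = self.acquire(source)
        extracted: LabelledContent = []
        # Segments are fed to the classifier one after another, in order.
        for segment in self.cut(text):
            extracted.extend(self.classify(segment))
        self.store(extracted)
        return self.display(extracted)

# Usage with stub modules standing in for the real implementations.
stored: LabelledContent = []
apparatus = ExtractionApparatus(
    acquire=lambda source: source,
    cut=lambda text: [part.strip() for part in text.split(";") if part.strip()],
    classify=lambda segment: [("note", segment)],
    store=stored.extend,
    display=lambda rows: " | ".join(label + "=" + content for label, content in rows),
)
output = apparatus.run("desks; chairs")
```

Because each module is passed in as a callable, the functions can be reallocated among modules without changing the pipeline itself.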

It should be noted that, when the apparatus provided in the above embodiment extracts structured data from unstructured data, the division into the functional modules described above is only an example. In practical applications, the functions may be allocated to different functional modules as needed; that is, the internal structure of the apparatus may be divided into different functional modules to perform all or part of the functions described above. In addition, the apparatus and the method for extracting structured data from unstructured data provided in the above embodiments belong to the same concept; the specific implementation process of the apparatus is described in the method embodiments and is not repeated here.

FIG. 3 is a block diagram of an apparatus for extracting structured data from unstructured data according to one embodiment of the present application. The apparatus comprises at least a processor 301 and a memory 302.

Processor 301 may include one or more processing cores, such as: 4 core processors, 8 core processors, etc. The processor 301 may be implemented in at least one hardware form of a DSP (Digital Signal Processing), an FPGA (Field-Programmable Gate Array), and a PLA (Programmable Logic Array). The processor 301 may also include a main processor and a coprocessor, where the main processor is a processor for Processing data in an awake state, and is also called a Central Processing Unit (CPU); a coprocessor is a low power processor for processing data in a standby state. In some embodiments, the processor 301 may be integrated with a GPU (Graphics Processing Unit), which is responsible for rendering and drawing the content required to be displayed on the display screen. In some embodiments, the processor 301 may further include an AI (Artificial Intelligence) processor for processing computing operations related to machine learning.

Memory 302 may include one or more computer-readable storage media, which may be non-transitory. Memory 302 may also include high speed random access memory, as well as non-volatile memory, such as one or more magnetic disk storage devices, flash memory storage devices. In some embodiments, a non-transitory computer readable storage medium in memory 302 is used to store at least one instruction for execution by processor 301 to implement a method for extracting structured data from unstructured data as provided by method embodiments herein.

In some embodiments, the means for extracting the structured data from the unstructured data may further comprise: a peripheral interface and at least one peripheral. The processor 301, memory 302 and peripheral interface may be connected by bus or signal lines. Each peripheral may be connected to the peripheral interface via a bus, signal line, or circuit board. Illustratively, peripheral devices include, but are not limited to: radio frequency circuit, touch display screen, audio circuit, power supply, etc.

Of course, the device for extracting structured data from unstructured data may also include fewer or more components, which is not limited by the embodiment.

Optionally, the present application further provides a computer-readable storage medium, in which a program is stored, and the program is loaded and executed by a processor to implement the method for extracting structured data from unstructured data of the above method embodiment.

Optionally, the present application further provides a computer product, which includes a computer-readable storage medium, in which a program is stored, and the program is loaded and executed by a processor to implement the method for extracting structured data from unstructured data of the above-mentioned method embodiment.

The technical features of the embodiments described above may be arbitrarily combined, and for the sake of brevity, all possible combinations of the technical features in the embodiments described above are not described, but should be considered as being within the scope of the present specification as long as there is no contradiction between the combinations of the technical features.

The above embodiments express only several implementations of the present application. Their description is specific and detailed, but should not be construed as limiting the scope of the invention. It should be noted that a person skilled in the art can make several variations and modifications without departing from the concept of the present application, all of which fall within the scope of protection of the present application. Therefore, the protection scope of the present patent shall be subject to the appended claims.

Claims

1. A method of extracting structured data from unstructured data, the method comprising:

and displaying the structured data through a form.

2. The method of claim 1, wherein the data cutting the target document to obtain a plurality of data segments in the target document comprises:

3. The method of claim 1, wherein the classification label is determined based on data extraction requirements of the unstructured data.

4. The method of claim 3, wherein before the data segments are sequentially input into a pre-trained data classification model to obtain each classification label included in each data segment and the data segment corresponding to each classification label, the method further comprises:

obtaining a sample document;

5. The method of claim 4, wherein the sample document includes unstructured data related to the data extraction requirements.

6. The method of claim 1, wherein the data classification model is built based on the Bidirectional Encoder Representations from Transformers (BERT) model.

7. The method of claim 1, wherein the displaying the structured data through a form comprises:

and displaying the structured data in a form through a webpage.

8. The method of claim 1, wherein the target document is a Word document, the unstructured data to be extracted is stored in the Word document in a non-fixed format, and historical stock documents exist.

9. An apparatus for extracting structured data from unstructured data, the apparatus comprising:

10. The apparatus of claim 9, wherein the target document is a Word document, the unstructured data to be extracted is stored in the Word document in a non-fixed format, and historical stock documents exist.