WO2021036181A1

WO2021036181A1 - Data extraction method and device, storage medium and equipment

Info

Publication number: WO2021036181A1
Application number: PCT/CN2020/071879
Authority: WO
Inventors: 吴文旷
Original assignee: 北京国双科技有限公司
Priority date: 2019-08-26
Filing date: 2020-01-14
Publication date: 2021-03-04
Also published as: CN111475641B; CN111475641A

Abstract

A data extraction method and device, a storage medium, and equipment. The method comprises: acquiring manually annotated triples on the basis of tags manually added to characters in a first group of documents (S101); determining automatically annotated triples according to triples identified by a preset model from a second group of documents (S102), the preset model being a model that is preset and adapted to the type of the second group of documents, and the model being obtained by training with training data that comprise the manually annotated triples and the first group of documents; and using the manually annotated triples and the automatically annotated triples as knowledge data extracted from the documents (S103). The method can improve the utilization rate of useful information in the documents, and the obtained knowledge data are more comprehensive.

Description

Data extraction method, device, storage medium and equipment

This application claims priority to Chinese patent application No. 201910789378.7, filed with the Chinese Patent Office on August 26, 2019 and entitled "a data extraction method, device, storage medium and equipment", the entire content of which is incorporated into this application by reference.

Technical field

The present invention relates to the field of electronic information, in particular to a data extraction method, device, storage medium and equipment.

Background art

In the process of oilfield exploration, development, and production, a large number of scientific and technological achievements have been accumulated in the form of documents, including high-value documents such as exploration deployments, oil and gas reservoir descriptions, development plans, research reports, and archives. These documents contain a great deal of useful information, such as the name of the oil field, the time of development and production, daily oil production, oil and gas reservoir traps, reservoir lithology, thickness, and net-to-gross ratio. This information helps scientific research personnel engaged in exploration and development to quickly retrieve data, analyze data, and discover the potential value of data.

However, the useful information in these documents is unstructured, which makes it inconvenient for researchers to query and use; that is, the utilization rate of the useful information in the documents is low.

Summary of the invention

In view of the above-mentioned problems, the present invention provides a data extraction method, device, storage medium, and equipment that overcome the above-mentioned problems or at least partially solve the above-mentioned problems.

This application provides a data extraction method, including:

Based on the tags manually added to the characters in the first set of documents, obtain the manually labeled triples;

According to the triples identified from the second set of documents by the preset model, determine the automatically labeled triples; wherein the preset model is a preset model adapted to the type of the second set of documents, the model is obtained by training using training data, and the training data includes the manually labeled triples and the first set of documents;

The manually labeled triples and the automatically labeled triples are used as knowledge data extracted from the documents.

Optionally, the process of obtaining the tags manually added to the characters in the first group of documents includes:

Based on the operation of manually selecting characters in the first set of documents, a list of candidate entity tags is displayed, the candidate entity tags being determined according to the business requirements of the field to which the first set of documents and the second set of documents belong;

Use the tag manually selected from the list of candidate entity tags as the entity tag of the selected characters, to obtain the marked characters;

Based on the operation of manually selecting the marked characters, a list of candidate relationships between entity tags is displayed, the candidate relationships between the entity tags being determined according to the business requirements of the field to which the first set of documents and the second set of documents belong;

The relationship manually selected from the list of to-be-selected relationships is used as the relationship label of the selected labeled character.

Optionally, the training data further includes at least one of the following:

The positions of the elements of the manually labeled triples in the first set of documents, the distances between the elements of the manually labeled triples in the first set of documents, and the grammatical relationships between the elements of the manually labeled triples in the first set of documents.

Optionally, determining the automatically labeled triples according to the triples identified from the second set of documents by the preset model includes:

Annotate target triples, where the target triples are at least one of the following: among the triples identified by the preset model from the second set of documents, triples that contradict one another; among the triples identified by the preset model from the second set of documents, triples that contradict the manually labeled triples; and triples with missing items;

Obtain a manual correction result for the target triplet and use it as the automatically labeled triplet.

Optionally, use the correction result of the target triple and the second set of documents to retrain the model.

Optionally, the type of the first group of documents is the same as the type of the second group of documents;

The process of determining the model adapted to the type of the second set of documents includes:

In the training process, the model with the highest accuracy of the triples identified from the first set of documents is used as a model adapted to the type of the second set of documents.

This application also provides a data extraction device, including:

The first obtaining module is configured to obtain manually labeled triples based on the tags manually added to the characters in the first set of documents;

The determining module is configured to determine the automatically labeled triples according to the triples identified from the second set of documents by the preset model; wherein the preset model is a preset model adapted to the type of the second set of documents, the model is obtained by training using training data, and the training data includes the manually labeled triples and the first set of documents;

The execution module is configured to use the manually labeled triples and the automatically labeled triples as knowledge data extracted from the documents.

Optionally, it further includes: a second obtaining module, configured to obtain the tags manually added to the characters in the first group of documents;

The second obtaining module is specifically configured to display a list of candidate entity tags based on an operation of manually selecting characters in the first set of documents, the candidate entity tags being determined according to the business requirements of the field to which the first set of documents and the second set of documents belong;

Optionally, the training data further includes at least one of the following:

The positions of the elements of the manually labeled triples in the first document, the distances between the elements of the manually labeled triples in the first document, and the grammatical relationships between the elements of the manually labeled triples in the first document.

Optionally, the determining module being configured to determine the automatically labeled triples according to the triples identified from the second document by the preset model includes:

The determining module is specifically configured to annotate target triples, the target triples being at least one of the following: among the triples identified by the preset model from the second document, triples that contradict one another; among the triples identified by the preset model from the second document, triples that contradict the manually labeled triples; and triples with missing items;

Optionally, it also includes a training module;

The training module is configured to use the correction result of the target triplet and the second document to retrain the model.

Optionally, it further includes an adapted-model determination module, configured to use the model with the highest accuracy of the triples identified from the first document in the training process as the model adapted to the type of the second document; the type of the first document is the same as the type of the second document.

The present application also provides a storage medium, the storage medium includes a stored program, wherein the program executes any one of the aforementioned data extraction methods.

The present application also provides a device, including: a processor, a memory, and a bus; the processor and the memory are connected through the bus;

The memory is used to store a program, and the processor is used to run a program, wherein, when the program is running, any one of the aforementioned data extraction methods is executed.

In the data extraction scheme provided by the present invention, the manually labeled triples are obtained based on the tags manually added to the characters in the first set of documents, and the automatically labeled triples are determined according to the triples identified from the second set of documents by the preset model. The manually labeled triples and the automatically labeled triples are used as knowledge data extracted from the documents. That is, the knowledge data obtained by the present invention are triples; because a triple is structured data, it is convenient for users to query and use. Therefore, the solution of the present invention can improve the utilization rate of useful information in the documents.

In addition, in the present invention, both the manually labeled triples and the automatically labeled triples are used as knowledge data. The automatically labeled triples are determined according to the triples identified from the second set of documents by the preset model, and the model is a model adapted to the type of the second set of documents, obtained by training with the manually labeled triples and the first set of documents as training samples. Since the manually labeled triples are the triples in the training samples and the automatically labeled triples are determined based on the triples obtained by the preset model in the testing process, the knowledge data obtained by the present invention includes both the triples in the training samples and the triples derived from those obtained in the testing process, which makes the knowledge data obtained by the present invention more comprehensive.

The above description is only an overview of the technical solution of the present invention. So that the technical means of the present invention can be understood more clearly and implemented in accordance with the content of the specification, and so that the above and other objectives, features, and advantages of the present invention are more obvious and understandable, specific embodiments of the present invention are set out below.

Description of the drawings

By reading the detailed description of the preferred embodiments below, various other advantages and benefits will become clear to those of ordinary skill in the art. The drawings are only used for the purpose of illustrating the preferred embodiments, and are not considered as a limitation to the present invention. Also, throughout the drawings, the same reference symbols are used to denote the same components. In the drawings:

Fig. 1 shows a schematic flowchart of a data extraction method disclosed in an embodiment of the present application;

Fig. 2 shows a schematic flowchart of a model training method disclosed in an embodiment of the present application;

FIG. 3 shows a schematic flowchart of another data extraction method disclosed in an embodiment of the present application;

Figure 4 shows a schematic structural diagram of a data extraction device disclosed in an embodiment of the present application;

Fig. 5 shows a schematic structural diagram of a device disclosed in an embodiment of the present application.

Detailed description

Hereinafter, exemplary embodiments of the present disclosure will be described in more detail with reference to the accompanying drawings. Although the drawings show exemplary embodiments of the present disclosure, it should be understood that the present disclosure can be implemented in various forms and should not be limited by the embodiments set forth herein. On the contrary, these embodiments are provided to enable a more thorough understanding of the present disclosure and to fully convey the scope of the present disclosure to those skilled in the art.

In the embodiments of the present application, the documents used for training the model are referred to as the first set of documents, and the documents in the testing process are referred to as the second set of documents. Specifically, which documents are the first group of documents and which are the second group of documents can be determined according to actual conditions, and this embodiment does not limit it.

Figure 1 is a data extraction method provided by an embodiment of this application, including the following steps:

S101: Based on the tags manually added to the characters in the first set of documents, obtain a manually labeled triplet.

In this step, the triplet generally includes two entities and an entity relationship, where the entity relationship is used to reflect the relationship between the two entities.

For example, in the field of petroleum exploration, a manually labeled triple obtained from the first set of documents can be "oil field a, 30 million tons, production", where "oil field a" is an entity, "30 million tons" is also an entity, and "production" is the relationship between the entity "oil field a" and the entity "30 million tons".
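As an illustration only, the triple in the example above could be held in a small structured record. The following is a minimal sketch; the class name `Triple` and the field names `head`, `relation`, and `tail` are assumptions made for this illustration and do not appear in the application.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    """One knowledge item: two entities joined by a relationship."""
    head: str      # first entity, e.g. the oil field
    relation: str  # relationship between the two entities
    tail: str      # second entity, e.g. the quantity


# The example from the text, written as a structured record.
example = Triple(head="oil field a", relation="production", tail="30 million tons")
print(example.relation)
```

Because each element has a named field, such a record can be filtered or indexed by entity or by relationship, which is the sense in which a triple is easier to query than free text.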

S102: Determine the automatically labeled triples according to the triples identified from the second set of documents by the preset model.

The preset model is a preset model adapted to the type of the second set of documents. The model is trained using training data. The training data includes manually labeled triples and the first set of documents.

Specifically, the training process of the model includes a forward propagation process and a back propagation process. In the forward propagation process, the model identifies triples from the first set of documents. In the back propagation process, the loss function value between the identified triples and the manually labeled triples is calculated according to a preset loss function, and the parameters in the model are adjusted with the goal of reducing the loss function value. Training then continues with the adjusted parameters until the loss function value is not greater than a preset threshold, at which point model training is complete.

It should be noted that the specific content of the loss function can be referred to the prior art, which will not be repeated here.
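The stopping rule described above (repeat the forward pass, loss calculation, and parameter adjustment until the loss is not greater than a preset threshold) can be sketched on a toy problem. The one-parameter linear model, the mean squared error loss, and the names `train_until_threshold`, `threshold`, `lr`, and `max_steps` are all assumptions for this illustration; the application does not specify the model structure or the loss function.

```python
def train_until_threshold(samples, threshold=1e-4, lr=0.05, max_steps=10_000):
    """Fit y = w * x by gradient descent; stop once the loss is <= threshold."""
    w = 0.0
    loss = float("inf")
    for _ in range(max_steps):
        # Forward pass: predictions with the current parameter.
        preds = [w * x for x, _ in samples]
        # Loss: mean squared error between predictions and labels.
        loss = sum((p - y) ** 2 for p, (_, y) in zip(preds, samples)) / len(samples)
        if loss <= threshold:
            break  # training is complete
        # Backward pass: gradient of the loss with respect to w.
        grad = sum(2 * (p - y) * x for p, (x, y) in zip(preds, samples)) / len(samples)
        # Adjust the parameter in the direction that reduces the loss.
        w -= lr * grad
    return w, loss


# Toy labeled data generated by y = 2 * x.
w, final_loss = train_until_threshold([(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)])
```

The `max_steps` cap is a safeguard added for the sketch, so that the loop ends even if the threshold is never reached.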

The acquisition process of manually labeling triples and the adaptation process between the model and the document type will be described in the embodiment shown in FIG. 2.

S103: Use the manually labeled triples and the automatically labeled triples as knowledge data extracted from the documents.

Because triples are structured data, it is convenient for users to query and use. Therefore, in this embodiment, the characters in the document are converted into triples, which can improve the utilization rate of useful information in the document.

In addition, in this embodiment, both the manually labeled triples and the automatically labeled triples are used as knowledge data. The automatically labeled triples are determined according to the triples identified from the second set of documents by the preset model, and the model is obtained by training with the manually labeled triples and the first set of documents as training samples. Since the manually labeled triples are the triples in the training samples and the automatically labeled triples are determined based on the triples obtained by the preset model in the testing process, the knowledge data obtained by the present invention includes both the triples in the training samples and the triples derived from those obtained in the testing process, which makes the knowledge data obtained by the present invention more comprehensive.

In addition, the preset model is a preset model adapted to the type of the second set of documents, so that the triples of the second set of documents can be more accurately identified.

It should be noted that the above-mentioned first set of documents and second set of documents can be documents in any field; that is, the above-mentioned data extraction method can be applied to any field in which documents are generated. The following embodiments take the field of oil exploration as an example.

Figure 2 is a model training method provided by an embodiment of the application, including the following steps:

S201. Obtain training samples.

In this embodiment, the training sample includes: the first set of documents and the triples marked in the first set of documents.

Specifically, in this step, the process of obtaining the triples marked in the first set of documents includes steps A1 to A6:

A1. Get the first set of documents.

In this embodiment, the first set of documents are documents generated in the process of oil and gas exploration, development, and production, and the format of the first set of documents may be Word, PPT, PDF, Excel, JPG, PNG, or other formats. The specific method for obtaining the first set of documents can be: receiving the first set of documents and extracting the characters in the documents (if a document is an image, OCR is performed on it) to obtain the recognized characters, which makes it possible to manually label triples for the characters in the first set of documents.

A2. Based on the operation of manually selecting characters in the first set of documents, a list of to-be-selected entity tags is displayed.

In this step, in the case of manually selecting characters in the first set of documents, a list of entity tags to be selected is displayed, that is, a list of entity tags for manual selection is displayed.

In this embodiment, the candidate entity tags set in the list of candidate entity tags are determined according to the business requirements of the petroleum exploration field to which the first group of documents and the second group of documents belong. For example, the entity tags in the field of petroleum exploration include, but are not limited to: the name of the oil and gas field, the time of development and production, daily oil production, oil and gas reservoir traps, reservoir lithology, thickness, and net-to-gross ratio.

A3. Use the label manually selected from the list of entity labels to be selected as the entity label of the selected character to obtain the marked character.

In this step, for the characters that have been selected in the first set of documents (that is, the selected characters), the label of the entity to which the selected characters belong is manually selected from the list of candidate entity labels and used as the entity label of the selected characters. For convenience of description, selected characters marked with an entity label are called marked characters (labeled characters).

After the operation of this step, there may be multiple selected characters in the first set of documents, and further, there may be multiple labeled characters.

A4. Based on the operation of manually selecting labeled characters, a list of to-be-selected relationships between entity tags is displayed.

Relationships exist among the labeled characters. Therefore, in this step, when labeled characters have been manually selected, a list of candidate relationships between entity tags is displayed, so that the relationship between the selected labeled characters can be manually selected from the list of candidate relationships.

In this embodiment, the candidate relationships between the entity tags are determined according to the business requirements of the petroleum exploration field to which the first set of documents and the second set of documents belong. For example, the candidate relationships between the entity tags include "the output of entity 1", where "entity 1" is the number assigned to a labeled character as an entity. Suppose that in the passage "The output of oil field a is 30 million tons", the selected characters "oil field a" are marked as entity 1 and "30 million tons" is marked as entity 2; then the relationship label of entity 2 is "the output of entity 1".

A5. Use the relationship manually selected from the list of to-be-selected relationships as the relationship label of the selected labeled character.

The above steps A1 to A5 are the process of manually adding tags to the characters in the first set of documents.

Through the above steps A1 to A5, the entity tags and the relationship tags of the relationships between entities indicated by different entity tags are obtained, and the corresponding relationship between the entity tags and the relationship tags is also obtained.

A6. Obtain manually labeled triples based on manually added entity tags and relationship tags.

In this step, the triple indicated by each correspondence is obtained from the entity tags, the relationship tags, and the correspondences between them. The triple indicated by each correspondence is obtained in the same way: for any correspondence, the entities indicated by the entity tags in the correspondence and the relationship indicated by the relationship tag constitute a triple, that is, a manually labeled triple is obtained.
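The assembly of step A6 can be sketched as follows. This is a minimal illustration under stated assumptions: the record types `EntityTag` and `RelationTag`, their fields, the function `build_triples`, and the entity label "production volume" are hypothetical and are not defined in the application.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EntityTag:
    entity_id: int   # number assigned to the marked characters
    text: str        # the selected characters
    label: str       # entity label chosen from the candidate list


@dataclass(frozen=True)
class RelationTag:
    source_id: int   # entity the relationship refers to (e.g. entity 1)
    target_id: int   # entity carrying the relationship label (e.g. entity 2)
    relation: str    # relationship chosen from the candidate list


def build_triples(entities, relations):
    """Combine entity tags and relationship tags into (entity, relation, entity) triples."""
    by_id = {e.entity_id: e for e in entities}
    triples = []
    for r in relations:
        # Skip relationship tags that point at an entity that was never marked.
        if r.source_id not in by_id or r.target_id not in by_id:
            continue
        triples.append((by_id[r.source_id].text, r.relation, by_id[r.target_id].text))
    return triples


entities = [EntityTag(1, "oil field a", "oil and gas field name"),
            EntityTag(2, "30 million tons", "production volume")]
relations = [RelationTag(source_id=1, target_id=2, relation="production")]
triples = build_triples(entities, relations)
```

Each relationship tag plays the role of one correspondence: it names the two marked entities it connects, so one triple is produced per tag.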

In this embodiment, the first set of documents and the triples manually labeled from the first set of documents can be used as training samples.

Optionally, in order to improve the accuracy with which the trained model identifies triples from the second set of documents, in this embodiment the training samples also include: the positions of the elements of the manually labeled triples in the first set of documents, the distances between the elements of the manually labeled triples in the first set of documents, and the grammatical relationships between the elements of the manually labeled triples in the first set of documents.

Each manually labeled triple corresponds to one set of distances and one set of grammatical relationships. For any manually labeled triple, the distance between its elements in the first set of documents refers to the distance between the positions of the elements of the triple in the first set of documents. Specifically, the distance can be the Euclidean distance or another form of distance; the specific form of distance is not limited in this embodiment. The grammatical relationship between the elements of a manually labeled triple in the first set of documents refers to the grammatical roles of those elements in the first set of documents, examples of which include subject, predicate, object, attributive, adverbial, complement, copula, and predicative. Taking the manually labeled triple "oil field a, 30 million tons, production" as an example again: if the sentence in the first set of documents is "the production of oil field a is 30 million tons", then after the triple is manually labeled, the grammatical relationship of its elements in the first set of documents is also labeled: "production" is labeled as "subject", "oil field a" as "attributive", and "30 million tons" as "object".
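The position and distance features can be sketched as below. The sketch assumes that a position is the character offset at which an element first occurs in a sentence and that the distance is the absolute difference of two offsets (the one-dimensional Euclidean distance); the application leaves both choices open. The function name `element_features` is hypothetical. Grammatical relationships are not computed here, since in the text they are labeled manually.

```python
from itertools import combinations


def element_features(sentence, elements):
    """Return start offsets of each element and pairwise distances between them."""
    positions = {}
    for element in elements:
        offset = sentence.find(element)  # -1 if the element is absent
        if offset >= 0:
            positions[element] = offset
    # Distance between two elements: absolute difference of their start offsets.
    distances = {
        (a, b): abs(positions[a] - positions[b])
        for a, b in combinations(positions, 2)
    }
    return positions, distances


sentence = "The output of oil field a is 30 million tons"
positions, distances = element_features(
    sentence, ["oil field a", "output", "30 million tons"]
)
```

Elements that do not occur in the sentence are simply left out of both results, so a caller can tell a missing element from a zero distance.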

S202: Use training samples to train multiple models separately to obtain multiple models after training.

In this step, the multiple models may include: a naive Bayes model, a support vector machine model (for example, SVM), a word embedding model (for example, word2vec), a recurrent neural network model (for example, RNN), and a long short-term memory network model (for example, LSTM).

Specifically, the training process for any model is in the prior art, and will not be repeated here. In this embodiment, the trained model has the function of identifying triples from documents in the field of petroleum exploration.

The applicant discovered during research that, in the field of petroleum exploration, because models differ in structure, different trained models have different test accuracy for certain types of documents. Therefore, in order to improve the accuracy with which the model recognizes triples, optionally, in this embodiment, first sets of documents of different types can be selected to train the multiple models separately. For any type, each model is trained using the first set of documents of that type, the accuracy of the output results of the multiple models is compared (the minimum loss function value over the iterations can be used as the accuracy score), and the model with the most accurate output (the smallest loss function value) is selected as the model adapted to that type of document.
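The per-type selection described above reduces to taking, for each document type, the model with the smallest loss value. A minimal sketch follows; the function name `select_models`, the document type names, the model names, and the loss values are all hypothetical and chosen only to show the selection.

```python
def select_models(losses_by_type):
    """For each document type, pick the model name with the smallest loss value."""
    return {
        doc_type: min(losses, key=losses.get)
        for doc_type, losses in losses_by_type.items()
    }


# Hypothetical minimum loss values per document type and model.
losses_by_type = {
    "research report": {"naive_bayes": 0.42, "svm": 0.31, "lstm": 0.18},
    "development plan": {"naive_bayes": 0.25, "svm": 0.29, "lstm": 0.33},
}
adapted = select_models(losses_by_type)
```

The resulting mapping from document type to model name is what a later step would consult to pick the target model for a second set of documents of a given type.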

For the process of training the model using any type of the first set of documents, refer to the process shown in Figure 2.

FIG. 3 is another data extraction method provided by an embodiment of this application, including the following steps:

S301. Obtain a manually labeled triplet based on the tags manually added to the characters in the first set of documents.

For the specific implementation principle of this step, please refer to S101, which will not be repeated here.

S302: According to the type of the second group of documents, a model adapted to the type of the second group of documents is selected as the target model.

S303. Input the second set of documents into the target model to obtain the triples identified by the target model from the second set of documents.

S304: Determine to automatically label the triples according to the triples recognized by the target model from the second set of documents.

In this step, the automatically labeled triples are triples that can be used as knowledge data. Specifically, the methods for determining the automatically labeled triples according to the triples identified from the second set of documents by the preset model may include:

The first method: use the triples identified by the target model as the automatically labeled triples.

The second method: mark the target triples among the triples identified by the target model, obtain the manual correction results for the target triples, and use the corrected target triples as the automatically labeled triples.

The target triples are at least one of the following: 1. among the triples identified by the target model from the second set of documents, triples that contradict one another; 2. among the triples identified by the target model from the second set of documents, triples that contradict the manually labeled triples; 3. triples with missing items.

Specifically, among the triples identified by the target model from the second set of documents, contradictory triples are triples that contradict other triples identified by the target model from the second set of documents. For example, suppose the triples identified by the target model from the second set of documents include the two triples "Maximum oil volume is 1000 tons" and "Maximum oil volume is 5000 tons". Because the value of the maximum oil volume is unique, these two triples are contradictory triples.

The third method: use the triples other than the target triples among the triples recognized by the target model, together with the corrected target triples, as the automatically labeled triples.
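The marking of target triples and the third method can be sketched as follows. The sketch assumes that every (entity, relationship) pair has a single valid value, as in the maximum oil volume example, so two different values for the same pair count as a contradiction; relationships that may legitimately take several values would need a different rule. The function names, the entity names such as "well b", and all values are hypothetical.

```python
from collections import defaultdict


def find_target_triples(recognized, manual):
    """Flag recognized triples that need manual correction."""
    targets = set()
    seen = defaultdict(set)          # tails recognized for each (head, relation)
    manual_tails = defaultdict(set)  # tails manually labeled for each (head, relation)
    for head, relation, tail in manual:
        manual_tails[(head, relation)].add(tail)
    for head, relation, tail in recognized:
        if head and relation and tail:
            seen[(head, relation)].add(tail)
    for triple in recognized:
        head, relation, tail = triple
        key = (head, relation)
        if not (head and relation and tail):
            targets.add(triple)  # a missing item
        elif len(seen[key]) > 1:
            targets.add(triple)  # contradicts another recognized triple
        elif key in manual_tails and tail not in manual_tails[key]:
            targets.add(triple)  # contradicts a manually labeled triple
    return targets


def merge_corrections(recognized, targets, corrected):
    """Third method: keep non-target triples and add the corrected target triples."""
    return [t for t in recognized if t not in targets] + list(corrected)


recognized = [
    ("well b", "maximum oil volume", "1000 tons"),
    ("well b", "maximum oil volume", "5000 tons"),
    ("oil field a", "production", "20 million tons"),
    ("oil field c", "thickness", ""),
    ("oil field d", "lithology", "sandstone"),
]
manual = [("oil field a", "production", "30 million tons")]
targets = find_target_triples(recognized, manual)

# Hypothetical manual corrections for the flagged triples.
corrected = [("well b", "maximum oil volume", "5000 tons"),
             ("oil field a", "production", "30 million tons")]
auto_labeled = merge_corrections(recognized, targets, corrected)
```

Under the second method, only `corrected` would be used as the automatically labeled triples; under the third method, the unflagged triples are kept as well.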

S305. Use the manually labeled triples and the automatically labeled triples as knowledge data extracted from the documents.

S306. Save the knowledge data extracted from the document in a preset knowledge graph library.

The specific implementation process of this step is in the prior art, and will not be repeated here.

Optionally, in order to improve the accuracy with which the trained model recognizes triples from documents, the corrected target triples and the second set of documents can also be used as training samples to continue training the trained model, yielding an updated model. When triples are subsequently identified from documents, the updated model is used.

The embodiments of this application have the following beneficial effects:

Beneficial effect 1:

In this embodiment, the manually labeled triples are obtained based on the tags manually added to the characters in the first set of documents, and the automatically labeled triples are determined according to the triples identified from the second set of documents by the preset model. The manually labeled triples and the automatically labeled triples are used as knowledge data extracted from the documents. That is, the knowledge data obtained by the present invention are triples, and a triple is structured data, which is convenient for users to query and use. Therefore, the solution of the present invention can improve the utilization rate of the knowledge data in the documents.

In addition, in this embodiment, both the manually labeled triples and the automatically labeled triples are used as knowledge data. The automatically labeled triples are determined according to the triples identified from the second set of documents by the preset model; the preset model is a preset model adapted to the type of the second set of documents, and the model is obtained by training with the manually labeled triples and the first set of documents as training samples. Since the manually labeled triples are the triples in the training samples and the automatically labeled triples are determined based on the triples obtained by the preset model in the testing process, the knowledge data obtained in this embodiment includes both the triples in the training samples and the triples derived from those obtained in the testing process, which makes the knowledge data obtained in this embodiment more comprehensive.

Beneficial effect 2:

Compared with the prior art, in which triples are manually recognized from documents as knowledge data, the embodiment of the present application uses automatic and semi-automatic methods to recognize triples from documents as knowledge data. Therefore, the speed and efficiency of knowledge data extraction can be improved.

FIG. 4 shows a data extraction device provided by an embodiment of the application, including: a first obtaining module 401, a determining module 402, and an execution module 403.

The first obtaining module 401 is configured to obtain manually labeled triples based on the tags manually added to the characters in the first set of documents. The determining module 402 is configured to determine automatically labeled triples according to the triples identified by the preset model from the second set of documents. The preset model is a preset model adapted to the type of the second set of documents, and the model is obtained by training using training data, where the training data includes the manually labeled triples and the first set of documents. The execution module 403 is configured to use the manually labeled triples and the automatically labeled triples as knowledge data extracted from the documents.

Optionally, the device further includes: a second obtaining module 404 configured to obtain tags manually added to the characters in the first set of documents.

The second obtaining module 404 is specifically configured to: display a list of candidate entity tags based on the operation of manually selecting characters in the first set of documents, where the candidate entity tags are determined according to the business requirements of the field to which the first set of documents and the second set of documents belong; use the tag manually selected from the list of candidate entity tags as the entity tag of the selected characters, to obtain labeled characters; display a list of candidate relationships between entity tags based on the operation of manually selecting labeled characters, where the candidate relationships between entity tags are determined according to the business requirements of the field to which the first set of documents and the second set of documents belong; and use the relationship manually selected from the list of candidate relationships as the relationship tag of the selected labeled characters.
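The labeling sequence described above (select characters, choose an entity tag from a candidate list, select labeled characters, choose a relationship from a candidate list) can be sketched as follows. This is a minimal sketch under stated assumptions: the schema contents in ENTITY_LABELS and RELATIONS, and the names LabeledSpan, label_span, candidate_relations and make_triple, are hypothetical stand-ins for whatever the business requirements of the field would define; no user interface is modeled.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

# Hypothetical domain schema standing in for the "business requirements".
ENTITY_LABELS: List[str] = ["OilField", "Lithology", "Date"]
RELATIONS: Dict[Tuple[str, str], List[str]] = {
    ("OilField", "Lithology"): ["has_reservoir_lithology"],
    ("OilField", "Date"): ["put_into_production_on"],
}


@dataclass(frozen=True)
class LabeledSpan:
    start: int  # character offset of the first selected character
    end: int    # character offset one past the last selected character
    text: str
    label: str


def candidate_entity_labels() -> List[str]:
    """List shown when the annotator selects characters in a document."""
    return list(ENTITY_LABELS)


def label_span(document: str, start: int, end: int, chosen: str) -> LabeledSpan:
    """Attach the manually chosen entity tag to the selected characters."""
    if chosen not in ENTITY_LABELS:
        raise ValueError(f"unknown entity label: {chosen}")
    return LabeledSpan(start, end, document[start:end], chosen)


def candidate_relations(head: LabeledSpan, tail: LabeledSpan) -> List[str]:
    """List shown when the annotator selects two already labeled spans."""
    return list(RELATIONS.get((head.label, tail.label), []))


def make_triple(head: LabeledSpan, tail: LabeledSpan, chosen: str) -> Tuple[str, str, str]:
    """Turn the manually chosen relationship into a manually labeled triple."""
    if chosen not in candidate_relations(head, tail):
        raise ValueError(f"relation not allowed for this label pair: {chosen}")
    return (head.text, chosen, tail.text)
```

Restricting both lists to a fixed schema keeps manual labels consistent across annotators, at the cost of requiring the schema to be agreed on before labeling starts.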

Optionally, the training data further includes at least one of the following: the positions, in the first set of documents, of the elements in the manually labeled triples; the distances, in the first set of documents, between the elements in the manually labeled triples; and the grammatical relationships, in the first set of documents, between the elements in the manually labeled triples.
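One possible way to derive such auxiliary training data is sketched below. It assumes that positions are character offsets and that distance is the number of characters between two elements; the application does not fix either definition. The grammatical relationship is assumed to come from an external parser and is simply passed in. The names Span, span_distance and triple_features are hypothetical.

```python
from typing import Dict, Optional, Tuple

Span = Tuple[int, int]  # (start, end) character offsets, end exclusive


def span_distance(a: Span, b: Span) -> int:
    """Number of characters between two spans; 0 if they touch or overlap."""
    if a[0] > b[0]:
        a, b = b, a  # make a the span that starts first
    return max(0, b[0] - a[1])


def triple_features(head: Span, tail: Span,
                    grammatical_relation: Optional[str] = None) -> Dict[str, object]:
    """Collect position, distance and (optional) grammatical features
    for the head and tail elements of one manually labeled triple."""
    return {
        "head_position": head[0],
        "tail_position": tail[0],
        "distance": span_distance(head, tail),
        # Assumed to be supplied by a separate syntactic parser, if available.
        "grammatical_relation": grammatical_relation,
    }
```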

Optionally, the determining module 402 being configured to determine the automatically labeled triples according to the triples identified by the preset model from the second set of documents includes: the determining module 402 is specifically configured to flag target triples, where a target triple is at least one of the following: among the triples identified by the preset model from the second set of documents, triples that contradict one another; among the triples identified by the preset model from the second set of documents, triples that contradict the manually labeled triples; and triples with missing items. The manual correction result for the target triples is obtained and used as the automatically labeled triples.
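The flagging of target triples for manual correction can be sketched as follows. The application does not define when two triples contradict each other; this sketch assumes, purely for illustration, that two triples contradict when they share the same head and relation but have different tails. The function name flag_target_triples and the category keys are hypothetical.

```python
from collections import defaultdict
from typing import DefaultDict, Dict, List, Optional, Set, Tuple

Triple = Tuple[Optional[str], Optional[str], Optional[str]]


def flag_target_triples(predicted: List[Triple],
                        manual: List[Triple]) -> Dict[str, List[Triple]]:
    """Sort model output into the three categories that need manual review."""
    flagged: Dict[str, List[Triple]] = {
        "missing_item": [],
        "self_contradiction": [],
        "contradicts_manual": [],
    }

    # Triples with an empty or absent element are flagged as missing items.
    complete: List[Triple] = []
    for t in predicted:
        if any(not part for part in t):
            flagged["missing_item"].append(t)
        else:
            complete.append(t)

    # Index tails by (head, relation) for the predictions and the manual set.
    predicted_tails: DefaultDict[tuple, Set[str]] = defaultdict(set)
    for head, rel, tail in complete:
        predicted_tails[(head, rel)].add(tail)
    manual_tails: DefaultDict[tuple, Set[str]] = defaultdict(set)
    for head, rel, tail in manual:
        manual_tails[(head, rel)].add(tail)

    for t in complete:
        key = (t[0], t[1])
        if len(predicted_tails[key]) > 1:  # assumed contradiction rule
            flagged["self_contradiction"].append(t)
        if key in manual_tails and t[2] not in manual_tails[key]:
            flagged["contradicts_manual"].append(t)
    return flagged
```

Under this assumed rule a relation that legitimately takes several values would be flagged as well, so a practical system would likely restrict the check to relations known to be single-valued.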

Optionally, the device further includes: a training module 405. The training module 405 is used to retrain the model using the correction result of the target triplet and the second set of documents.

Optionally, the device further includes: an adapted model determination module 406, configured to use the model with the highest accuracy of the triples identified from the first set of documents in the training process as the model adapted to the type of the second set of documents, where the type of the first set of documents is the same as the type of the second set of documents.
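The selection of the adapted model can be sketched as follows. The application does not state how accuracy is computed; this sketch assumes it is the fraction of a model's identified triples that also appear among the manually labeled triples. The type alias Model and the functions triple_accuracy and select_adapted_model are hypothetical, and each candidate model is represented simply as a callable from document text to a set of triples.

```python
from typing import Callable, Dict, List, Set, Tuple

Triple = Tuple[str, str, str]
Model = Callable[[str], Set[Triple]]  # document text -> identified triples


def triple_accuracy(model: Model, documents: List[str], gold: Set[Triple]) -> float:
    """Share of the model's identified triples that match manual labels
    (an assumed definition of accuracy)."""
    predicted: Set[Triple] = set()
    for doc in documents:
        predicted |= model(doc)
    if not predicted:
        return 0.0
    return len(predicted & gold) / len(predicted)


def select_adapted_model(candidates: Dict[str, Model],
                         documents: List[str],
                         gold: Set[Triple]) -> str:
    """Return the name of the candidate with the highest accuracy on the
    first set of documents; ties go to the first candidate encountered."""
    return max(candidates,
               key=lambda name: triple_accuracy(candidates[name], documents, gold))
```

Because the first and second sets of documents are of the same type, the model that performs best on the first set is taken as the one adapted to the second set.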

The data extraction device includes a processor and a memory. The first acquisition module, the determination module, and the execution module are all stored in the memory as program units, and the processor executes the program units stored in the memory to implement corresponding functions.

The processor contains a kernel, and the kernel calls the corresponding program unit from the memory. One or more kernels may be provided, and the utilization of useful information in the documents is improved by adjusting kernel parameters.

The embodiment of the present invention provides a storage medium on which a program is stored, and the data extraction method is implemented when the program is executed by a processor.

The embodiment of the present invention provides a processor, the processor is used to run a program, wherein the data extraction method is executed when the program is running.

The embodiment of the present invention provides a device. The device includes at least one processor, and at least one memory and a bus connected to the processor; the processor and the memory communicate with each other through the bus; the processor is used to call program instructions to perform the above-mentioned data extraction method. The device herein can be a server, a PC, a PAD, a mobile phone, etc.

This application also provides a computer program product which, when executed on a data processing device, is adapted to execute a program initialized with the following method steps:

According to the triples identified by the preset model from the second set of documents, determine the automatically labeled triples; where the preset model is a preset model adapted to the type of the second set of documents, the model is obtained by training using training data, and the training data includes the manually labeled triples and the first set of documents;

The manually labeled triples and the automatically labeled triples are used as knowledge data extracted from the documents.

The process of obtaining the tags manually added to the characters in the first set of documents includes:

Based on the operation of manually selecting characters in the first set of documents, display a list of candidate entity tags, which are determined according to the business requirements of the field to which the first set of documents and the second set of documents belong;

Based on the operation of manually selecting annotated characters, a list of candidate relationships between entity tags is displayed, and the candidate relationships between entity tags are determined according to the business requirements of the field to which the first set of documents and the second set of documents belong;

The training data also includes at least one of the following: the positions, in the first set of documents, of the elements in the manually labeled triples; the distances, in the first set of documents, between the elements in the manually labeled triples; and the grammatical relationships, in the first set of documents, between the elements in the manually labeled triples.

Determining the automatically labeled triples according to the triples identified by the preset model from the second set of documents includes:

Flag target triples, where a target triple is at least one of the following: among the triples identified by the preset model from the second set of documents, triples that contradict one another; among the triples identified by the preset model from the second set of documents, triples that contradict the manually labeled triples; and triples with missing items;

Obtain the manual correction result for the target triples and use it as the automatically labeled triples.

Using the correction result of the triplet and the second set of documents, the model is retrained.

The type of the first group of documents is the same as the type of the second group of documents;

The process of determining the model adapted to the type of the second set of documents includes: during the training process, the model with the highest accuracy of the triples identified from the first set of documents is used as the model adapted to the type of the second set of documents.

This application is described with reference to flowcharts and/or block diagrams of methods, devices (systems), and computer program products according to embodiments of this application. It should be understood that each process and/or block in the flowchart and/or block diagram, and combinations of processes and/or blocks in the flowchart and/or block diagram, can be implemented by computer program instructions. These computer program instructions can be provided to the processor of a general-purpose computer, a special-purpose computer, an embedded processor, or other programmable data processing equipment to produce a machine, so that the instructions executed by the processor of the computer or other programmable data processing equipment produce a device that realizes the functions specified in one or more processes in the flowchart and/or one or more blocks in the block diagram.

In a typical configuration, the device includes one or more processors (CPUs), memory, and buses. The equipment may also include input/output interfaces, network interfaces, etc., as shown in Figure 5.

The memory may include non-permanent memory in a computer-readable medium, random access memory (RAM) and/or non-volatile memory, such as read-only memory (ROM) or flash memory (flash RAM), and the memory includes at least one memory chip. The memory is an example of a computer-readable medium.

Computer-readable media include permanent and non-permanent, removable and non-removable media, and information storage can be realized by any method or technology. The information can be computer-readable instructions, data structures, program modules, or other data. Examples of computer storage media include, but are not limited to, phase change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, CD-ROM, digital versatile disc (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission media that can be used to store information accessible by computing devices. As defined herein, computer-readable media do not include transitory media, such as modulated data signals and carrier waves.

It should also be noted that the terms "include", "comprise" or any other variants thereof are intended to cover non-exclusive inclusion, so that a process, method, commodity or device that includes a series of elements includes not only those elements, but also other elements that are not explicitly listed, or elements inherent to such a process, method, commodity, or device. Without further restrictions, an element defined by the phrase "including a..." does not exclude the existence of other identical elements in the process, method, commodity or device that includes the element.

Those skilled in the art should understand that the embodiments of the present application can be provided as a method, a system, or a computer program product. Therefore, this application may adopt the form of a complete hardware embodiment, a complete software embodiment, or an embodiment combining software and hardware. Moreover, this application may adopt the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, optical storage, etc.) containing computer-usable program codes.

The above are only examples of the application, and are not used to limit the application. For those skilled in the art, this application can have various modifications and changes. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of this application shall be included in the scope of the claims of this application.

Claims

1. A data extraction method, characterized in that it comprises:

Based on the tags manually added to the characters in the first set of documents, obtain the manually labeled triples;

According to the triples identified by the preset model from the second set of documents, determine automatically labeled triples; wherein the preset model is a preset model adapted to the type of the second set of documents, the model is obtained by training using training data, and the training data includes the manually labeled triples and the first set of documents;

The manually labeled triples and the automatically labeled triples are used as knowledge data extracted from the documents.
2. The method according to claim 1, wherein the process of obtaining the tags manually added to the characters in the first set of documents comprises:

Based on the operation of manually selecting characters in the first set of documents, a list of candidate entity tags is displayed, the candidate entity tags being determined according to the business requirements of the field to which the first set of documents and the second set of documents belong;

Use the label manually selected from the list of to-be-selected entity labels as the entity label of the selected character to obtain the marked character;

Based on the operation of manually selecting the marked characters, a list of to-be-selected relationships between entity tags is displayed, the to-be-selected relationships between the entity tags being determined according to the business requirements of the field to which the first set of documents and the second set of documents belong;

The relationship manually selected from the list of to-be-selected relationships is used as the relationship label of the selected labeled character.
3. The method according to claim 1, wherein the training data further comprises at least one of the following:

The positions, in the first set of documents, of the elements in the manually labeled triples; the distances, in the first set of documents, between the elements in the manually labeled triples; and the grammatical relationships, in the first set of documents, between the elements in the manually labeled triples.
4. The method according to claim 1, wherein determining the automatically labeled triples according to the triples identified by the preset model from the second set of documents comprises:

Flag target triples, where a target triple is at least one of the following: among the triples identified by the preset model from the second set of documents, triples that contradict one another; among the triples identified by the preset model from the second set of documents, triples that contradict the manually labeled triples; and triples with missing items;

Obtain a manual correction result for the target triples and use it as the automatically labeled triples.
5. The method according to claim 4, wherein the correction result of the target triple and the second set of documents are used to retrain the model.
6. The method according to claim 1, wherein the type of the first group of documents is the same as the type of the second group of documents;

The process of determining the model adapted to the type of the second set of documents includes:

In the training process, the model with the highest accuracy of the triples identified from the first set of documents is used as a model adapted to the type of the second set of documents.
7. A data extraction device, characterized in that it comprises:

The first obtaining module is configured to obtain manually labeled triples based on the tags manually added to the characters in the first set of documents;

The determining module is configured to determine automatically labeled triples according to the triples identified by the preset model from the second set of documents; wherein the preset model is a preset model adapted to the type of the second set of documents, the model is obtained by training using training data, and the training data includes the manually labeled triples and the first set of documents;

The execution module is configured to use the manual labeling triples and the automatic labeling triples as knowledge data extracted from the document.
8. The device according to claim 7, further comprising: a second obtaining module, configured to obtain the tags manually added to the characters in the first set of documents;

The second obtaining module is specifically configured to display a list of candidate entity tags based on an operation of manually selecting characters in the first set of documents, the candidate entity tags being determined according to the business requirements of the field to which the first set of documents and the second set of documents belong;

Use the label manually selected from the list of to-be-selected entity labels as the entity label of the selected character to obtain the marked character;

Based on the operation of manually selecting the marked characters, a list of to-be-selected relationships between entity tags is displayed, the to-be-selected relationships between the entity tags being determined according to the business requirements of the field to which the first set of documents and the second set of documents belong;

The relationship manually selected from the list of to-be-selected relationships is used as the relationship label of the selected labeled character.
9. A storage medium, characterized in that the storage medium includes a stored program, wherein the program executes the data extraction method according to any one of claims 1 to 6.
10. A device, characterized by comprising: a processor, a memory, and a bus; the processor and the memory are connected through the bus;

The memory is used to store a program, and the processor is used to run a program, wherein the data extraction method according to any one of claims 1 to 6 is executed when the program is running.