CN110909544A - Data processing method and device - Google Patents

Data processing method and device

Info

Publication number
CN110909544A
CN110909544A CN201911143298.0A
Authority
CN
China
Prior art keywords
text
words
question
common
question sentence
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201911143298.0A
Other languages
Chinese (zh)
Inventor
韩庆宏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Shannon Huiyu Technology Co Ltd
Original Assignee
Beijing Shannon Huiyu Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Shannon Huiyu Technology Co Ltd filed Critical Beijing Shannon Huiyu Technology Co Ltd
Priority to CN201911143298.0A priority Critical patent/CN110909544A/en
Publication of CN110909544A publication Critical patent/CN110909544A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/332Query formulation
    • G06F16/3329Natural language query formulation or dialogue systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Theoretical Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Human Computer Interaction (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Machine Translation (AREA)

Abstract

The invention provides a data processing method and a data processing device. The method comprises the following steps: acquiring a text and a word needing coreference resolution; generating a question sentence according to the word, and finding out, from the text, characters capable of answering the question sentence as co-referents of the word; and extracting the co-referents by using a candidate text extractor, thereby completing the coreference resolution of the word. With the data processing method and the data processing device, the co-referents of the word can be found in the text in a question-and-answer manner, which greatly improves the accuracy of coreference resolution.

Description

Data processing method and device
Technical Field
The invention relates to the technical field of computers, in particular to a data processing method and device.
Background
Currently, to avoid repetition, it is customary in text to use pronouns, referring expressions and abbreviations to refer to words mentioned earlier. For example, a text may begin with "Harbin Institute of Technology" and later use "HIT" and "Gongda", or refer to "this university", "it", and so on; this phenomenon is called coreference. It is very difficult for a computer performing natural language processing to recognize coreferring words in text. Only by performing coreference resolution on the text can the computer identify the words involved in a coreference. Coreference resolution is the task of finding, in the text, all expressions that refer to the same word.
In the related art, coreference resolution methods often obtain their results by similarity comparison of tuples, which leads to low accuracy of coreference resolution.
Disclosure of Invention
In order to solve the above problem, embodiments of the present invention provide a data processing method and apparatus.
In a first aspect, an embodiment of the present invention provides a data processing method, including:
acquiring a text and a word needing coreference resolution;
generating a question sentence according to the word, and finding out, from the text, characters capable of answering the question sentence as co-referents of the word;
and extracting the co-referents by using a candidate text extractor, to complete the coreference resolution of the word.
In a second aspect, an embodiment of the present invention further provides a data processing apparatus, including:
the acquisition module is used for acquiring the text and the word needing coreference resolution;
the processing module is used for generating a question sentence according to the word, and finding out, from the text, characters capable of answering the question sentence as co-referents of the word;
and the extraction module is used for extracting the co-referents by using a candidate text extractor, to complete the coreference resolution of the word.
In a third aspect, the present invention further provides a computer-readable storage medium, where a computer program is stored on the computer-readable storage medium, and when the computer program is executed by a processor, the computer program performs the steps of the method in the first aspect.
In a fourth aspect, embodiments of the present invention also provide a data processing apparatus, which includes a memory, a processor, and one or more programs, where the one or more programs are stored in the memory and configured to be executed by the processor to perform the steps of the method according to the first aspect.
In the solutions provided in the foregoing first to fourth aspects of the embodiments of the present invention, a question sentence is generated according to the acquired word, and characters capable of answering the question sentence are found in the text as co-referents of the word. Compared with the related-art method of performing coreference resolution based on tuple similarity comparison, the method and the apparatus find the co-referents of the word in the text in a question-and-answer manner, thereby greatly improving the accuracy of coreference resolution.
In order to make the aforementioned and other objects, features and advantages of the present invention comprehensible, preferred embodiments accompanied with figures are described in detail below.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to the drawings without creative efforts.
Fig. 1 is a flowchart illustrating a data processing method according to embodiment 1 of the present invention;
fig. 2 is a schematic structural diagram of a data processing apparatus according to embodiment 2 of the present invention;
fig. 3 is a schematic structural diagram of another data processing apparatus provided in embodiment 3 of the present invention.
Detailed Description
Currently, to avoid repetition, it is customary in text to use pronouns, referring expressions and abbreviations to refer to words mentioned earlier. For example, a text may begin with "Harbin Institute of Technology" and later use "HIT" and "Gongda", or refer to "this university", "it", and so on; this phenomenon is called coreference. It is very difficult for a computer performing natural language processing to recognize coreferring words in text. Only by performing coreference resolution on the text can the computer identify the words involved in a coreference. Coreference resolution is the task of finding, in the text, all expressions that refer to the same word. In the related art, coreference resolution methods often obtain their results by similarity comparison of tuples, which leads to low accuracy of coreference resolution.
Based on this, the present embodiment provides a data processing method and device that generate question sentences from the word needing coreference resolution and find out, from the text, characters capable of answering those question sentences as the co-referents of the word, thereby greatly improving the accuracy of coreference resolution.
In order to make the aforementioned objects, features and advantages of the present application more comprehensible, the present application is described in further detail with reference to the accompanying drawings and the detailed description.
Example 1
The present embodiment provides a data processing method that is executed by a server.
The server may adopt any computing device in the prior art that can process the text according to the words and perform coreference resolution on the words, and the details are not repeated here.
Referring to a flowchart of a data processing method shown in fig. 1, the present embodiment provides a data processing method, including the following specific steps:
and step 100, acquiring the text and the words needing coreference resolution.
In step 100, the text may be a text input into the server by an operator.
In one embodiment, the text may be: "Donald Trump, born June 8, 1946 in New York, is a Republican politician, entrepreneur and businessman, and the 45th President of the United States. … Trump launched the China-US trade war; … the Trump administration announced a 10% tariff on 200 billion dollars of goods imported from China, which formally took effect on September 24, 2018, and the tariff rate was raised to 25% in 2019 … Trump … he …".
After reading the above text, one finds that if "Donald Trump" is taken as the word, the co-referents of "Donald Trump" in the text include, but are not limited to: "Donald Trump", "the 45th President of the United States", "Chuanpu" (川普, a transliterated alias of Trump), "Trump" and "he".
To enable the server to resolve the coreferences of "Donald Trump", the operator may input "Donald Trump" as the word into the server, so that the server finds all the co-referents of "Donald Trump" in the text, thereby completing the coreference resolution of "Donald Trump".
Step 102: generate a question sentence according to the word, and find out, from the text, characters capable of answering the question sentence as co-referents of the word.
In order to find the co-referents of the word in the text, step 102 may be implemented as the following steps (1) to (4):
(1) acquiring a question template, filling the word into the question template, and generating a question sentence related to the word;
(2) splicing the question sentence with the characters in the text to obtain a spliced text;
(3) processing the spliced text by using a pre-training model (BERT) to obtain a vector representation of each character in the spliced text;
(4) finding out, from the spliced text, characters capable of answering the question sentence as co-referents of the word.
In step (1), the question template is cached in the server and stores question frame sentences that prompt the server to find the co-referents of the word in the text.
A question frame sentence is an incomplete question sentence containing a blank to be filled. For example, the question frame sentence may be, but is not limited to: "Which words in the text refer to ()?" or "All the pronouns of () refer to which words?".
Therefore, by filling the word needing coreference resolution into the blank "()" of a question frame sentence in the question template, a question sentence related to the word can be generated.
In one embodiment, when the word needing coreference resolution is "Donald Trump", the question sentence obtained by filling "Donald Trump" into the question frame sentence "All the pronouns of () refer to which words?" is: "All the pronouns of Donald Trump refer to which words?".
As can be seen from the description of step (1), the word needing coreference resolution can be filled into the question template to generate a question sentence related to the word, so that different words can each be coreference-resolved; the method is flexible, convenient to operate and interpretable.
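The template filling of step (1) can be sketched as follows. The template wording and the function name are illustrative assumptions (an English rendering of a question frame sentence), not the patent's actual implementation:

```python
# Minimal sketch of step (1): filling the word needing coreference
# resolution into the blank "()" of a question frame sentence.
# The default template text is an assumed example, not the real one.

def build_question(word: str,
                   template: str = "Which words in the text refer to ()?") -> str:
    """Fill the word into the blank of the question frame sentence."""
    return template.replace("()", word)

# e.g. build_question("Donald Trump")
# -> "Which words in the text refer to Donald Trump?"
```

Swapping in a different frame sentence for `template` yields a different question for the same word, which is what makes the scheme flexible across words and templates.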
In step (2), the server may splice the question sentence with the characters in the text by any existing character-splicing method to obtain the spliced text, which is not described in detail here.
In step (3), the pre-training model BERT runs in the server.
The process by which the server uses BERT to obtain the vector representation of each character in the spliced text is known in the art and is not repeated here.
The characters may be, but are not limited to: single characters, words and phrases.
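Steps (2) and (3) can be illustrated with the following sketch. It shows only the splicing, using BERT's conventional two-segment input layout ([CLS] question [SEP] text [SEP]) at character level; the actual BERT pass that maps each position to a vector is omitted, and the marker strings are the standard BERT conventions rather than anything the patent specifies:

```python
# Sketch of step (2): splicing the question sentence with the characters
# of the text in the two-segment format BERT conventionally consumes.
# Step (3) would then run BERT over this sequence to obtain one vector
# per position; that part is not reproduced here.

def splice(question: str, text: str) -> list:
    """Concatenate question and text character by character with BERT markers."""
    return ["[CLS]"] + list(question) + ["[SEP]"] + list(text) + ["[SEP]"]
```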
Specifically, step (4) may be performed as follows:
processing the vector representation of each character in the spliced text by using a machine reading comprehension model, and finding out, from the characters of the text, characters capable of answering the question sentence as co-referents of the word.
The machine reading comprehension model runs in the server.
Here, using the machine reading comprehension model to find, from the vector representations of the characters in the spliced text, the characters capable of answering the question sentence amounts to finding, in the text portion of the spliced text, the answer to the question sentence that the spliced text contains. That is, the co-referents of the word are extracted from the text in a question-and-answer manner. The specific processing of the machine reading comprehension model is known in the art and is not described in detail here.
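The span selection in step (4) can be sketched as generic extractive question answering: a reading comprehension head scores every position as a possible answer start and as a possible answer end, and the best-scoring span is taken as the answer. The score lists below stand in for a trained model's logits; this is an assumed textbook formulation, not the patent's exact model:

```python
# Sketch of step (4): pick the answer span (start, end) that maximizes
# start_score[i] + end_score[j] with j >= i and a bounded span length.
# In practice the scores would come from a model head applied to the
# per-character BERT vectors; here they are plain numbers.

def best_span(start_scores, end_scores, max_len=10):
    """Return (start, end) indices of the highest-scoring valid span."""
    best = (0, 0, float("-inf"))
    for i, s in enumerate(start_scores):
        for j in range(i, min(i + max_len, len(end_scores))):
            score = s + end_scores[j]
            if score > best[2]:
                best = (i, j, score)
    return best[0], best[1]
```

The characters between the chosen start and end positions would then be read off the spliced text as the co-referent.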
As can be seen from steps (1) to (4), a question-answering framework based on a machine reading comprehension model is used: a question sentence is generated from the word needing coreference resolution, and the machine reading comprehension model uses this question sentence to find, in the spliced text, characters capable of answering it as co-referents of the word. By skillfully exploiting the question-and-answer mechanism of natural language, the co-referents of the word can be extracted from the text more accurately. Moreover, because the text and the question sentence are processed by a pre-training model and a machine reading comprehension model, both at the forefront of natural language processing, the accuracy of extracting the co-referents of the word from the text can be further improved, achieving an optimal effect.
Step 104: extract the co-referents by using a candidate text extractor, completing the coreference resolution of the word.
In step 104, the candidate text extractor can be regarded as a sequence labeling model that uses BIEO tags, where B, I, E and O respectively mark the start position of a co-referent, a middle position of a co-referent, the end position of a co-referent, and a character not belonging to any co-referent.
After receiving the spliced text, the sequence labeling model encodes the characters in the spliced text and assigns each character one of the tags B, I, E or O, so that the co-referents of the word can be extracted. The specific process is known in the art and is not repeated here.
For example, after the candidate text extractor encodes the sentence "Trump launched the China-US trade war" (川普发动中美贸易战) in the text, the result of tagging each character with BIEO is "川/B 普/E 发/O 动/O 中/O 美/O 贸/O 易/O 战/O". "Chuanpu" (川普) is thus labeled "BE", the start position and the end position of the answer, with no tag "O" appearing in between, so "Chuanpu" is a legal co-referent. Note that the process of extracting co-referents also needs to check the validity of the labeling. In a legal labeling, the characters between any pair of "B … E" tags may carry no tag other than "I"; sequences such as "BOE" and "BBE" are illegal. In other words, a legal labeling must have the form "B I … I E", where the number of "I" tags is zero or more.
The process of extracting co-referents from the other sentences of the text is similar to the process of extracting them from the sentence "Trump launched the China-US trade war" and is not repeated here.
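The validity rule described above, that a legal labeling must have the form "B I … I E" with zero or more "I" tags, can be checked mechanically. A minimal sketch, with the function name chosen for illustration:

```python
# Check whether a candidate span's BIEO tag sequence is legal, i.e.
# matches the pattern B I* E: one B, zero or more I, one E, nothing else.
# "BOE" and "BBE" therefore fail, as described in the text above.

import re

def is_legal_span(tags: str) -> bool:
    """Return True if the tag string has the form B I* E."""
    return re.fullmatch(r"BI*E", tags) is not None
```

Under this rule, "BE" and "BIIE" are legal labelings while "BOE" and "BBE" are rejected.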
In summary, the present embodiment provides a data processing method in which a question sentence is generated according to the acquired word, and characters capable of answering the question sentence are found in the text as co-referents of the word. Compared with the related-art method of performing coreference resolution based on tuple similarity comparison, this method finds the co-referents of the word in the text in a question-and-answer manner, thereby greatly improving the accuracy of coreference resolution.
Example 2
The present embodiment proposes a data processing apparatus for executing the data processing method.
Referring to a schematic structural diagram of a data processing apparatus shown in fig. 2, the present embodiment provides a data processing apparatus, including:
an obtaining module 200, configured to obtain the text and the word needing coreference resolution;
a processing module 202, configured to generate a question sentence according to the word, and find out, from the text, characters capable of answering the question sentence as co-referents of the word;
and an extraction module 204, configured to extract the co-referents by using a candidate text extractor, to complete the coreference resolution of the word.
Specifically, in order to find the co-referents of the word in the spliced text, the processing module is specifically configured to:
acquiring a question template, filling the words into the question template, and generating question sentences related to the words;
splicing the question sentence with characters in the text to obtain a spliced text;
processing the spliced text by using a pre-training model BERT to obtain vector representation of each character in the spliced text;
and finding out characters capable of answering the question sentence from the spliced text as common referents of the words.
Specifically, the operation in which the extraction module finds out, from the spliced text, characters capable of answering the question sentence as co-referents of the word includes:
processing the vector representation of each character in the spliced text by using a machine reading comprehension model, and finding out, from the characters of the text, characters capable of answering the question sentence as co-referents of the word.
It can be seen from the above description that a question-answering framework based on a machine reading comprehension model is used: a question sentence is generated from the word needing coreference resolution, and the machine reading comprehension model uses this question sentence to find, in the spliced text, characters capable of answering it as co-referents of the word. By skillfully exploiting the question-and-answer mechanism of natural language, the co-referents of the word can be extracted from the text more accurately. Moreover, because the text and the question sentence are processed by a pre-training model and a machine reading comprehension model, both at the forefront of natural language processing, the accuracy of extracting the co-referents of the word from the text can be further improved, achieving an optimal effect.
In summary, the data processing apparatus provided in this embodiment generates a question sentence according to the acquired word and finds, in the text, characters capable of answering the question sentence as co-referents of the word. Compared with the related-art method of performing coreference resolution based on tuple similarity comparison, the apparatus finds the co-referents of the word in the text in a question-and-answer manner, thereby greatly improving the accuracy of coreference resolution.
Example 3
The present embodiment proposes a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, performs the steps of the data processing method described in embodiment 1 above. For specific implementation, refer to method embodiment 1, which is not described herein again.
In addition, referring to another schematic structural diagram of the data processing apparatus shown in fig. 3, the present embodiment further provides a data processing apparatus, which includes a bus 51, a processor 52, a transceiver 53, a bus interface 54, a memory 55, and a user interface 56. The data processing means comprise a memory 55.
In this embodiment, the data processing apparatus further includes: one or more programs stored on the memory 55 and executable on the processor 52, configured to be executed by the processor for performing the following steps (1) to (3):
(1) acquiring a text and words needing coreference resolution;
(2) generating a question sentence according to the words, and finding out characters capable of answering the question sentence from the text as common referents of the words;
(3) and extracting the co-reference words by using a candidate text extractor, and completing the co-reference resolution of the words.
A transceiver 53 for receiving and transmitting data under the control of the processor 52.
Fig. 3 shows a bus architecture (represented by bus 51). The bus 51 may include any number of interconnected buses and bridges, and links together various circuits, including one or more processors represented by the general-purpose processor 52 and memory represented by the memory 55. The bus 51 may also link various other circuits, such as peripherals, voltage regulators and power management circuits, which are well known in the art and therefore are not described further in this embodiment. A bus interface 54 provides an interface between the bus 51 and the transceiver 53. The transceiver 53 may be one element or multiple elements, such as multiple receivers and transmitters, providing a means for communicating with various other apparatuses over a transmission medium. For example, the transceiver 53 receives external data from other devices and transmits data processed by the processor 52 to other devices. Depending on the nature of the computing system, a user interface 56, such as a keypad, display, speaker, microphone or joystick, may also be provided.
The processor 52 is responsible for managing the bus 51 and the usual processing, running a general-purpose operating system as described above. And memory 55 may be used to store data used by processor 52 in performing operations.
Alternatively, processor 52 may be, but is not limited to: a central processing unit, a singlechip, a microprocessor or a programmable logic device.
It will be appreciated that the memory 55 in embodiments of the invention may be volatile memory or nonvolatile memory, or may include both. The non-volatile memory may be a Read-Only Memory (ROM), a Programmable ROM (PROM), an Erasable PROM (EPROM), an Electrically Erasable PROM (EEPROM), or a flash memory. The volatile memory may be a Random Access Memory (RAM), which acts as an external cache. By way of example and not limitation, many forms of RAM are available, such as Static RAM (SRAM), Dynamic RAM (DRAM), Synchronous DRAM (SDRAM), Double Data Rate SDRAM (DDR SDRAM), Enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), and Direct Rambus RAM (DRRAM). The memory 55 of the systems and methods described in this embodiment is intended to comprise, without being limited to, these and any other suitable types of memory.
In some embodiments, memory 55 stores the following elements, executable modules or data structures, or a subset thereof, or an expanded set thereof: an operating system 551 and application programs 552.
The operating system 551 includes various system programs, such as a framework layer, a core library layer, a driver layer, and the like, for implementing various basic services and processing hardware-based tasks. The application 552 includes various applications, such as a Media Player (Media Player), a Browser (Browser), and the like, for implementing various application services. A program implementing the method of an embodiment of the present invention may be included in the application 552.
In summary, with the computer-readable storage medium and the data processing apparatus provided in this embodiment, a question sentence is generated according to the acquired word, and characters capable of answering the question sentence are found in the text as co-referents of the word. Compared with the related-art method of performing coreference resolution based on tuple similarity comparison, this approach finds the co-referents of the word in the text in a question-and-answer manner, thereby greatly improving the accuracy of coreference resolution.
The above description is only for the specific embodiments of the present invention, but the scope of the present invention is not limited thereto, and any person skilled in the art can easily conceive of the changes or substitutions within the technical scope of the present invention, and all the changes or substitutions should be covered within the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (8)

1. A data processing method, comprising:
acquiring a text and a word needing coreference resolution;
generating a question sentence according to the word, and finding out, from the text, characters capable of answering the question sentence as co-referents of the word;
and extracting the co-referents by using a candidate text extractor, to complete the coreference resolution of the word.
2. The method of claim 1, wherein generating a question sentence according to the word, and finding out, from the text, characters capable of answering the question sentence as co-referents of the word, comprises:
acquiring a question template, filling the words into the question template, and generating question sentences related to the words;
splicing the question sentence with characters in the text to obtain a spliced text;
processing the spliced text by using a pre-training model BERT to obtain vector representation of each character in the spliced text;
and finding out characters capable of answering the question sentence from the spliced text as common referents of the words.
3. The method of claim 2, wherein finding out, from the spliced text, characters capable of answering the question sentence as co-referents of the word comprises:
processing the vector representation of each character in the spliced text by using a machine reading comprehension model, and finding out, from the characters of the text, characters capable of answering the question sentence as co-referents of the word.
4. A data processing apparatus, comprising:
the acquisition module is used for acquiring the text and the word needing coreference resolution;
the processing module is used for generating a question sentence according to the word, and finding out, from the text, characters capable of answering the question sentence as co-referents of the word;
and the extraction module is used for extracting the co-referents by using a candidate text extractor, to complete the coreference resolution of the word.
5. The apparatus according to claim 4, wherein the processing module is specifically configured to:
acquiring a question template, filling the words into the question template, and generating question sentences related to the words;
splicing the question sentence with characters in the text to obtain a spliced text;
processing the spliced text by using a pre-training model BERT to obtain vector representation of each character in the spliced text;
and finding out characters capable of answering the question sentence from the spliced text as common referents of the words.
6. The apparatus of claim 5, wherein the extraction module being configured to find out, from the spliced text, characters capable of answering the question sentence as co-referents of the word includes:
processing the vector representation of each character in the spliced text by using a machine reading comprehension model, and finding out, from the characters of the text, characters capable of answering the question sentence as co-referents of the word.
7. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the method according to any one of the claims 1-3.
8. A data processing apparatus comprising a memory, a processor and one or more programs, wherein the one or more programs are stored in the memory and configured to cause the processor to perform the steps of the method of any of claims 1-3.
CN201911143298.0A 2019-11-20 2019-11-20 Data processing method and device Pending CN110909544A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911143298.0A CN110909544A (en) 2019-11-20 2019-11-20 Data processing method and device

Publications (1)

Publication Number Publication Date
CN110909544A true CN110909544A (en) 2020-03-24

Family

ID=69816681

Country Status (1)

Country Link
CN (1) CN110909544A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113761920A (en) * 2020-06-05 2021-12-07 Beijing Kingsoft Digital Entertainment Technology Co Ltd Word processing method and device based on a dual-task model

Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2012220666A (en) * 2011-04-07 2012-11-12 Nippon Telegr & Teleph Corp <Ntt> Reading comprehension question answering device, method, and program
US20140257792A1 (en) * 2013-03-11 2014-09-11 Nuance Communications, Inc. Anaphora Resolution Using Linguisitic Cues, Dialogue Context, and General Knowledge
CN105589844A (en) * 2015-12-18 2016-05-18 Beijing Zhongke Huilian Technology Co Ltd Missing semantic supplementing method for multi-round question-answering system
CN106462607A (en) * 2014-05-12 2017-02-22 Google Inc Automated reading comprehension
US20170351663A1 (en) * 2016-06-03 2017-12-07 Maluuba Inc. Iterative alternating neural attention for machine reading
CN107766320A (en) * 2016-08-23 2018-03-06 ZTE Corp Chinese pronoun resolution model building method and device
CN108491421A (en) * 2018-02-07 2018-09-04 Beijing Baidu Netcom Science and Technology Co Ltd Method, apparatus, device and computer storage medium for generating questions and answers
CN109947912A (en) * 2019-01-25 2019-06-28 Sichuan University Model method based on intra-paragraph reasoning and joint question-answer matching
CN110188362A (en) * 2019-06-10 2019-08-30 Beijing Baidu Netcom Science and Technology Co Ltd Text processing method and device
CN113627147A (en) * 2021-08-18 2021-11-09 Shanghai Minglue Artificial Intelligence (Group) Co Ltd Entity alignment method and device based on multi-round reading comprehension

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
WEI WU: "Coreference Resolution as Query-based Span Prediction", IEEE, pages 1-6 *
LIU YUJIANG; FU LIJUN; LIU JUNMING; LYU PENGFEI: "Anaphora resolution algorithm based on a multi-layer attention mechanism", Computer Engineering, no. 02 *
LI YING et al.: "Coreference resolution in interactive question-answering text based on centering theory and discourse structure", Journal of Chinese Information Processing *
LI YING et al.: "Coreference resolution in interactive question-answering text based on centering theory and discourse structure", Journal of Chinese Information Processing, no. 04, 15 July 2016 (2016-07-15) *

Similar Documents

Publication Publication Date Title
CN112417102B (en) Voice query method, device, server and readable storage medium
US10417335B2 (en) Automated quantitative assessment of text complexity
US20170286376A1 (en) Checking Grammar Using an Encoder and Decoder
Mori Word-based partial annotation for efficient corpus construction
CN114036300A (en) Language model training method and device, electronic equipment and storage medium
US11327971B2 (en) Assertion-based question answering
CN110750977A (en) Text similarity calculation method and system
EP2447854A1 (en) Method and system of automatic diacritization of Arabic
CN116861242A (en) Language perception multi-language pre-training and fine tuning method based on language discrimination prompt
KR20240006688A (en) Correct multilingual grammar errors
Dong et al. Revisit input perturbation problems for llms: A unified robustness evaluation framework for noisy slot filling task
Kubis et al. Open challenge for correcting errors of speech recognition systems
CN112395866B (en) Customs clearance sheet data matching method and device
CN110909544A (en) Data processing method and device
US11416556B2 (en) Natural language dialogue system perturbation testing
CN115640810A (en) Method, system and storage medium for identifying communication sensitive information of power system
Zayyan et al. Automatic diacritics restoration for modern standard Arabic text
Wan et al. IBM research at the CoNLL 2018 shared task on multilingual parsing
CN112530406A (en) Voice synthesis method, voice synthesis device and intelligent equipment
Kim et al. How to utilize syllable distribution patterns as the input of LSTM for Korean morphological analysis
Reynolds et al. Automatic word stress annotation of Russian unrestricted text
Lazareva et al. Technology for mastering russian vocabulary by chinese students in the field of international trade
Rahman et al. Dense word representation utilization in Indonesian dependency parsing
CN117077664B (en) Method and device for constructing text error correction data and storage medium
CN110866390B (en) Method and device for recognizing Chinese grammar error, computer equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination