CN112287077A

CN112287077A - Statement extraction method and device for combining RPA and AI for document, storage medium and electronic equipment

Info

Publication number: CN112287077A
Application number: CN202011148016.9A
Authority: CN
Inventors: 段沛宸; 张海雷; 胡一川; 汪冠春
Original assignee: Beijing Benying Network Technology Co Ltd; Beijing Laiye Network Technology Co Ltd
Current assignee: Beijing Benying Network Technology Co Ltd; Beijing Laiye Network Technology Co Ltd
Priority date: 2019-12-09
Filing date: 2020-10-23
Publication date: 2021-01-29

Abstract

The invention provides a sentence extraction method and device for a document that combine RPA and AI, a storage medium and electronic equipment. The method comprises: carrying out Natural Language Processing (NLP) on a document to obtain an initial question sentence and an initial answer sentence from the content of the document, wherein the initial question sentence corresponds to the initial answer sentence, and the initial question sentence is generated according to a set question sentence in the content of the document; and respectively executing target processing on the initial question sentence and the initial answer sentence to obtain question-answer pairs and outputting the question-answer pairs for text recognition. With the method and the device, sentence extraction is not limited by the document structure, standard question sentences and standard answer sentences can be extracted automatically from the content of the document, the sentence extraction effect is improved, and the acquisition of question-answer pairs is effectively assisted.

Description

Statement extraction method and device for combining RPA and AI for document, storage medium and electronic equipment

Technical Field

The present invention relates to the field of natural language processing technologies, and in particular to a sentence extraction method and apparatus for a document that combine RPA (Robotic Process Automation) and AI (Artificial Intelligence), a storage medium, and an electronic device.

Background

Robotic Process Automation (RPA) uses specific robot software to simulate human operations on a computer and to execute process tasks automatically according to rules.

Artificial Intelligence (AI) is a technical science that studies and develops theories, methods, techniques and application systems for simulating, extending and expanding human Intelligence.

In the related art, in a Natural Language Processing (NLP) application scenario in the field of computer technology, a manual or semi-manual method is usually adopted to extract a title, a central sentence, and the like from a structured document, and then a manual rewriting method is adopted to obtain a standard question and answer in the document, so as to assist in obtaining a corresponding question and answer pair subsequently, which is used as a corpus in the field of intelligent question and answer.

In this way, the extraction of the sentences is easily limited by the document structure, and more manual auxiliary operations are required, so that the extraction efficiency of the sentences is not high, and the extraction effect is not good.

Disclosure of Invention

The present invention is directed to solving, at least to some extent, one of the technical problems in the related art.

Therefore, the invention aims to provide a sentence extraction method and device for a document that combine RPA and AI, a storage medium and electronic equipment, which can avoid the limitation of the document structure when extracting sentences, automatically extract standard question sentences and standard answer sentences from the content of the document, improve the sentence extraction effect, and effectively assist in obtaining question-answer pairs.

In order to achieve the above object, an embodiment of the first aspect of the present invention provides a sentence extraction method combining an RPA and an AI for a document, including: performing Natural Language Processing (NLP) on a document to acquire an initial question sentence and an initial answer sentence from the content of the document, wherein the initial question sentence corresponds to the initial answer sentence, and the initial question sentence is generated according to a set question sentence in the content of the document; and respectively executing target processing on the initial question sentence and the initial answer sentence to obtain question-answer pairs and outputting the question-answer pairs for text recognition.

In the sentence extraction method combining the RPA and the AI for the document according to the embodiment of the first aspect of the present invention, the document is subjected to the natural language processing NLP to obtain the initial question sentence and the initial answer sentence from the content of the document, the initial question sentence corresponds to the initial answer sentence, the initial question sentence is generated according to the question setting sentence in the content of the document, and the target processing is respectively performed on the initial question sentence and the initial answer sentence, so as to obtain the question-answer pair and output the question-answer pair for text recognition.

In order to achieve the above object, an embodiment of the second aspect of the present invention provides a sentence extraction device for a document combining RPA and AI, including: an acquisition module, configured to perform Natural Language Processing (NLP) on a document to acquire an initial question sentence and an initial answer sentence from the content of the document, where the initial question sentence corresponds to the initial answer sentence, and the initial question sentence is generated according to a set question sentence in the content of the document; and an execution module, configured to perform target processing on the initial question sentence and the initial answer sentence respectively, so as to obtain question-answer pairs and output the question-answer pairs for text recognition.

The sentence extraction device for a document combining RPA and AI according to the embodiment of the second aspect of the present invention performs Natural Language Processing (NLP) on the document to obtain the initial question sentence and the initial answer sentence from the content of the document, where the initial question sentence corresponds to the initial answer sentence and is generated according to a set question sentence in the content of the document, and performs target processing on the initial question sentence and the initial answer sentence respectively, so as to obtain question-answer pairs and output them for text recognition.

To achieve the above object, an embodiment of the third aspect of the present invention provides a non-transitory computer-readable storage medium, where instructions in the storage medium, when executed by a processor of an electronic device, enable the electronic device to perform a sentence extraction method for a document combining RPA and AI, the method being the sentence extraction method for a document combining RPA and AI provided by the embodiment of the first aspect of the present invention.

The non-transitory computer-readable storage medium provided in the embodiment of the third aspect of the present invention performs Natural Language Processing (NLP) on a document to obtain an initial question sentence and an initial answer sentence from the content of the document, where the initial question sentence corresponds to the initial answer sentence and is generated according to a set question sentence in the content of the document, and performs target processing on the initial question sentence and the initial answer sentence respectively, so as to obtain question-answer pairs and output them for text recognition.

The fifth aspect of the present invention further provides an electronic device, which includes a housing, a processor, a memory, a circuit board, and a power circuit, wherein the circuit board is disposed inside a space enclosed by the housing, and the processor and the memory are disposed on the circuit board; the power supply circuit is used for supplying power to each circuit or device of the electronic equipment; the memory is used for storing executable program codes; the processor executes a program corresponding to the executable program code by reading the executable program code stored in the memory, for performing: performing Natural Language Processing (NLP) on a document to acquire an initial question and an initial answer from the content of the document, wherein the initial question corresponds to the initial answer, and the initial question is generated according to a set question in the content of the document; and respectively executing target processing on the initial question sentence and the initial answer sentence to obtain question-answer pairs and outputting the question-answer pairs for text recognition.

The electronic device provided by the embodiment of the fifth aspect of the present invention obtains an initial question and an initial answer from the content of a document by performing natural language processing NLP on the document, where the initial question corresponds to the initial answer, the initial question is generated according to a set question in the content of the document, and target processing is performed on the initial question and the initial answer, respectively, so as to obtain a question-answer pair and output the question-answer pair for text recognition.

Additional aspects and advantages of the invention will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of the invention.

Drawings

The foregoing and/or additional aspects and advantages of the present invention will become apparent and readily appreciated from the following description of the embodiments, taken in conjunction with the accompanying drawings of which:

FIG. 1 is a flow chart of a sentence extraction method combining RPA and AI for a document according to an embodiment of the present invention;

FIG. 2 is a flow chart of a sentence extraction method combining RPA and AI for a document according to another embodiment of the present invention;

FIG. 3 is a flow chart of a sentence extraction method combining RPA and AI for a document according to another embodiment of the present invention;

FIG. 4 is a flow chart of a sentence extraction method combining RPA and AI for a document according to another embodiment of the present invention;

FIG. 5 is a schematic structural diagram of a sentence extraction apparatus combining RPA and AI for document according to an embodiment of the present invention;

FIG. 6 is a schematic structural diagram of a computer device according to an embodiment of the present invention.

Detailed Description

Reference will now be made in detail to embodiments of the present invention, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to the same or similar elements or elements having the same or similar function throughout. The embodiments described below with reference to the accompanying drawings are illustrative only for the purpose of explaining the present invention, and are not to be construed as limiting the present invention. On the contrary, the embodiments of the invention include all changes, modifications and equivalents coming within the spirit and terms of the claims appended hereto.

To solve the technical problems in the related art that sentence extraction is easily limited by the document structure and requires considerable manual assistance, resulting in low extraction efficiency and poor extraction results, an embodiment of the invention provides a sentence extraction method for a document that combines RPA and AI. The method performs Natural Language Processing (NLP) on a document to acquire an initial question sentence and an initial answer sentence from the content of the document, where the initial question sentence corresponds to the initial answer sentence and is generated according to a set question sentence in the content of the document, and performs target processing on the initial question sentence and the initial answer sentence respectively, so that question-answer pairs are obtained and output for text recognition. In this way, sentence extraction is not limited by the document structure, standard question sentences and standard answer sentences can be extracted automatically from the content of the document, the sentence extraction effect is improved, and the acquisition of question-answer pairs is effectively assisted.

In addition, sentence extraction in the present invention refers to a sentence extraction process that combines Robotic Process Automation (RPA) and Artificial Intelligence (AI); that is, the process is automated across the whole flow, and it also draws on AI to extract sentences automatically.

The invention can be applied in particular to Natural Language Processing (NLP) within artificial intelligence. NLP is the field of computer science, artificial intelligence, and linguistics concerned with the interaction between computers and human (natural) language.

For example, based on RPA, NLP is performed on the document automatically across the whole flow; then, in combination with some network models from the NLP branch of AI, an initial question sentence and an initial answer sentence are obtained from the content of the document, where the initial question sentence corresponds to the initial answer sentence and is generated according to a set question sentence in the content of the document; and target processing is performed on the initial question sentence and the initial answer sentence respectively to obtain question-answer pairs and output them for text recognition.

Fig. 1 is a flowchart illustrating a sentence extraction method combining RPA and AI for a document according to an embodiment of the present invention.

This embodiment is described by taking as an example the case where the sentence extraction method for a document combining RPA and AI is configured in a sentence extraction apparatus for a document combining RPA and AI.

The method for extracting a sentence combining RPA and AI for a document in this embodiment may be configured in a sentence extracting apparatus combining RPA and AI for a document, and the sentence extracting apparatus combining RPA and AI for a document may be set in a server, or may also be set in an electronic device, which is not limited in this embodiment of the present invention.

The present embodiment takes as an example that a sentence extraction method for a document combining RPA and AI is configured in an electronic device.

The electronic device is, for example, a hardware device running any of various operating systems, such as a smartphone, a tablet computer, a personal digital assistant, or an e-book reader.

In terms of hardware, the execution subject of the embodiment of the present invention may be, for example, a Central Processing Unit (CPU) in the electronic device; in terms of software, it may be, for example, an NLP-related service in the electronic device, which is not limited herein.

Referring to fig. 1, the method includes:

S101: Perform Natural Language Processing (NLP) on the document to acquire an initial question sentence and an initial answer sentence from the content of the document, wherein the initial question sentence corresponds to the initial answer sentence, and the initial question sentence is generated according to a set question sentence in the content of the document.

Optionally, the document is an unstructured document.

An unstructured document refers to a document with an irregular or incomplete document content structure and no predefined content model, and correspondingly, a structured document refers to a document with a regular or complete document content structure and a predefined content model, and the structured document and the unstructured document in this document may be folders, files, or segments in files.

The embodiment of the invention supports the processing of structured documents and unstructured documents.

In the related art, titles, central sentences and the like are extracted from a structured document manually or semi-manually. In this embodiment, by contrast, the electronic device directly acquires the initial question sentence and the initial answer sentence from the content of the document, where the initial question sentence corresponds to the initial answer sentence and is generated according to a set question sentence in the content of the document. This reduces manual assistance, enables automatic question and answer extraction, and makes document processing more efficient.

The initial question sentence and the initial answer sentence described above may be used to generate a standard question sentence and a standard answer sentence for obtaining a question-answer pair. The standard question sentence may be referred to as a first question sentence, and the standard answer sentence may be referred to as a first answer sentence; for how a question-answer pair is obtained from the first question sentence and the first answer sentence, see the embodiments below.

In some embodiments, all question setting sentences of the content of the document may be extracted, and all question setting sentences may be directly used as initial question sentences, and the answer sentence corresponding to each question setting sentence may be used as an initial answer sentence.

In the embodiment of the present invention, referring to fig. 2, the step of obtaining the initial question sentence and the initial answer sentence from the content of the document may specifically include:

S201: Acquire all set question sentences from the content of the document, and determine the sentence following each set question sentence as the answer sentence corresponding to that set question sentence.

In a specific implementation, all set question sentences (question setting sentences, that is, questions that the text poses and then answers itself) may be acquired from the content of the document by splitting the content into sentences, identifying all question sentences in the split content, and removing the rhetorical questions (question-reversing sentences) from them.

Rhetorical questions may be removed from all question sentences according to rhetorical-question keywords, or features of rhetorical questions may be identified so that rhetorical questions are removed according to those features, which is not limited herein.

Removing rhetorical questions from the question sentences according to rhetorical-question keywords can effectively improve the efficiency of sentence acquisition.

In more detail, the content of the document is first split into sentences: all sentences in the content (set question sentences, rhetorical questions, declarative sentences, exclamatory sentences, and so on) are identified by sentence features such as punctuation and line feeds, or alternatively by a pattern matching method, and the content is split with the sentence as the unit. All question sentences, which include for example set question sentences and rhetorical questions, are then identified in the split content; the rhetorical questions among them are identified and removed; and the remaining question sentences are taken as the acquired set question sentences.

Identifying all question sentences in the split document content may specifically be done by detecting question features carried in each sentence, for example whether the sentence contains words such as "what", "how", or "which".

Alternatively, a pattern matching method may be used to identify, among all the sentences, those that match question patterns as question sentences.

A question sentence that carries rhetorical-question keywords can then be identified among all the question sentences as a rhetorical question.

Keywords of rhetorical questions are, for example, "difficult to find" and "how to find", which is not limited herein.
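The splitting and filtering described for S201 can be sketched as follows. This is a minimal illustration and not the patent's implementation: the function names, the English question words, and the rhetorical-question keyword list (the "question-reversing" keywords) are all assumptions chosen for the example, and a real system would use lists suited to the document language.

```python
import re

# Hypothetical keyword lists for illustration only; the source gives no concrete lists.
QUESTION_WORDS = r"(what|how|which)\b"
RHETORICAL_KEYWORDS = ("isn't it", "how could", "don't you think")


def split_sentences(text):
    """Split text into sentences using end punctuation and line feeds as features."""
    chunks = re.findall(r"[^.!?。！？\n]+[.!?。！？]?", text)
    return [c.strip() for c in chunks if c.strip()]


def is_question(sentence):
    """Treat a sentence as a question if it ends with a question mark
    or starts with one of the question words."""
    s = sentence.strip()
    return s.endswith(("?", "？")) or re.match(QUESTION_WORDS, s.lower()) is not None


def is_rhetorical(sentence):
    """Flag a question as rhetorical if it contains any rhetorical keyword."""
    lowered = sentence.lower()
    return any(keyword in lowered for keyword in RHETORICAL_KEYWORDS)


def find_set_questions(text):
    """Return the questions that remain after rhetorical questions are removed."""
    questions = [s for s in split_sentences(text) if is_question(s)]
    return [q for q in questions if not is_rhetorical(q)]
```

Keyword matching keeps the filter cheap, which matches the efficiency argument made above; a feature-based classifier could replace `is_rhetorical` without changing the rest of the flow.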

S202: Merge the consecutive set question sentences among all the set question sentences, and take the merged question sentence together with the other set question sentences as the initial question sentences, where the consecutive set question sentences and the other set question sentences together make up all the set question sentences.

S203: Merge the answer sentences corresponding to the consecutive set question sentences, and take the merged answer sentence together with the answer sentences corresponding to the other set question sentences as the initial answer sentences.

After all the set question sentences are acquired from the content of the document and the sentence following each set question sentence is determined as its answer sentence, the consecutive set question sentences are identified among all the set question sentences and merged. The merged question sentence and the other (non-consecutive) set question sentences are used as the initial question sentences; the consecutive set question sentences and the other set question sentences together make up all the set question sentences.

Merging consecutive set question sentences means combining several set question sentences that appear one after another into a single question sentence.

For example, the consecutive question sentences are: ① Who is the author of this article? ② From which country? After merging: Who is the author of this article? From which country?

The merged question sentence and the other set question sentences are then used as the initial question sentences.

Correspondingly, the answer sentences corresponding to the question sentences are processed in the same way to obtain initial answer sentences.
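A minimal sketch of the merging in S202 and S203 is given below. It assumes one simple reading of the text, namely that a run of consecutive questions is joined into one initial question and that the sentence following the run is taken as its answer; the function name and the `is_question` predicate are hypothetical and not taken from the source.

```python
def merge_consecutive_questions(sentences, is_question):
    """Group runs of consecutive questions and pair each run with the
    sentence that follows it (assumed here to be the answer)."""
    pairs = []
    i = 0
    n = len(sentences)
    while i < n:
        if not is_question(sentences[i]):
            i += 1
            continue
        # Extend j to the end of the run of consecutive questions.
        j = i
        while j < n and is_question(sentences[j]):
            j += 1
        merged_question = " ".join(sentences[i:j])
        # An empty string marks a question run with no following sentence.
        answer = sentences[j] if j < n else ""
        pairs.append((merged_question, answer))
        i = j + 1  # skip past the answer sentence
    return pairs
```

For the example above, the two questions about the author would be returned as one merged initial question paired with the single sentence that answers both.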

Because the sentence extraction process takes into account the Chinese rhetorical pattern of set questions in the document, the reference value of the acquired sentences is effectively improved; and merging the consecutive question sentences and their corresponding answer sentences improves the overall execution efficiency of the method and the user experience.

S102: Perform target processing on the initial question sentence and the initial answer sentence respectively, so as to obtain question-answer pairs and output the question-answer pairs for text recognition.

In some embodiments, any possible method may be used to process the initial question sentence and the initial answer sentence so as to obtain a first question sentence and a first answer sentence, from which a question-answer pair is then obtained. For example, the initial question sentence and the initial answer sentence may each be input into a neural network model, and the output of the model may be used as the first question sentence and the first answer sentence, where the neural network model has learned the correspondence between initial question sentences and initial answer sentences on the one hand and the corresponding first question sentences and first answer sentences on the other.

Of course, the neural network model is only one possible implementation for obtaining the first question sentence and the first answer sentence. In actual execution, they may be obtained in any other possible manner, for example by a conventional programming technique (such as a simulation method or an engineering method) or by a genetic algorithm.

In the embodiment of the invention, a first sentence vector corresponding to the initial question sentence and a second sentence vector corresponding to the initial answer sentence may be determined; the first sentence vectors are clustered to obtain a question cluster corresponding to the initial question sentence, and the second sentence vectors are clustered to obtain an answer sentence cluster corresponding to the initial answer sentence; a first question sentence is determined from the question cluster and a first answer sentence is determined from the answer sentence cluster; and question-answer pairs are obtained and output based on the first question sentence and the first answer sentence. This effectively reduces the share of manual labeling in the sentence extraction and mining process and improves sentence extraction efficiency, and the use of sentence vectors and clustering achieves a better sentence mining effect.

The first sentence vector corresponding to the initial question sentence and the second sentence vector corresponding to the initial answer sentence may be determined by using pre-trained word vectors together with the Smooth Inverse Frequency (SIF) algorithm to encode all the initial question sentences and all the initial answer sentences as sentence vectors; the sentence vector of an initial question sentence is taken as a first sentence vector, and the sentence vector of an initial answer sentence is taken as a second sentence vector. The pre-trained word vectors may be obtained by training on texts in the same field with the word2vec algorithm; if the amount of text is small (for example, less than 100M), open-source pre-trained word vectors may be used directly to encode all the initial question sentences and all the initial answer sentences as sentence vectors, which is not limited herein.
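The weighting step of SIF can be sketched as follows. This is a simplified illustration under stated assumptions: word probabilities are estimated from the input sentences themselves (in practice they usually come from a large corpus), the smoothing parameter `a` defaults to 1e-3, and the common-component removal that the full SIF algorithm applies after averaging is omitted. The function name and the toy `word_vectors` dictionary are hypothetical stand-ins for pre-trained word2vec vectors.

```python
from collections import Counter


def sif_sentence_vectors(tokenized_sentences, word_vectors, a=1e-3):
    """Encode each tokenized sentence as a weighted average of its word vectors,
    with weight a / (a + p(w)) so that frequent words contribute less.
    Sentences with no known words are encoded as None."""
    counts = Counter(tok for sent in tokenized_sentences for tok in sent)
    total = sum(counts.values())
    vectors = []
    for sent in tokenized_sentences:
        known = [tok for tok in sent if tok in word_vectors]
        if not known:
            vectors.append(None)
            continue
        dim = len(word_vectors[known[0]])
        acc = [0.0] * dim
        for tok in known:
            weight = a / (a + counts[tok] / total)
            for k, x in enumerate(word_vectors[tok]):
                acc[k] += weight * x
        vectors.append([x / len(known) for x in acc])
    return vectors
```

The down-weighting of frequent words is what lets a plain average of word vectors serve as a usable sentence vector for the clustering step that follows.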

The first sentence vectors may be clustered with the HDBSCAN clustering algorithm to obtain the question cluster corresponding to the initial question sentence, and likewise the second sentence vectors may be clustered with the HDBSCAN clustering algorithm to obtain the answer sentence cluster corresponding to the initial answer sentence, which is not limited herein.

In a specific implementation, when the first question sentence is determined from the question cluster and the first answer sentence is determined from the answer sentence cluster, a first cluster center of the question cluster and a second cluster center of the answer sentence cluster may be determined; all first sentence vectors in the question cluster are traversed to find the target first sentence vector with the smallest cosine distance to the first cluster center, and the initial question sentence corresponding to the target first sentence vector is taken as the first question sentence; all second sentence vectors in the answer sentence cluster are traversed to find the target second sentence vector with the smallest cosine distance to the second cluster center, and the initial answer sentence corresponding to the target second sentence vector is taken as the first answer sentence.

To determine the first question sentence from the question cluster, after the first sentence vectors are clustered to obtain the question cluster corresponding to the initial question sentence, all the first sentence vectors in the question cluster may be averaged to obtain the cluster center; all the first sentence vectors in the question cluster are then traversed to find the one with the smallest cosine distance to the cluster center (that is, the highest cosine similarity), and the initial question sentence corresponding to that first sentence vector is taken as the first question sentence, which is not limited herein.

Likewise, to determine the first answer sentence from the answer sentence cluster, after the second sentence vectors are clustered to obtain the answer sentence cluster corresponding to the initial answer sentence, all the second sentence vectors in the answer sentence cluster may be averaged to obtain the cluster center; all the second sentence vectors in the answer sentence cluster are then traversed to find the one with the smallest cosine distance to the cluster center (that is, the highest cosine similarity), and the initial answer sentence corresponding to that second sentence vector is taken as the first answer sentence, which is not limited herein.
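Selecting the representative sentence of one cluster can be sketched as follows: average the sentence vectors to get the cluster center, then pick the sentence whose vector has the smallest cosine distance to that center. The function names are illustrative assumptions, and the cluster itself is assumed to have been produced already by a clustering step.

```python
import math


def centroid(vectors):
    """Average the vectors component by component to obtain the cluster center."""
    dim = len(vectors[0])
    return [sum(v[k] for v in vectors) / len(vectors) for k in range(dim)]


def cosine_distance(u, v):
    """Return 1 minus the cosine similarity; zero vectors get distance 1.0."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 1.0
    return 1.0 - dot / (norm_u * norm_v)


def pick_representative(sentences, vectors):
    """Return the sentence whose vector is closest (by cosine distance)
    to the center of the cluster."""
    center = centroid(vectors)
    best = min(range(len(vectors)), key=lambda i: cosine_distance(vectors[i], center))
    return sentences[best]
```

The same function serves both clusters: applied to a question cluster it yields the first question sentence, and applied to an answer sentence cluster it yields the first answer sentence.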

In the embodiment, the initial question and the initial answer are obtained from the content of the document by performing Natural Language Processing (NLP) on the document, the initial question corresponds to the initial answer, the initial question is generated according to the set question in the content of the document, and target processing is respectively performed on the initial question and the initial answer, so that a question-answer pair is obtained and output for text recognition, the sentence extraction can be prevented from being limited by the structure of the document when the sentence is extracted, the standard question and the standard answer can be automatically extracted from the content of the document, the sentence extraction effect is improved, and the acquisition of the question-answer pair is effectively assisted.

To solve the technical problems in the related art that considerable manual assistance is required and that question-answer pairs are obtained inefficiently and with poor results, the method may further perform the steps of the embodiment shown in fig. 3. In that embodiment the answer sentence is shortened based on the question sentence, which avoids answer sentences that are too long and carry too much redundant information, so that the generated question-answer pairs give a better user experience when a dialogue robot uses them to reply in a conversation.

Fig. 3 is a flowchart illustrating a sentence extraction method combining RPA and AI for a document according to an embodiment of the present invention.

Referring to fig. 3, the method includes:

S301: Determine a target distance between a first question sentence and a first answer sentence, the first question sentence corresponding to the first answer sentence.

The first question and the first answer sentence are standard question and standard answer sentences which can directly identify and extract question-answer pairs, the first question and the first answer sentence are corresponding, namely, the first answer sentence is an answer sentence of the first question, and the first answer sentence comprises an answer corresponding to the first question.

The first question sentence may be, for example: Who is the father of artificial intelligence?

The first answer sentence may be, for example: The father of artificial intelligence is Alan Turing of the United Kingdom.

In the prior art, question-answer pairs for use as corpora in the field of intelligent question answering are obtained from standard question sentences with the assistance of manual labeling and screening. By contrast, in the embodiment of the invention the electronic device directly and automatically determines the target distance between the first question sentence and the first answer sentence, which assists the subsequent automatic acquisition of question-answer pairs and thereby improves the efficiency of acquiring them.

In the specific implementation process, the electronic device may analyze the content of the document in advance, automatically extract a standard question from the content and use the standard question as a first question, extract a standard answer as a first answer, and then determine the target distance between the first question and the first answer.

The target distance may be, for example, an edit distance between the first question sentence and the first answer sentence.

The edit distance is a quantitative measure of the difference between two character strings (e.g., English text): it is the minimum number of editing operations required to change one string into the other.
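As an illustration of how such a distance can be computed, the following is a minimal sketch that assumes the Levenshtein variant of edit distance (single-character insertion, deletion and substitution, each with cost 1). The source does not state which variant is used, and the function name `edit_distance` is hypothetical.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance (an assumed variant): the minimum number of
    single-character insertions, deletions or substitutions turning a into b."""
    # prev[j] is the distance between the already-processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance between a[:i] and the empty prefix of b
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca from a
                curr[j - 1] + 1,     # insert cb into a
                prev[j - 1] + cost,  # substitute ca with cb (or keep a match)
            ))
        prev = curr
    return prev[-1]
```

The sketch keeps only two rows of the dynamic-programming table, so memory grows with the length of the second string rather than with the product of both lengths.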

Taking the first question sentence "Who is the father of artificial intelligence?" and the first answer sentence "The father of artificial intelligence is Alan Turing of the United Kingdom." as an example, if the edit distance between the two is 9, then 9 may be used as the target distance between the first question sentence and the first answer sentence, which is not limited herein.

In a specific implementation process, an NLP-related algorithm may be adopted to parse the semantics of the first question and the first answer, so as to determine an edit distance between the first question and the first answer, and use the edit distance as a target distance.

S302: The first answer sentence is abbreviated according to the target distance to obtain a target answer sentence.

In some embodiments, a corresponding abbreviation manner may be determined according to the target distance, the first answer sentence is abbreviated in that manner, and the abbreviated first answer sentence is used as the target answer sentence. For example, the features of the first question sentence and the features of the first answer sentence are identified and, together with the target distance, are input into a preset model, and the corresponding abbreviation manner is determined according to the output of the preset model. The abbreviation manner may be, for example, deleting a set number of characters from the first answer sentence, which is not limited herein.

In the embodiment of the invention, the length values of the first question and the first answer sentence can be determined, the proportional value between the target distance and the length value is determined, the proportional value is compared with the set threshold value, and the first answer sentence is abbreviated according to the comparison result to obtain the target answer sentence.

In the embodiment of the invention, when the proportional value is smaller than the set threshold, the first answer sentence is abbreviated to obtain the target answer sentence. A proportional value smaller than the set threshold indicates that the first question sentence and the first answer sentence share a large amount of repeated text, so the first answer sentence can be abbreviated directly to obtain the target answer sentence. This effectively matches the textual characteristics of question sentences and answer sentences in practical applications and thus ensures that an accurate question-answer pair is obtained.

The set threshold may be preset, or may be dynamically adjusted in the actual application process, and specifically may be preset by a factory program of the electronic device, or may be set by a user according to an actual use requirement, which is not limited thereto.

The set threshold may be, for example, 0.5.

Taking the first question sentence "Who is the father of artificial intelligence?" and the first answer sentence "The father of artificial intelligence is Alan Turing of the United Kingdom." as an example, the length value of the first question sentence is 9, the length value of the first answer sentence is 16, the total length value of the two is 25, and the target distance is 9, so the proportional value between the target distance and the length value is 9/25 = 0.36. Comparing 0.36 with 0.5, since 0.36 is less than 0.5, the first answer sentence may be abbreviated to obtain the target answer sentence, for example "Alan Turing of the United Kingdom."
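The comparison in this example can be written out as a short sketch. It assumes, as in the worked numbers, that the length value is the sum of the two sentence lengths. The function name `should_abbreviate` and the handling of empty input are hypothetical and not taken from the source.

```python
def should_abbreviate(target_distance: int, question_len: int,
                      answer_len: int, threshold: float = 0.5) -> bool:
    """Return True when target_distance / (question_len + answer_len)
    is smaller than the set threshold."""
    total_len = question_len + answer_len
    if total_len == 0:
        # Assumption: nothing to abbreviate when both sentences are empty.
        return False
    return target_distance / total_len < threshold


# Worked example from the text: distance 9, lengths 9 and 16, threshold 0.5.
ratio = 9 / (9 + 16)                       # 9/25
decision = should_abbreviate(9, 9, 16)     # 0.36 < 0.5, so abbreviation applies
```

A small ratio means that few edits separate the two sentences relative to their combined length, which is the situation in which the answer largely repeats the question.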

S303: A question-answer pair is obtained according to the first question sentence and the target answer sentence.

Since the first question sentence and the first answer sentence are a standard question sentence and a standard answer sentence from which a question-answer pair can be directly identified and extracted, after the first answer sentence is abbreviated according to the target distance to obtain the target answer sentence, the first question sentence and the target answer sentence may be used directly as the question-answer pair [Who is the father of artificial intelligence? Alan Turing of the United Kingdom.]. Alternatively, the first question sentence may also be abbreviated correspondingly, and the abbreviated first question sentence and the target answer sentence are used as the question-answer pair [Who? Alan Turing of the United Kingdom.]. This is described below.

Optionally, in some embodiments, referring to fig. 4, the step of obtaining the target answer by performing abbreviation processing on the first answer may further include:

S401: The longest common substring between the first question sentence and the first answer sentence is determined.

Taking the first question sentence "Who is the father of artificial intelligence?" and the first answer sentence "The father of artificial intelligence is Alan Turing of the United Kingdom." as an example, the longest common substring between the first question sentence and the first answer sentence is "the father of artificial intelligence is".

S402: The longest common substring is deleted from the first answer sentence, so that the first answer sentence is abbreviated to obtain the target answer sentence.

In a specific execution process, the longest common substring may be deleted from the first answer sentence so that the first answer sentence is abbreviated to obtain the target answer sentence. That is, after the longest common substring "the father of artificial intelligence is" is deleted from the first answer sentence, the target answer sentence is "Alan Turing of the United Kingdom."
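Steps S401 and S402 can be sketched as follows. This is a minimal illustration using a standard dynamic-programming search for the longest common substring. The function names are hypothetical, and the whitespace trimming after deletion is an assumption added for readability.

```python
def longest_common_substring(a: str, b: str) -> str:
    """Return the longest contiguous substring shared by a and b
    (the first one found if several have the same length)."""
    best_len = 0
    best_end = 0  # exclusive end index of the best match inside a
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                # Extend the common run that ended at a[i-2] and b[j-2].
                curr[j] = prev[j - 1] + 1
                if curr[j] > best_len:
                    best_len = curr[j]
                    best_end = i
        prev = curr
    return a[best_end - best_len:best_end]


def abbreviate_answer(question: str, answer: str) -> str:
    """Delete the longest common substring from the answer (S402)."""
    common = longest_common_substring(question, answer)
    if not common:
        return answer
    # Remove only the first occurrence; trimming whitespace is an assumption.
    return answer.replace(common, "", 1).strip()


# Hypothetical input, chosen so that question and answer share a prefix.
target = abbreviate_answer("capital of France is?", "capital of France is Paris")
```

The comparison here is case-sensitive and character-based, so in practice the result depends on how closely the wording of the answer repeats the wording of the question.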

By determining the longest common substring between the first question sentence and the first answer sentence and deleting it from the first answer sentence, the first answer sentence is abbreviated to obtain the target answer sentence, and the first question sentence and the target answer sentence can then be used efficiently to obtain the finally required question-answer pair. This can significantly reduce the proportion of manual auxiliary labeling in the process of mining question-answer pairs and improve the efficiency of sentence extraction combining RPA and AI for a document.

In this embodiment, a target distance between a first question sentence and a first answer sentence is determined, the first question sentence corresponding to the first answer sentence; the first answer sentence is abbreviated according to the target distance to obtain a target answer sentence; and a question-answer pair is obtained according to the first question sentence and the target answer sentence. In this way, the corresponding question-answer pair can be obtained automatically from the question sentences in a document, which improves the efficiency of obtaining question-answer pairs and the effect of sentence extraction combining RPA and AI for a document. At the same time, abbreviating the answer sentence based on the question sentence prevents the answer sentence from being overly long and carrying too much redundant information, so that the generated question-answer pair gives a better user experience when a dialogue robot uses it to reply in a conversation.

Fig. 5 is a schematic structural diagram of a sentence extraction apparatus combining RPA and AI for a document according to an embodiment of the present invention.

Referring to fig. 5, the apparatus 500 includes:

an obtaining module 501, configured to perform natural language processing NLP on a document to obtain an initial question and an initial answer from the content of the document, where the initial question corresponds to the initial answer, and the initial question is generated according to a set question in the content of the document.

an execution module 502, configured to perform target processing on the initial question sentence and the initial answer sentence respectively, so as to obtain a question-answer pair and output it for text recognition.

Optionally, in some embodiments, the obtaining module 501 is specifically configured to:

acquiring all set question sentences from the content of the document, and determining the sentence following each set question sentence as the answer sentence corresponding to that set question sentence.

Optionally, in some embodiments, the obtaining module 501 is further specifically configured to:

merging the consecutive set question sentences among all the set question sentences, and taking the merged set question sentence and the other set question sentences as the initial question sentences, where the consecutive set question sentences and the other set question sentences together make up all the set question sentences;

and merging the answer sentences corresponding to the set question sentences in the consecutive set question sentences, and taking the merged answer sentence and the answer sentences corresponding to the other set question sentences as the initial answer sentences.

Optionally, in some embodiments, the execution module 502 is specifically configured to:

determining a first sentence vector corresponding to the initial question sentence and a second sentence vector corresponding to the initial answer sentence;

clustering the first sentence vectors to obtain a question cluster corresponding to the initial question, and clustering the second sentence vectors to obtain an answer sentence cluster corresponding to the initial answer sentence;

determining a first question from the question cluster and determining a first answer from the answer cluster;

and obtaining question-answer pairs and outputting the question-answer pairs based on the first question sentences and the first answer sentences.

determining a first clustering center in a question cluster and determining a second clustering center in an answer cluster;

traversing all first sentence vectors in the question cluster, determining a target first sentence vector closest to the cosine distance of the first cluster center, and taking an initial question corresponding to the target first sentence vector as a first question;

traversing all second sentence vectors in the answer sentence cluster, determining a target second sentence vector closest to the cosine distance of the second cluster center, and taking an initial answer sentence corresponding to the target second sentence vector as a first answer sentence.

and clustering the first sentence vectors by adopting an hdbscan clustering algorithm, and clustering the second sentence vectors by adopting an hdbscan clustering algorithm.

sentence dividing processing is carried out on the content of the document;

identifying all question sentences from the content of the document after sentence segmentation processing;

and eliminating the rhetorical questions from all the question sentences, so as to obtain all the set question sentences.

and removing the rhetorical questions from the question sentences according to keywords of rhetorical questions.

determining a target distance between a first question and a first answer, wherein the first question corresponds to the first answer;

according to the target distance, carrying out abbreviation processing on the first answer sentence to obtain a target answer sentence;

and obtaining question-answer pairs according to the first question sentences and the target answer sentences.

determining the length values of the first question sentence and the first answer sentence;

determining a proportional value between the target distance and the length value;

comparing the proportional value with a set threshold value;

and according to the comparison result, carrying out abbreviation processing on the first answer sentence to obtain a target answer sentence.

Optionally, in some embodiments, the execution module 502 is further specifically configured to:

and when the proportion value is smaller than a set threshold value, carrying out abbreviation processing on the first answer sentence to obtain a target answer sentence.

determining the longest common substring between the first question sentence and the first answer sentence;

and deleting the longest common substring in the first answer sentence, so as to shorten the first answer sentence to obtain the target answer sentence.

Optionally, in some embodiments, the document is an unstructured document.
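The acquisition steps listed above (sentence division, identifying question sentences, filtering out by keyword the questions that are not meant to be answered, taking the next sentence as the answer, and merging consecutive questions) can be sketched as follows. This is a simplified illustration under several assumptions: question sentences are recognized by a trailing question mark, the keyword list `RHETORICAL_KEYWORDS` is hypothetical, and a run of consecutive questions is paired with the single sentence that follows the run. None of these details is specified in the source.

```python
import re
from typing import List, Tuple

# Hypothetical keyword list; the source only says that keywords are used.
RHETORICAL_KEYWORDS = ("isn't it", "don't you think", "how could")


def split_sentences(text: str) -> List[str]:
    """Split after sentence-ending punctuation (ASCII or full-width)."""
    parts = re.split(r"(?<=[.!?。！？])\s*", text)
    return [p.strip() for p in parts if p.strip()]


def is_question(sentence: str) -> bool:
    return sentence.endswith(("?", "？"))


def is_rhetorical(sentence: str) -> bool:
    lowered = sentence.lower()
    return any(keyword in lowered for keyword in RHETORICAL_KEYWORDS)


def is_set_question(sentence: str) -> bool:
    return is_question(sentence) and not is_rhetorical(sentence)


def extract_initial_pairs(text: str) -> List[Tuple[str, str]]:
    """Return (initial question, initial answer) pairs from free text."""
    sentences = split_sentences(text)
    pairs = []
    i = 0
    while i < len(sentences):
        if not is_set_question(sentences[i]):
            i += 1
            continue
        # Collect a run of consecutive questions and merge them.
        run = []
        while i < len(sentences) and is_set_question(sentences[i]):
            run.append(sentences[i])
            i += 1
        # Simplification: the sentence after the run is taken as the answer.
        if i < len(sentences) and not is_question(sentences[i]):
            pairs.append((" ".join(run), sentences[i]))
            i += 1
    return pairs
```

Because the sketch relies only on punctuation and keywords, it does not depend on headings, tables or other document structure, which is in line with the unstructured-document case mentioned above.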

It should be noted that the explanation of the embodiment of fig. 1 to 4 of the sentence extraction method combining RPA and AI for a document also applies to the sentence extraction apparatus 500 combining RPA and AI for a document of this embodiment, and the implementation principle is similar, and is not described herein again.
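The selection of a representative sentence from a cluster, as performed by the execution module (traversing the sentence vectors of a cluster and taking the one with the smallest cosine distance to the cluster center), can be sketched as follows. The sketch assumes that cluster membership has already been produced by a clustering step such as hdbscan and that the cluster center is the arithmetic mean of the member vectors. The source does not define how the center is computed, and the function names are hypothetical.

```python
import math
from typing import List, Sequence


def cosine_distance(u: Sequence[float], v: Sequence[float]) -> float:
    """1 minus the cosine similarity; defined as 1.0 for a zero vector."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 1.0
    return 1.0 - dot / (norm_u * norm_v)


def cluster_center(vectors: Sequence[Sequence[float]]) -> List[float]:
    """Assumed center: the component-wise mean of the cluster's vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def representative_index(vectors: Sequence[Sequence[float]]) -> int:
    """Index of the sentence vector closest to the cluster center."""
    center = cluster_center(vectors)
    return min(range(len(vectors)),
               key=lambda i: cosine_distance(vectors[i], center))
```

Applied once to the question cluster and once to the answer sentence cluster, the returned index identifies the initial question sentence and the initial answer sentence that would serve as the first question sentence and the first answer sentence.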

The electronic device can be a mobile phone, a tablet computer and the like.

Referring to fig. 6, the electronic device 60 of the present embodiment includes a housing 601, a processor 602, a memory 603, a circuit board 604 and a power supply circuit 605, wherein the circuit board 604 is disposed inside the space enclosed by the housing 601, and the processor 602 and the memory 603 are disposed on the circuit board 604; the power supply circuit 605 is configured to supply power to each circuit or device of the electronic device 60; the memory 603 is configured to store executable program code; and the processor 602 runs a program corresponding to the executable program code by reading the executable program code stored in the memory 603, so as to perform:

performing Natural Language Processing (NLP) on a document to acquire an initial question and an initial answer from the content of the document, wherein the initial question corresponds to the initial answer, and the initial question is generated according to a set question in the content of the document;

and respectively executing target processing on the initial question sentence and the initial answer sentence to obtain question-answer pairs and outputting the question-answer pairs for text recognition.

It should be noted that the explanation of the embodiment of the sentence extraction method combining RPA and AI for a document in the foregoing fig. 1-4 is also applicable to the electronic device 60 of this embodiment, and the implementation principle is similar, and is not described herein again.

The computer device in this embodiment performs natural language processing NLP on a document to obtain an initial question and an initial answer from the content of the document, where the initial question corresponds to the initial answer, the initial question is generated according to a set question in the content of the document, and target processing is performed on the initial question and the initial answer, respectively, so as to obtain a question-answer pair and output the question-answer pair for text recognition.

In order to implement the above embodiments, the present invention also proposes a non-transitory computer-readable storage medium, wherein instructions in the storage medium, when executed by a processor of a terminal, enable the terminal to perform a sentence extraction method combining RPA and AI for a document, the method including:

The non-transitory computer-readable storage medium in this embodiment performs natural language processing NLP on a document to obtain an initial question and an initial answer from the content of the document, where the initial question corresponds to the initial answer, the initial question is generated according to a set question in the content of the document, and target processing is performed on the initial question and the initial answer, respectively, so as to obtain a question-answer pair and output the question-answer pair for text recognition.

It should be noted that the terms "first," "second," and the like in the description of the present invention are used for descriptive purposes only and are not to be construed as indicating or implying relative importance. In addition, in the description of the present invention, "a plurality" means two or more unless otherwise specified.

Any process or method descriptions in flow charts or otherwise described herein may be understood as representing modules, segments, or portions of code which include one or more executable instructions for implementing specific logical functions or steps of the process, and alternate implementations are included within the scope of the preferred embodiment of the present invention in which functions may be executed out of order from that shown or discussed, including substantially concurrently or in reverse order, depending on the functionality involved, as would be understood by those reasonably skilled in the art of the present invention.

It should be understood that portions of the present invention may be implemented in hardware, software, firmware, or a combination thereof. In the above embodiments, the various steps or methods may be implemented in software or firmware stored in memory and executed by a suitable instruction execution system. For example, if implemented in hardware, as in another embodiment, any one or combination of the following techniques, which are known in the art, may be used: a discrete logic circuit having a logic gate circuit for implementing a logic function on a data signal, an application specific integrated circuit having an appropriate combinational logic gate circuit, a Programmable Gate Array (PGA), a Field Programmable Gate Array (FPGA), or the like.

It will be understood by those skilled in the art that all or part of the steps carried by the method for implementing the above embodiments may be implemented by hardware related to instructions of a program, which may be stored in a computer readable storage medium, and when the program is executed, the program includes one or a combination of the steps of the method embodiments.

In addition, functional units in the embodiments of the present invention may be integrated into one processing module, or each unit may exist alone physically, or two or more units are integrated into one module. The integrated module can be realized in a hardware mode, and can also be realized in a software functional module mode. The integrated module, if implemented in the form of a software functional module and sold or used as a stand-alone product, may also be stored in a computer readable storage medium.

The storage medium mentioned above may be a read-only memory, a magnetic or optical disk, etc.

In the description herein, references to the description of the term "one embodiment," "some embodiments," "an example," "a specific example," or "some examples," etc., mean that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. In this specification, the schematic representations of the terms used above do not necessarily refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.

Although embodiments of the present invention have been shown and described above, it is understood that the above embodiments are exemplary and should not be construed as limiting the present invention, and that variations, modifications, substitutions and alterations can be made to the above embodiments by those of ordinary skill in the art within the scope of the present invention.

Claims

1. A sentence extraction method for a document combining RPA and AI, the method comprising:

performing Natural Language Processing (NLP) on a document to acquire an initial question sentence and an initial answer sentence from the content of the document, wherein the initial question sentence corresponds to the initial answer sentence, and the initial question sentence is generated according to a set question sentence in the content of the document;

2. The RPA and AI-combined sentence extraction method for a document according to claim 1, wherein the obtaining of the initial question sentence and the initial answer sentence from the content of the document comprises:

and acquiring all question setting sentences from the content of the document, and determining the next sentence of each question setting sentence as an answer sentence corresponding to each question setting sentence.

3. The RPA and AI-combined sentence extraction method for a document according to claim 2, wherein the obtaining of the initial question sentence and the initial answer sentence from the content of the document further comprises:

combining the continuous question setting sentences in all question setting sentences, taking the combined question setting sentences and other question setting sentences as the initial question sentences, and combining the continuous question setting sentences and the other question setting sentences to form all the question setting sentences;

and merging the answer sentences corresponding to all the question sentences in the continuous question sentences, and taking the merged answer sentences and the answer sentences corresponding to other question sentences as the initial answer sentences.

4. The RPA and AI-combined sentence extraction method for a document according to claim 1, wherein the performing of the target processing on the initial question sentence and the initial answer sentence, respectively, to thereby obtain a question-answer pair and outputting it comprises:

clustering the first sentence vectors to obtain a question cluster corresponding to the initial question, and clustering the second sentence vectors to obtain an answer cluster corresponding to the initial answer;

determining the first question from the question cluster and determining the first answer from the answer cluster;

and obtaining and outputting the question-answer pair based on the first question sentence and the first answer sentence.

5. The RPA and AI-combined sentence extraction method for a document according to claim 4, wherein the determining the first question sentence from the question cluster and the first answer sentence from the answer cluster comprises:

determining a first clustering center in the question cluster and determining a second clustering center of the answer cluster;

traversing all first sentence vectors in the question cluster, determining a target first sentence vector closest to the cosine distance of the first cluster center, and taking an initial question corresponding to the target first sentence vector as the first question;

traversing all second sentence vectors in the answer sentence cluster, determining a target second sentence vector closest to the cosine distance of the second cluster center, and taking an initial answer sentence corresponding to the target second sentence vector as the first answer sentence.

6. The RPA and AI-combined sentence extraction method for a document according to claim 4, wherein,

and clustering the first sentence vectors by adopting an hdbscan clustering algorithm, and clustering the second sentence vectors by adopting the hdbscan clustering algorithm.

7. The RPA and AI-combined sentence extraction method for a document according to claim 2, wherein the obtaining of all of the questioning sentences from the content of the document comprises:

sentence dividing processing is carried out on the content of the document;

and eliminating the rhetorical questions from all the question sentences, so as to obtain all the set question sentences.

8. The RPA and AI-combined sentence extraction method for a document according to claim 7, wherein the eliminating of the rhetorical questions from all the question sentences comprises:

9. The RPA and AI-combined sentence extraction method for a document according to claim 4, wherein the obtaining and outputting of the question-answer pair based on the first question sentence and the first answer sentence comprises:

and acquiring question-answer pairs according to the first question sentences and the target answer sentences.

10. The RPA and AI-combined sentence extraction method for a document according to claim 9, wherein the obtaining of the target answer sentence by abbreviating the first answer sentence according to the target distance comprises:

determining a ratio value between the target distance and the length value;

comparing the proportional value with a set threshold value;

11. The RPA and AI-combined sentence extraction method for a document according to claim 10, wherein the obtaining of the target answer sentence by abbreviating the first answer sentence according to the result of the comparison comprises:

and when the proportion value is smaller than the set threshold value, carrying out abbreviation processing on the first answer sentence to obtain a target answer sentence.

12. The RPA and AI-combined sentence extraction method for a document according to claim 10 or 11, wherein the abbreviating the first answer to obtain a target answer comprises:

and deleting the longest common substring in the first answer sentence, so as to shorten the first answer sentence to obtain a target answer sentence.

13. The RPA and AI-combined sentence extraction method for a document according to any of the claims 1-12, characterized in that the document is an unstructured document.

14. A combined RPA and AI sentence extraction apparatus for a document, the apparatus comprising:

an acquisition module, configured to perform Natural Language Processing (NLP) on a document to acquire an initial question and an initial answer from a content of the document, where the initial question corresponds to the initial answer, and the initial question is generated according to a question set in the content of the document;

and the execution module is used for respectively executing target processing on the initial question sentence and the initial answer sentence so as to obtain question-answer pairs and output the question-answer pairs for text recognition.

15. A non-transitory computer-readable storage medium on which a computer program is stored, the program, when executed by a processor, implementing the RPA and AI-combined sentence extraction method for a document according to any of claims 1-13.

16. An electronic device comprising a housing, a processor, a memory, a circuit board, and a power circuit, wherein the circuit board is disposed inside a space enclosed by the housing, the processor and the memory being disposed on the circuit board; the power supply circuit is used for supplying power to each circuit or device of the electronic equipment; the memory is used for storing executable program codes; the processor executes a program corresponding to the executable program code by reading the executable program code stored in the memory, for performing: