CN115688792A

CN115688792A - Problem generation method and device based on document and server

Info

Publication number: CN115688792A
Application number: CN202211182177.9A
Authority: CN
Inventors: 范晓东
Original assignee: Industrial and Commercial Bank of China Ltd ICBC; ICBC Technology Co Ltd
Current assignee: Industrial and Commercial Bank of China Ltd ICBC; ICBC Technology Co Ltd
Priority date: 2022-09-27
Filing date: 2022-09-27
Publication date: 2023-02-03

Abstract

The application provides a problem generation method, a problem generation device and a problem generation server based on a document, which relate to the document processing technology, and the method comprises the following steps: acquiring a document to be analyzed; the document to be analyzed comprises multiple levels of titles, and hierarchical relationships exist among the multiple levels of titles. And according to the hierarchical relation of the multilevel titles in the document to be analyzed, performing semantic analysis processing on the document to be analyzed to obtain semantic information under each hierarchy. And generating a semantic association diagram of the document to be analyzed according to the semantic information under each level. And performing problem prediction on the semantic association diagram according to a preset problem generation model to obtain target problem information of the document to be analyzed. The method improves the semantic accuracy of the target problem information and solves the technical problem of poor semantic accuracy of the generated problem.

Description

Problem generation method and device based on document and server

Technical Field

The present application relates to document processing technologies, and in particular, to a method, an apparatus, and a server for generating a problem based on a document.

Background

At present, in order to better realize question-answer interaction, a question generation model needs to be obtained according to the corresponding relation between the documents and the questions of the documents.

In the prior art, a question generation model for predicting the questions of a document is usually obtained by training on short documents based on formulated question templates, and questions for short documents are then generated according to that model.

However, because such a question generation model can only generate questions for short documents, it cannot generate accurate questions for long answers, and the semantic accuracy of the generated questions is poor.

Disclosure of Invention

The application provides a problem generation method and device based on a document and a server, which are used for solving the technical problem that the semantic accuracy of a generated problem is poor.

In a first aspect, the present application provides a method for generating a problem based on a document, including:

acquiring a document to be analyzed; the document to be analyzed comprises multiple levels of titles, and hierarchical relationships exist among the multiple levels of titles;

according to the hierarchical relation of the multilevel titles in the document to be analyzed, carrying out semantic analysis processing on the document to be analyzed to obtain semantic information under each hierarchy;

generating a semantic association diagram of the document to be analyzed according to the semantic information under each level;

performing problem prediction on the semantic association diagram according to a preset problem generation model to obtain target problem information of the document to be analyzed; the question generation model is obtained by training data according to a plurality of preset question-answer pairs, and answers in the question-answer pair data are documents.

Further, the semantic analysis processing is performed on the document to be analyzed according to the hierarchical relationship of the multilevel titles in the document to be analyzed to obtain semantic information at each hierarchical level, and the semantic information includes:

segmenting the document to be analyzed according to the hierarchical relation of the multilevel titles of the document to be analyzed to obtain a text block under each hierarchy; the text block comprises a title of a hierarchy to which the text block belongs and body content under the title;

and carrying out semantic analysis processing on the text block under each level to obtain semantic information of the text block under each level.

Further, the semantic parsing processing on the text block at each level to obtain semantic information of the text block at each level includes:

segmenting the title in the text block under each level, and determining a part-of-speech tag and a named entity tag of each obtained segmentation;

determining the participles whose part-of-speech tag is a noun and/or that carry a named entity tag as core words, and forming a title core word set;

selecting, from the text content, a sentence set containing the core words;

carrying out semantic analysis processing on the sentences including the core words in the sentence set to obtain semantic information of the text blocks at each level; the semantic information comprises part-of-speech tags of the participles in the sentence, named entity tags, analyzed original texts containing the participles and dependency syntactic relations.

Further, the generating a semantic association diagram of the document to be analyzed according to the semantic information at each level includes:

according to semantic information under each level, determining other entity labels in a sentence where the core word is located, dependency participles which have dependency syntactic relation with the core word in the sentence where the core word is located, and other participles in the sentence where the core word is located in the text content under each level;

and taking all the core words of the title core word set as root nodes and performing directed connection according to the appearance sequence in the title, taking the named entity tag, the other entity tags and the dependency participles as primary nodes with edges pointing to the core words, taking other participles which have dependency syntactic relations with the named entity tag and/or the other entity tags and other participles which have dependency syntactic relations with the dependency participles as secondary nodes, and generating the semantic association graph of the document to be analyzed.

Further, the method further comprises:

acquiring a plurality of documents;

carrying out segmentation processing and semantic parsing processing on each document to obtain semantic information of the document; generating a semantic association diagram of the document according to the semantic information of the document;

determining question-answer pair data according to a preset question corresponding to each document; and performing question generation training on the initial model according to the question-answer pair data and the semantic association diagram to generate a question generation model.

In a second aspect, the present application provides a document-based question generation apparatus, comprising:

the first acquisition unit is used for acquiring a document to be analyzed; the document to be analyzed comprises multiple levels of titles, and hierarchical relationships exist among the multiple levels of titles;

the analysis unit is used for carrying out semantic analysis processing on the document to be analyzed according to the hierarchical relationship of the multilevel titles in the document to be analyzed to obtain semantic information under each hierarchy;

the first generation unit is used for generating a semantic association diagram of the document to be analyzed according to the semantic information under each level;

the prediction unit is used for performing problem prediction on the semantic association diagram according to a preset problem generation model to obtain target problem information of the document to be analyzed; the question generation model is obtained by training data according to a plurality of preset question-answer pairs, and answers in the question-answer pair data are documents.

Further, the parsing unit includes:

the segmentation module is used for segmenting the document to be analyzed according to the hierarchical relation of the multilevel titles of the document to be analyzed to obtain a text block under each hierarchy; the text block comprises a title of a hierarchy to which the text block belongs and body content under the title;

and the analysis module is used for carrying out semantic analysis processing on the text block under each level to obtain semantic information of the text block under each level.

Further, the parsing module includes:

the word segmentation submodule is used for segmenting the titles in the text blocks under each level, and determining the part-of-speech tag and the named entity tag of each obtained word segmentation;

the determining submodule is used for determining the participles whose part-of-speech tag is a noun and/or that carry a named entity tag as core words, and forming a title core word set;

the filtering submodule is used for selecting, from the text content, a sentence set containing the core words;

the parsing submodule is used for carrying out semantic parsing processing on the sentences containing the core words in the sentence set to obtain semantic information of the text blocks under each level; the semantic information comprises part-of-speech tags of participles in the sentence, named entity tags, analyzed original texts containing the participles and dependency syntactic relations.

Further, the first generating unit includes:

the determining module is used for determining other entity labels in the sentence where the core word is located, the dependency participles which have dependency syntactic relation with the core word in the sentence where the core word is located and other participles in the sentence where the core word is located in the text content under each level according to the semantic information under each level;

and the generating module is used for taking all the core words of the title core word set as root nodes and performing directed connection according to the appearance sequence in the title, taking the named entity tag, the other entity tags and the dependency participles as primary nodes with edges pointing to the core words, taking other participles which have a dependency syntactic relation with the named entity tag and/or the other entity tags and other participles which have a dependency syntactic relation with the dependency participles as secondary nodes, and generating the semantic association graph of the document to be analyzed.

Further, the apparatus further comprises:

a second acquisition unit configured to acquire a plurality of documents;

the second generation unit is used for carrying out segmentation processing and semantic analysis processing on each document to obtain semantic information of the document; generating a semantic association diagram of the document according to the semantic information of the document;

the training unit is used for determining question-answer pair data according to a preset question corresponding to each document; and performing question generation training on the initial model according to the question-answer pair data and the semantic association diagram to generate a question generation model.

In a third aspect, the present application provides a server, comprising a memory and a processor, wherein the memory stores a computer program operable on the processor, and the processor implements the method of the first aspect when executing the computer program.

In a fourth aspect, the present application provides a computer-readable storage medium having stored therein computer-executable instructions for implementing the method of the first aspect when executed by a processor.

In a fifth aspect, the present application provides a computer program product comprising a computer program which, when executed by a processor, implements the method of the first aspect.

The application provides a problem generation method, device and server based on documents, and the problem generation method, device and server are used for acquiring documents to be analyzed; the document to be analyzed comprises multiple levels of titles, and hierarchical relationships exist among the multiple levels of titles. And according to the hierarchical relation of the multilevel titles in the document to be analyzed, performing semantic analysis processing on the document to be analyzed to obtain semantic information under each hierarchy. And generating a semantic association diagram of the document to be analyzed according to the semantic information under each level. Performing problem prediction on the semantic association diagram according to a preset problem generation model to obtain target problem information of the document to be analyzed; the question generation model is obtained by training data according to a plurality of preset question-answer pairs, and answers in the question-answer pair data are documents. According to the technical scheme, according to the hierarchical relation of the multilevel titles in the document to be analyzed, semantic analysis processing is carried out on the document to be analyzed to obtain semantic information under each hierarchy, and a semantic association diagram of the document to be analyzed is generated according to the semantic information under each hierarchy. The problem generation model is obtained by training data according to a plurality of preset problem-answer pairs, and the target problem information of the document to be analyzed can be obtained by performing problem prediction processing on the semantic association diagram according to the preset problem generation model. 
Therefore, for a longer document, the document can be subjected to semantic analysis processing and predicted to generate target problem information through the hierarchical relation between the multistage titles of the document, so that full-text analysis of the longer text is realized, the semantic accuracy of the target problem information is improved, and the technical problem of poor semantic accuracy of the generated problem is solved.

Drawings

The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and together with the description, serve to explain the principles of the disclosure.

FIG. 1 is a flowchart illustrating a document-based question generation method according to an embodiment of the present application;

FIG. 2 is a flowchart illustrating another document-based question generation method according to an embodiment of the present application;

FIG. 3 is a flowchart illustrating another method for generating a document-based question according to an embodiment of the present application;

FIG. 4 is a flowchart illustrating another method for generating a document-based question according to an embodiment of the present application;

FIG. 5 is a schematic flowchart of training a question generation model provided in an embodiment of the present application;

FIG. 6 is a schematic structural diagram of a document-based question generation apparatus according to an embodiment of the present application;

FIG. 7 is a schematic structural diagram of another document-based question generation apparatus provided in an embodiment of the present application;

FIG. 8 is a schematic structural diagram of a server according to an embodiment of the present application.

Specific embodiments of the present disclosure have been shown by way of the foregoing drawings and will be described in greater detail hereinafter. The drawings and written description are not intended to limit the scope of the disclosed concepts in any way, but rather to illustrate the disclosed concepts to those skilled in the art by reference to specific embodiments.

Detailed Description

Reference will now be made in detail to the exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, like numbers in different drawings represent the same or similar elements unless otherwise indicated. The implementations described in the exemplary embodiments below are not intended to represent all implementations consistent with the present disclosure.

In one example, a question generation model for predicting the questions of a document is typically obtained by training on shorter documents based on formulated question templates, and questions for shorter documents are generated from that model. However, because such a model can only generate questions for short documents, it cannot generate accurate questions for long answers, and the semantic accuracy of the generated questions is poor.

The application provides a problem generation method, a problem generation device and a problem generation server based on a document, and aims to solve the technical problems in the prior art.

The following describes the technical solutions of the present application and how to solve the above technical problems with specific embodiments. The following several specific embodiments may be combined with each other, and details of the same or similar concepts or processes may not be repeated in some embodiments. Embodiments of the present application will be described below with reference to the accompanying drawings.

Fig. 1 is a schematic flowchart of a document-based question generation method provided in an embodiment of the present application, and as shown in fig. 1, the method includes:

Step 101, acquiring a document to be analyzed; the document to be analyzed comprises multiple levels of titles, and hierarchical relationships exist among the multiple levels of titles.

The execution subject of this embodiment may be, for example, a server. The server first obtains a document to be analyzed, which comprises multiple levels of titles with hierarchical relationships among them. Text content exists under each level of title and serves as the answer for which target question information is to be generated; the text content under some titles may be empty, which is not limited here.

Step 102, according to the hierarchical relation of the multilevel titles in the document to be analyzed, performing semantic analysis processing on the document to be analyzed to obtain semantic information under each hierarchy.

Illustratively, according to the hierarchical relationship of the multilevel titles in the document to be analyzed, the server may segment the document to be analyzed to obtain text blocks in each hierarchy, and perform semantic parsing on a plurality of text blocks in the document to be analyzed to obtain semantic information in each hierarchy.
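
As a rough illustration of heading-based segmentation, the following minimal Python sketch splits a document into one text block per title and records the ancestor titles of each block. The input form (a list of `(level, text)` pairs, with level 0 for body text), the `TextBlock` class and the function name are hypothetical and are not taken from the application; blocks with no body content are dropped, as the following paragraph describes.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class TextBlock:
    level: int        # heading level (1 = top-level title)
    title: str        # title T of the block
    path: List[str]   # ancestor titles, outermost first
    body: List[str] = field(default_factory=list)  # text content S


def segment_by_headings(lines: List[Tuple[int, str]]) -> List[TextBlock]:
    """Split a document into text blocks, one per heading.

    `lines` is an assumed pre-parsed form: (level, text), where level 0
    marks body text and level >= 1 marks a heading of that level.
    """
    blocks: List[TextBlock] = []
    stack: List[TextBlock] = []  # currently open headings, outermost first
    for level, text in lines:
        if level == 0:
            if stack:  # body text before any heading is ignored
                stack[-1].body.append(text)
            continue
        # Close headings at the same or a deeper level.
        while stack and stack[-1].level >= level:
            stack.pop()
        block = TextBlock(level, text, [b.title for b in stack])
        stack.append(block)
        blocks.append(block)
    # Drop blocks that have no text content under their title.
    return [b for b in blocks if b.body]


doc = [
    (1, "Account opening"),
    (2, "Required materials"),
    (0, "Applicants must bring a valid ID card."),
    (2, "Procedure"),
    (0, "Fill in the form at the counter."),
    (0, "Wait for verification."),
]
blocks = segment_by_headings(doc)
for b in blocks:
    print(b.path, b.title, len(b.body))
```

Keeping the ancestor path on each block is one possible way to preserve the hierarchical relationship between titles for later steps.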

For example, the document is first input into a document format analysis module. Taking a doc-format document as an example, and taking the hierarchical titles as the segmentation granularity, the document to be analyzed is segmented into n text blocks; if there is no text content between two levels of titles, that text block is filtered out. The text blocks are denoted {D₀, D₁, ..., Dₙ}, where each text block Di comprises the title T of the hierarchy to which it belongs and the text content S under that title, which together serve as the answer for question generation. Rich-text content in the document to be analyzed, such as pictures and tables, is filtered out. Semantic analysis processing is then performed on the title T and the text content S of the text block under each level respectively, to obtain a title core word set comprising a plurality of core words under each level. From the text content, a sentence set containing the core words is selected, and semantic analysis processing is performed on the sentences in that set to obtain the semantic information of the text block, which includes the part-of-speech tags of the participles in the sentences, named entity tags, the parsed original text containing the participles, dependency syntactic relations, and the like.

Step 103, generating a semantic association diagram of the document to be analyzed according to the semantic information under each level.
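
The core word selection and sentence filtering described above might look like the following minimal sketch. It assumes that word segmentation, part-of-speech tagging and named entity recognition have already been done by some external tool; the tag names (`NOUN`, `PROPN`, `ORG`, and `O` for "no entity") and both function names are illustrative assumptions, not part of the application.

```python
from typing import List, Tuple

Token = Tuple[str, str, str]  # (word, part-of-speech tag, named entity tag)


def title_core_words(title_tokens: List[Token]) -> List[str]:
    """Keep title words that are nouns and/or carry a named entity tag."""
    core: List[str] = []
    for word, pos, ner in title_tokens:
        if (pos == "NOUN" or ner != "O") and word not in core:
            core.append(word)  # preserve order of appearance in the title
    return core


def sentences_with_core_words(sentences: List[List[str]],
                              core: List[str]) -> List[List[str]]:
    """Select tokenized body sentences that contain at least one core word."""
    core_set = set(core)
    return [s for s in sentences if core_set.intersection(s)]


title = [("ICBC", "PROPN", "ORG"), ("credit", "NOUN", "O"),
         ("card", "NOUN", "O"), ("activation", "NOUN", "O"),
         ("steps", "NOUN", "O")]
body = [
    ["The", "card", "must", "be", "activated", "within", "30", "days", "."],
    ["Customer", "service", "is", "available", "daily", "."],
    ["Activation", "steps", "are", "listed", "in", "the", "app", "."],
]
core = title_core_words(title)
selected = sentences_with_core_words(body, core)
print(core)
print(len(selected))
```

Matching here is exact and case-sensitive, which is a simplification; a real pipeline would more likely compare normalized forms.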

Illustratively, the core word Ki in the title core word set obtained after the title parsing is used as the root node in the semantic association graph. Carrying out node expansion and screening of the semantic association graph according to the following strategies:

1) Determining the other entity labels that appear in the screened sentences containing the core word Ki;

2) Determining the dependency participles that appear in the screened sentences containing the core word Ki and have a dependency syntactic relation with it;

3) Determining the other participles in the screened sentences in which the core word Ki appears.

Then, after the node elements are extracted from the relevant sentences in the document to be analyzed, the core words are used as root nodes, and edge relations are established in sequence between the extracted node elements and the root nodes. The edges are divided into different levels according to semantic relevance, and the division rules are as follows:

(1) There is a direct edge association between multiple core words, and the direction of the edge is determined by the order in which the core words appear in the title: edges run from back to front, that is, from a later core word to an earlier one. For example, a core word appearing in a primary title comes earlier than a core word appearing in a secondary title, so the edge points from the core word in the secondary title to the core word in the primary title.

(2) The first-level nodes directly connected with the root nodes comprise associated entities of core words Ki and dependency participles which have dependency syntactic relations with the core words, and the edge directions are pointed to the root nodes by the first-level nodes;

(3) The secondary nodes indirectly connected with the root nodes comprise other participles in the sentence with the dependency syntax relation with the related entity of the core word Ki and other participles in the sentence with the dependency syntax relation with the dependency participles, and the edge direction is determined by the syntax dependency relation.

Therefore, the construction of the semantic association graph G between the title and the answer is completed, namely, key semantic information related to the core subject is captured on the whole.
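
The three edge rules above can be sketched as follows. This is only one possible reading, under several assumptions that are not stated in the application: each sentence is supplied as a dictionary with `tokens`, `entities` and `deps` (a list of `(head, dependent)` word pairs), consecutive core words are chained from later to earlier, and secondary edges run from head to dependent.

```python
from typing import Dict, List, Set, Tuple

Edge = Tuple[str, str]  # directed edge (source, target)


def build_graph(core_words: List[str],
                sentences: List[Dict]) -> Tuple[Dict[str, int], Set[Edge]]:
    """Return node levels (0 root, 1 primary, 2 secondary) and directed edges."""
    level: Dict[str, int] = {w: 0 for w in core_words}
    edges: Set[Edge] = set()
    # Rule (1): a later core word points to the one before it in the title.
    for earlier, later in zip(core_words, core_words[1:]):
        edges.add((later, earlier))
    for sent in sentences:
        deps: List[Edge] = sent["deps"]  # assumed (head, dependent) pairs
        for root in core_words:
            if root not in sent["tokens"]:
                continue
            # Rule (2): entities and words in a dependency relation with the
            # core word become primary nodes pointing to the root node.
            neighbours = ({d for h, d in deps if h == root}
                          | {h for h, d in deps if d == root})
            primary = (set(sent["entities"]) | neighbours) - set(core_words)
            for p in primary:
                level.setdefault(p, 1)
                edges.add((p, root))
            # Rule (3): words in a dependency relation with a primary node
            # become secondary nodes; the edge follows head -> dependent.
            for h, d in deps:
                other = d if h in primary else h if d in primary else None
                if other is None or level.get(other, 2) < 2:
                    continue  # not linked to a primary node, or already root/primary
                level[other] = 2
                edges.add((h, d))
    return level, edges


core_words = ["card", "activation"]
sentence = {
    "tokens": ["The", "card", "must", "be", "activated", "at", "ICBC", "branches"],
    "entities": ["ICBC"],
    "deps": [("activated", "card"), ("activated", "branches"),
             ("branches", "ICBC"), ("activated", "must")],
}
level, edges = build_graph(core_words, [sentence])
print(sorted(level.items(), key=lambda kv: kv[1]))
print(sorted(edges))
```

The node levels and edge set returned here could then be converted into the node set V and adjacency matrix A used by a graph encoder.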

Step 104, carrying out problem prediction on the semantic association diagram according to a preset problem generation model to obtain target problem information of the document to be analyzed; the question generation model is obtained by training data according to a plurality of preset question-answer pairs, and answers in the question-answer pair data are documents.

Illustratively, since the question generation model is obtained by training data according to a plurality of preset question-answers, the target question information of the document to be analyzed can be obtained by inputting the semantic association diagram into the preset question generation model and performing question prediction processing on the semantic association diagram according to the preset question generation model.

In the embodiment of the application, a document to be analyzed is obtained; the document to be analyzed comprises multiple levels of titles, and hierarchical relationships exist among the multiple levels of titles. And according to the hierarchical relation of the multilevel titles in the document to be analyzed, performing semantic analysis processing on the document to be analyzed to obtain semantic information under each hierarchy. And generating a semantic association diagram of the document to be analyzed according to the semantic information under each level. Performing problem prediction on the semantic association diagram according to a preset problem generation model to obtain target problem information of the document to be analyzed; the question generation model is obtained by training data according to a plurality of preset question-answer pairs, and answers in the question-answer pair data are documents. According to the technical scheme, according to the hierarchical relation of the multilevel titles in the document to be analyzed, semantic analysis processing is carried out on the document to be analyzed to obtain semantic information under each hierarchy, and a semantic association diagram of the document to be analyzed is generated according to the semantic information under each hierarchy. The problem generation model is obtained by training data according to a plurality of preset problem-answer pairs, and the target problem information of the document to be analyzed can be obtained by performing problem prediction processing on the semantic association diagram according to the preset problem generation model. 
Therefore, for a longer document, the document can be subjected to semantic analysis processing and predicted to generate target problem information through the hierarchical relation between the multistage titles of the document, so that full-text analysis of the longer text is realized, the semantic accuracy of the target problem information is improved, and the technical problem of poor semantic accuracy of the generated problem is solved.

Fig. 2 is a flowchart of another document-based question generation method provided in an embodiment of the present application, and as shown in fig. 2, the method includes:

Step 201, obtaining a plurality of documents.

Illustratively, the server may acquire a plurality of documents in advance.

Step 202, carrying out segmentation processing and semantic parsing processing on each document to obtain semantic information of the document; and generating a semantic association diagram of the document according to the semantic information of the document.

For example, the server may segment each document to obtain a plurality of text blocks in each document, perform semantic analysis processing on the plurality of text blocks in each document to obtain semantic information of the document, and generate a semantic association diagram of the document according to that semantic information. For the specific segmentation and semantic parsing, refer to step 102 in Fig. 1; for the generation of the semantic association diagram, refer to step 103 in Fig. 1. Details are not described again.

Step 203, determining question-answer pair data according to a preset question corresponding to each document; and performing question generation training on the initial model according to the question-answer pair data and the semantic association diagram to generate a question generation model.

Illustratively, for the semantic association diagram of each processed document, questions are manually designed based on the titles to obtain the preset questions serving as training data; that is, a preset question corresponding to each text block in the document is determined, and question-answer pair data is determined according to these preset questions. Meanwhile, according to statistics, the titles of general descriptive documents are concise and have a clear target, and the corresponding question types mainly fall into two categories: what and how. Thus, a question type label is added to each designed question, with 0 representing the what type and 1 representing the how type.

The question generation model is mainly used to accurately generate a description of the question corresponding to an answer, ensuring that the generated question can be answered by the answer while having a certain diversity and question value. Therefore, to ensure accuracy, the model mainly adopts a Seq2Seq (sequence-to-sequence) framework to encode the semantic association graph established from titles and answers, and performs multi-task learning by combining the two tasks of question type prediction and question generation. The training process is as follows:
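
A labeled training record of this kind could be represented as in the short sketch below. The class name, field names and sample texts are hypothetical; only the 0/1 encoding of the what and how types comes from the description.

```python
from dataclasses import dataclass

WHAT, HOW = 0, 1  # question type labels: 0 = what, 1 = how


@dataclass
class QAPair:
    question: str  # manually designed preset question
    answer: str    # title and text content of the text block
    qtype: int     # WHAT or HOW

    def __post_init__(self) -> None:
        if self.qtype not in (WHAT, HOW):
            raise ValueError("qtype must be 0 (what) or 1 (how)")


pairs = [
    QAPair("What materials are required to open an account?",
           "Required materials. Applicants must bring a valid ID card.", WHAT),
    QAPair("How do I open an account?",
           "Procedure. Fill in the form at the counter.", HOW),
]
print([p.qtype for p in pairs])
```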

1) Training phase

(1) Encoder

The encoder adopts a Gated Graph Sequence Neural Network (GGNN) to perform fusion encoding of the semantic association graph, where the number of nodes in the graph is n. The GGNN may be replaced by a Graph Attention Network (GAT), a Graph Convolutional Network (GCN), or the like, which is not limited here.

Firstly, the input of the model is the constructed semantic association graph G = {V, A}, where V represents the node set of the semantic association graph and A represents its edge-relation adjacency matrix. Each node is given a d-dimensional initial vector representation V_n by means of GloVe static word vectors. A GloVe static word vector is a text semantic encoding obtained by training on a large amount of text data, and it can be replaced by vector representations obtained from similar text semantic encoding models, such as word2vec or BERT, without limitation.

Then, the node features V_n are input into the GGNN network, information fusion encoding is performed between adjacent nodes according to the adjacency matrix, and the node features output by the network are V_n'.

Finally, the updated node features V_n' are input into a fully connected layer, and graph-level feature extraction and fusion is performed by max pooling, converting the n node feature vectors into one graph encoding vector V_g. V_g represents the core semantic information of the whole semantic association graph and serves as the output of the encoder.
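
To make the data flow from node features to a single graph vector concrete, here is a deliberately simplified pure-Python sketch. It is not a GGNN: the gated, GRU-style node update is replaced by a plain sum of incoming neighbor features followed by tanh, the fully connected layer is omitted, and the convention that `adj[j][i] == 1` means an edge from node j to node i is an assumption. Only the max pooling readout corresponds directly to the description.

```python
import math
from typing import List

Vector = List[float]


def propagate(features: List[Vector], adj: List[List[int]]) -> List[Vector]:
    """One simplified message-passing step.

    Each node adds the summed features of the nodes pointing to it and
    applies tanh. A GGNN would use a gated (GRU-style) update here instead.
    """
    n, d = len(features), len(features[0])
    out: List[Vector] = []
    for i in range(n):
        msg = [sum(adj[j][i] * features[j][k] for j in range(n))
               for k in range(d)]
        out.append([math.tanh(features[i][k] + msg[k]) for k in range(d)])
    return out


def max_pool(features: List[Vector]) -> Vector:
    """Graph-level readout: element-wise maximum over all node vectors."""
    return [max(col) for col in zip(*features)]


# Three nodes with 2-dimensional initial features (stand-ins for word vectors).
V = [[0.0, 1.0], [1.0, 0.0], [0.5, 0.5]]
# Edges 1 -> 0 and 2 -> 0.
A = [[0, 0, 0],
     [1, 0, 0],
     [1, 0, 0]]
V_updated = propagate(V, A)
V_g = max_pool(V_updated)
print(V_g)
```

Max pooling yields a fixed-size vector regardless of the number of nodes n, which is what allows graphs of different sizes to feed the same decoder.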

(2) Decoder

Two tasks are involved in the decoder, one is to perform problem generation prediction through Long Short-Term Memory network (LSTM) sequence decoding, and the other is to perform problem type prediction through problem generation prediction results. The LSTM may also be replaced by other types of time-series Recurrent Neural Network algorithms such as Recurrent Neural Network (RNN) and Recurrent Neural Network (GRU), which is not limited herein.

Task of problem generation, output V of encoder _g And initial decoded character of preset question<s>As input at time 0 of the decoder, and performs prediction of the first word in the preset question. The training process adopts a teacher training mode, so that at the second moment, V is set _g And the first word of the preset problem is used as the input of the second moment, the prediction of the second word in the problem is carried out, and the like. The maximum decoding length of the problem generation is set to be 20, and the problem generation prediction is automatically finished after 20 rounds of iteration.
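
The way decoder inputs and targets line up under teacher forcing can be sketched without any model, as below. The function name is hypothetical, the graph vector V_g that accompanies every step is left out, and truncating longer questions to 20 tokens is an assumed way of applying the maximum decoding length.

```python
from typing import List, Tuple

START, MAX_LEN = "<s>", 20  # start token and maximum decoding length


def teacher_forcing_steps(question: List[str]) -> List[Tuple[str, str]]:
    """Return (decoder input token, target token) for each time step.

    At step t the decoder receives the gold token from step t-1 (the start
    token at step 0) rather than its own prediction, and must predict the
    gold token at position t.
    """
    targets = question[:MAX_LEN]
    inputs = [START] + targets[:-1]
    return list(zip(inputs, targets))


steps = teacher_forcing_steps(["how", "do", "I", "activate", "the", "card", "?"])
for t, (inp, tgt) in enumerate(steps):
    print(t, inp, "->", tgt)
```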

In the question type prediction task, the decoding vector of the LSTM network at each time step is copied and stored while the question generation task is performed. After the 20 rounds of question generation prediction are completed, the 20 decoding vectors are classified through a fully connected layer and a normalized exponential function (softmax) to predict the question type.
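The final classification step can be sketched as below. The sketch starts from the scores (logits) that the fully connected layer would produce; how the 20 decoding vectors are combined before that layer is not specified in the text and is left out. The label set and the function names are hypothetical.

```python
import math


def softmax(logits):
    """Normalized exponential function, shifted by the maximum for numerical stability."""
    shift = max(logits)
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict_type(logits, labels):
    """Return the most probable question type and its probability."""
    probs = softmax(logits)
    best = max(range(len(labels)), key=lambda i: probs[i])
    return labels[best], probs[best]


# Hypothetical question types and fully-connected-layer outputs.
labels = ["what", "how", "why"]
label, prob = predict_type([0.2, 1.5, -0.3], labels)
print(label, round(prob, 3))
```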

(3) Multitask joint learning

During joint learning of the two tasks of the decoding phase, question generation and question type prediction, a cross entropy loss Lqgen and a binary cross entropy loss Lqtype are adopted as their respective loss functions. The two loss functions are combined in a weighted sum by setting a weight alpha, the result is used as the loss function of the whole network for model parameter optimization training, and the question generation model is finally generated.
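A minimal sketch of the joint loss follows. The text only states that the two losses are combined in a weighted sum with weight alpha; the convex form alpha * Lqgen + (1 - alpha) * Lqtype used here is an assumption, as are the function names.

```python
import math


def cross_entropy(probs, target_index):
    """Lqgen for one decoding step: negative log-probability of the correct word."""
    return -math.log(probs[target_index])


def binary_cross_entropy(prob, label):
    """Lqtype for one binary type decision (label is 0 or 1)."""
    return -(label * math.log(prob) + (1 - label) * math.log(1 - prob))


def joint_loss(l_qgen, l_qtype, alpha):
    """Weighted sum of the two task losses (assumed convex combination)."""
    return alpha * l_qgen + (1 - alpha) * l_qtype


l_qgen = cross_entropy([0.1, 0.7, 0.2], target_index=1)
l_qtype = binary_cross_entropy(0.9, label=1)
print(joint_loss(l_qgen, l_qtype, alpha=0.75))
```

With this form, alpha close to 1 makes training focus on generating the question text, while a smaller alpha gives more influence to the question type signal.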

Therefore, a semantic association graph at the word/entity level is constructed between the title and the long answer content through a heuristic method, the core semantic range to be asked about is extracted, and question type prediction is performed based on part-of-speech statistics of the graph nodes. The question is generated by inputting the joint semantic encoding and question type encoding, produced by a joint variational autoencoder based on a graph convolutional network, into an LSTM sequence decoder. The advantages are as follows:

1. The semantic association graph of the title and the answer is constructed based on heuristic rules, so that the core semantic information of the long answer can be extracted and the influence of redundant information on question generation is reduced.

2. Determining the question type helps constrain the angle of question generation, avoids generating wrong questions that the answer cannot answer, and improves the controllability of generation.

Therefore, semantic encoding based on the graph neural network indirectly integrates the dependency syntax information of the title-related sentences in the answer, helps the model learn syntactic formats, enriches the encoding information, and benefits the accuracy of question generation.

Step 204, obtaining a document to be analyzed; the document to be analyzed comprises multiple levels of titles, and hierarchical relationships exist among the multiple levels of titles.

For example, this step may refer to step 101 in fig. 1, and is not described again.

Step 205, segmenting the document to be analyzed according to the hierarchical relationship of the multilevel titles of the document to be analyzed to obtain a text block under each hierarchy; the text block comprises a title of a hierarchy to which the text block belongs and body content under the title.

Illustratively, the server may segment the document to be analyzed into n text blocks using the hierarchical headings as the segmentation granularity, and filter out a text block if there is no body content between two hierarchical headings. The text blocks are denoted {D_0, D_1, ..., D_n}; a text block D_i comprises the title T of the hierarchy to which the text block belongs and the body content S under that title, and the text block D_i serves as the answer for question generation.
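The segmentation step can be sketched as below. The sketch assumes headings are marked with leading `#` characters, which is purely illustrative; the text does not state the document format, and a real implementation would need a format-specific heading detector. `TextBlock` and `split_by_headings` are hypothetical names.

```python
import re
from dataclasses import dataclass


@dataclass
class TextBlock:
    level: int  # heading depth (1 = top level)
    title: str  # title T of the block
    body: str   # body content S under that title


# Assumption: headings look like "# Title" or "## Title".
HEADING = re.compile(r"^(#+)\s+(.*)$")


def split_by_headings(text):
    """Split a document into one block per heading and drop blocks with no body."""
    raw = []  # (level, title, body lines) in document order
    for line in text.splitlines():
        match = HEADING.match(line)
        if match:
            raw.append((len(match.group(1)), match.group(2).strip(), []))
        elif raw:
            raw[-1][2].append(line)  # text before the first heading is ignored

    blocks = []
    for level, title, lines in raw:
        body = "\n".join(lines).strip()
        if body:  # filter out headings with nothing between them and the next heading
            blocks.append(TextBlock(level, title, body))
    return blocks


doc = """# Account opening
## Required documents
Bring a valid ID card and a bank card.
## Fees
# Transfers
Transfers within the bank are free of charge.
"""
for block in split_by_headings(doc):
    print(block.level, block.title, "->", block.body)
```

In this example the headings "Account opening" and "Fees" are dropped because no body content follows them before the next heading.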

Step 206, performing semantic parsing processing on the text block at each level to obtain semantic information of the text block at each level.

In one example, the titles in the text blocks at each level are segmented into words, and a part-of-speech tag and a named entity tag are determined for each resulting word; words whose part-of-speech tag is a noun and/or that carry a named entity tag are determined to be core words, forming a title core word set; the set of sentences containing core words is screened out of the body content; semantic parsing is performed on the sentences containing core words in the sentence set to obtain the semantic information of the text block at each level; the semantic information comprises the part-of-speech tags and named entity tags of the words in the sentence, the segmented original text containing the words, and the dependency syntactic relations.

Illustratively, the server mainly uses the StanfordCoreNLP tool to perform lexical and syntactic parsing on the title T and the chapter content S in the text block D. StanfordCoreNLP is a natural language lexical, syntactic and named entity parsing tool, and may be replaced by other tools of the same kind, which is not limited herein. Title parsing segments T into words, obtains the part-of-speech tag and/or named entity tag of each word in the title, and determines the words whose part-of-speech tag is a noun and/or that carry a named entity tag to be core words, forming the title core word set K. The part-of-speech tag may be, for example, a noun or a verb, and the named entity tag may be, for example, a place name or the type to which the place name belongs, without limitation. Answer parsing first filters the answer content S and extracts the sentence set Sk directly related to the title core words, that is, the sentences in which core words of the title core word set K appear, and then performs part-of-speech, named entity and dependency syntax parsing on them. Each sentence is parsed into a feature combination comprising the segmented original text, part-of-speech tags, named entity tags and dependency syntactic relation tags; this feature combination is the semantic information.
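The selection of core words and the filtering of sentences can be sketched as follows. The sketch does not call StanfordCoreNLP; it assumes the title has already been parsed into (word, part-of-speech tag, named entity tag) triples, that noun tags start with "NN", and that "O" marks a word without an entity tag. These conventions and both function names are assumptions for illustration.

```python
def title_core_words(title_tokens):
    """Return the title core word set K, in order of appearance in the title."""
    core = []
    for word, pos, ner in title_tokens:
        is_noun = pos.startswith("NN")  # assumed noun tag convention
        is_entity = ner != "O"          # assumed "no entity" marker
        if (is_noun or is_entity) and word not in core:
            core.append(word)
    return core


def sentences_with_core_words(sentences, core_words):
    """Keep only the sentences (lists of words) that contain at least one core word."""
    wanted = set(core_words)
    return [sentence for sentence in sentences if wanted.intersection(sentence)]


# Hypothetical parser output for the title "Opening an account in Beijing".
title = [("Opening", "VBG", "O"), ("an", "DT", "O"), ("account", "NN", "O"),
         ("in", "IN", "O"), ("Beijing", "NNP", "LOCATION")]
body = [["Bring", "an", "ID", "to", "open", "an", "account"],
        ["Branches", "close", "at", "five"],
        ["Beijing", "branches", "open", "on", "Saturday"]]

K = title_core_words(title)
Sk = sentences_with_core_words(body, K)
print(K)
print(Sk)
```

Only the sentences kept in Sk would then be passed on to part-of-speech, named entity and dependency parsing, which keeps the later graph focused on text related to the title.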

Step 207, according to the semantic information at each level, determining, in the body content at each level, the other entity tags in the sentence where a core word is located, the dependent words in that sentence that have a dependency syntactic relation with the core word, and the other words in that sentence.

Illustratively, the server may use each core word Ki in the title core word set obtained from title parsing as a root node in the semantic association graph. Node expansion and screening of the semantic association graph are then carried out according to the following strategies:

Step 208, taking all the core words of the title core word set as root nodes and connecting them by directed edges in their order of appearance in the title; taking the named entity tags, the other entity tags and the dependent words as primary nodes whose edges point to the core words; taking the other words that have a dependency syntactic relation with the named entity tags and/or the other entity tags, and the other words that have a dependency syntactic relation with the dependent words, as secondary nodes; and generating the semantic association graph of the document to be analyzed.
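The node levels described in this step can be sketched as below. The input layout (a list of sentence dictionaries with `tokens`, `entities` and `deps`) and the function name are hypothetical, and pointing the edges of secondary nodes toward their primary node is an assumption; the text only states the edge direction for primary nodes.

```python
def build_semantic_graph(core_words, sentences):
    """Return (node level per word, directed edges) of a semantic association graph.

    Level 0 = root (core word), 1 = primary node, 2 = secondary node.
    """
    level = {word: 0 for word in core_words}
    edges = []

    def add_edge(src, dst):
        if (src, dst) not in edges:
            edges.append((src, dst))

    # Root nodes are chained in their order of appearance in the title.
    for left, right in zip(core_words, core_words[1:]):
        add_edge(left, right)

    for sent in sentences:
        # Undirected view of the dependency pairs, for neighbour lookup.
        neighbours = {}
        for head, dep in sent["deps"]:
            neighbours.setdefault(head, []).append(dep)
            neighbours.setdefault(dep, []).append(head)

        primary = []
        for core in core_words:
            if core not in sent["tokens"]:
                continue
            # Entities in the sentence and words directly dependent on the core word.
            for word in sent["entities"] + neighbours.get(core, []):
                if level.setdefault(word, 1) == 1:
                    add_edge(word, core)  # primary edges point to the core word
                    if word not in primary:
                        primary.append(word)

        for node in primary:
            for word in neighbours.get(node, []):
                if level.setdefault(word, 2) == 2:
                    add_edge(word, node)  # assumed direction for secondary nodes
    return level, edges


sentence = {"tokens": ["ICBC", "opens", "account", "quickly"],
            "entities": ["ICBC"],
            "deps": [("opens", "ICBC"), ("opens", "account"), ("opens", "quickly")]}
levels, graph_edges = build_semantic_graph(["account", "Beijing"], [sentence])
print(levels)
print(graph_edges)
```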

For example, after extracting the node elements from the relevant sentences in the document to be analyzed, the server may take the core words as root nodes and establish edge relationships with the root nodes in sequence, where the connection of the edges is divided into different levels according to semantic relevance, and the division rule is as follows:

Step 209, performing question prediction on the semantic association graph according to a preset question generation model to obtain the core semantic information of the semantic association graph, and determining the core semantic information as the target question information of the document to be analyzed; the question generation model is obtained by training on a plurality of preset question-answer pair data, and the answers in the question-answer pair data are documents.

Illustratively, the question generation model is trained on a plurality of documents and the standard question information corresponding to those documents. In the prediction stage, the semantic association graph is therefore fed to the preset question generation model as the encoder input, and in the decoding stage only the question generation module is used to generate the question, so that the core semantic information of the semantic association graph is obtained and determined as the target question information of the document to be analyzed.
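The prediction-stage decoding loop can be sketched as follows. In contrast to training, each step receives the word predicted at the previous step. The trained decoder is represented by a hypothetical callable `next_token`; stopping early on an end marker `</s>` is an assumption, since the text only states that decoding finishes after 20 iterations.

```python
def generate_question(next_token, graph_vector, max_len=20, start="<s>", end="</s>"):
    """Decode a question word by word, feeding each prediction back as the next input."""
    tokens, previous = [], start
    for _ in range(max_len):
        previous = next_token(graph_vector, previous)
        if previous == end:  # assumed end marker, not stated in the text
            break
        tokens.append(previous)
    return tokens


# Toy stand-in for a trained decoder: a fixed lookup that ignores the graph vector.
TABLE = {"<s>": "how", "how": "to", "to": "apply", "apply": "</s>"}


def toy_decoder(graph_vector, previous):
    return TABLE[previous]


question = generate_question(toy_decoder, graph_vector=[0.0])
print(" ".join(question))
```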

In the embodiment of the application, a plurality of documents are acquired. Segmentation processing and semantic parsing processing are performed on each document to obtain the semantic information of the document, and a semantic association graph of the document is generated according to that semantic information. Question-answer pair data is determined according to the preset question corresponding to each document, and question generation training is performed on the initial model according to the question-answer pair data and the semantic association graph to generate the question generation model. A document to be analyzed is acquired; the document to be analyzed comprises multiple levels of titles, and hierarchical relationships exist among the multiple levels of titles. The document to be analyzed is segmented according to the hierarchical relationship of its multilevel titles to obtain a text block at each level; the text block comprises the title of the level to which the text block belongs and the body content under the title. Semantic parsing processing is performed on the text block at each level to obtain the semantic information of the text block at each level. According to the semantic information at each level, the other entity tags in the sentence where a core word is located, the dependent words in that sentence that have a dependency syntactic relation with the core word, and the other words in that sentence are determined in the body content at each level. All the core words of the title core word set are taken as root nodes and connected by directed edges in their order of appearance in the title; the named entity tags, the other entity tags and the dependent words are taken as primary nodes whose edges point to the core words; the other words that have a dependency syntactic relation with the named entity tags and/or the other entity tags, and the other words that have a dependency syntactic relation with the dependent words, are taken as secondary nodes; and the semantic association graph of the document to be analyzed is generated. Question prediction is performed on the semantic association graph according to the preset question generation model to obtain the core semantic information of the semantic association graph, which is determined as the target question information of the document to be analyzed; the question generation model is obtained by training on a plurality of preset question-answer pair data, and the answers in the question-answer pair data are documents. Therefore, for a longer document, semantic parsing and target question prediction can follow the hierarchical relationship between the multilevel titles of the document, so that full-text analysis of the longer text is realized, the semantic accuracy of the target question information is improved, and the technical problem of poor semantic accuracy of generated questions is solved.

Exemplarily, fig. 3 is a schematic flowchart of another document-based question generation method provided in an embodiment of the present application. As shown in fig. 3, the flow includes performing document format parsing on a document D to obtain the text block set {D_1, D_2, ..., D_n} of the document D, where D_i = {T_i, S_i}, T_i represents the title of the i-th text block, and S_i represents the body content of the i-th text block. Then, semantic parsing (namely lexical and syntactic parsing and the like) is performed on the title T and the chapter content S in each text block D using the StanfordCoreNLP tool to obtain the nouns/entities in T_i and the entities/dependency syntactic relations in S_i.

Exemplarily, fig. 4 is a schematic flowchart of another document-based question generation method provided in an embodiment of the present application. As shown in fig. 4, the flow includes constructing the semantic association graph G according to the nouns/entities in T_i and the entities/dependency syntactic relations in S_i, based on heuristic graph construction rules.

Fig. 5 is another schematic flowchart of training a question generation model provided by an embodiment of the present application. As shown in fig. 5, the flow includes performing word vector conversion on the semantic association graph G, performing fusion encoding on the semantic association graph through a gated graph sequence neural network (GGNN), outputting the graph encoding vector through a fully connected layer and max pooling, sequence decoding, question type prediction, the cross entropy loss Lqgen, and the binary cross entropy loss Lqtype.

Fig. 6 is a schematic structural diagram of a document-based question generation apparatus provided in an embodiment of the present application, and as shown in fig. 6, the apparatus includes:

a first acquisition unit 31 for acquiring a document to be analyzed; the document to be analyzed comprises multiple levels of titles, and hierarchical relationships exist among the multiple levels of titles.

The parsing unit 32 is configured to perform semantic parsing processing on the document to be analyzed according to the hierarchical relationship of the multilevel titles in the document to be analyzed, so as to obtain semantic information at each level.

The first generating unit 33 is configured to generate a semantic association diagram of the document to be analyzed according to the semantic information at each level.

The prediction unit 34 is configured to perform question prediction on the semantic association graph according to a preset question generation model to obtain target question information of the document to be analyzed; the question generation model is obtained by training on a plurality of preset question-answer pair data, and the answers in the question-answer pair data are documents.

The apparatus of this embodiment may execute the technical solution in the method, and the specific implementation process and technical principle are the same, which are not described herein again.

Fig. 7 is a schematic structural diagram of another document-based question generation apparatus provided in an embodiment of the present application, and based on the embodiment shown in fig. 6, as shown in fig. 7, the parsing unit 32 includes:

the segmenting module 321 is configured to segment the document to be analyzed according to the hierarchical relationship of the multilevel titles of the document to be analyzed, so as to obtain a text block in each hierarchy; the text block comprises a title of a hierarchy to which the text block belongs and body content under the title.

The parsing module 322 is configured to perform semantic parsing on the text block at each level to obtain semantic information of the text block at each level.

In one example, parsing module 322 includes:

the word segmentation sub-module 3221 is configured to segment the titles in the text blocks at each level into words, and determine a part-of-speech tag and a named entity tag for each resulting word;

the determining sub-module 3222 is configured to determine the words whose part-of-speech tag is a noun and/or that carry a named entity tag to be core words, and form a title core word set.

The filtering sub-module 3223 is configured to screen out of the body content a set of sentences containing the core words.

The parsing sub-module 3224 is configured to perform semantic parsing on the sentences containing the core words in the sentence set to obtain semantic information of the text block at each level; the semantic information comprises the part-of-speech tags and named entity tags of the words in the sentence, the segmented original text containing the words, and the dependency syntactic relations.

In one example, the first generating unit 33 includes:

the determining module 331 is configured to determine, according to the semantic information at each level, in the body content at each level, the other entity tags in the sentence where a core word is located, the dependent words in that sentence that have a dependency syntactic relation with the core word, and the other words in that sentence.

The generating module 332 is configured to take all the core words of the title core word set as root nodes and connect them by directed edges in their order of appearance in the title; take the named entity tags, the other entity tags and the dependent words as primary nodes whose edges point to the core words; take the other words that have a dependency syntactic relation with the named entity tags and/or the other entity tags, and the other words that have a dependency syntactic relation with the dependent words, as secondary nodes; and generate the semantic association graph of the document to be analyzed.

In one example, the apparatus further comprises:

a second acquiring unit 41 for acquiring a plurality of documents.

The second generating unit 42 is used for performing segmentation processing and semantic parsing processing on each document to obtain semantic information of the document; and generating a semantic association diagram of the document according to the semantic information of the document.

A training unit 43 for determining question-answer pair data according to a preset question corresponding to each document; and performing question generation training on the initial model according to the question-answer pair data and the semantic association diagram to generate a question generation model.

The apparatus of this embodiment may execute the technical solution in the method, and the specific implementation process and the technical principle are the same, which are not described herein again.

Fig. 8 is a schematic structural diagram of a server according to an embodiment of the present application, and as shown in fig. 8, the server includes: memory 51, processor 52.

The memory 51 has stored therein a computer program that is executable on the processor 52.

The processor 52 is configured to perform the methods provided in the embodiments described above.

The server also comprises a receiver 53 and a transmitter 54. The receiver 53 is used for receiving commands and data transmitted from an external device, and the transmitter 54 is used for transmitting commands and data to an external device.

Embodiments of the present application further provide a non-transitory computer-readable storage medium, where instructions of the storage medium, when executed by a processor of a server, enable the server to perform the method provided by the foregoing embodiments.

An embodiment of the present application further provides a computer program product, where the computer program product includes: a computer program, the computer program being stored in a readable storage medium, from which the computer program can be read by at least one processor of the server, execution of the computer program by the at least one processor causing the server to carry out the solution provided by any of the embodiments described above.

Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure disclosed herein. This application is intended to cover any variations, uses, or adaptations of the disclosure following, in general, the principles of the disclosure and including such departures from the present disclosure as come within known or customary practice within the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims.

It will be understood that the present disclosure is not limited to the precise arrangements that have been described above and shown in the drawings, and that various modifications and changes may be made without departing from the scope thereof. The scope of the present disclosure is limited only by the appended claims.

Claims

1. A method for generating a question based on a document, comprising:

2. The method according to claim 1, wherein performing semantic parsing processing on the document to be analyzed according to the hierarchical relationship of the multilevel titles in the document to be analyzed to obtain semantic information at each level comprises:

3. The method according to claim 2, wherein performing semantic parsing processing on the text block at each level to obtain semantic information of the text block at each level comprises:

segmenting the titles in the text blocks at each level into words, and determining a part-of-speech tag and a named entity tag for each resulting word;

screening out of the body content a set of sentences containing the core words;

performing semantic parsing processing on the sentences containing the core words in the sentence set to obtain semantic information of the text block at each level; the semantic information comprises the part-of-speech tags and named entity tags of the words in the sentence, the segmented original text containing the words, and the dependency syntactic relations.

4. The method according to claim 3, wherein generating the semantic association graph of the document to be analyzed according to the semantic information at each level comprises:

5. The method according to any one of claims 1-4, further comprising:

acquiring a plurality of documents;

carrying out segmentation processing and semantic analysis processing on each document to obtain semantic information of the document; generating a semantic association diagram of the document according to the semantic information of the document;

6. A document-based question generation apparatus, comprising:

7. The apparatus of claim 6, wherein the parsing unit comprises:

8. The apparatus of claim 7, wherein the parsing module comprises:

9. The apparatus of claim 8, wherein the first generating unit comprises:

the determining module is configured to determine, according to the semantic information at each level, in the body content at each level, the other entity tags in the sentence where a core word is located, the dependent words in that sentence that have a dependency syntactic relation with the core word, and the other words in that sentence;

and the generating module is configured to take all the core words of the title core word set as root nodes and connect them by directed edges in their order of appearance in the title; take the named entity tags, the other entity tags and the dependent words as primary nodes whose edges point to the core words; take the other words that have a dependency syntactic relation with the named entity tags and/or the other entity tags, and the other words that have a dependency syntactic relation with the dependent words, as secondary nodes; and generate the semantic association graph of the document to be analyzed.

10. The apparatus according to any one of claims 6-9, further comprising:

a second acquisition unit configured to acquire a plurality of documents;

11. A server, characterized by comprising a memory, a processor, a computer program being stored in the memory and being executable on the processor, the processor implementing the method of any of the preceding claims 1-5 when executing the computer program.

12. A computer-readable storage medium having computer-executable instructions stored thereon, which when executed by a processor, perform the method of any one of claims 1-5.

13. A computer program product, characterized in that it comprises a computer program which, when being executed by a processor, carries out the method of any one of claims 1 to 5.