CN113705208B - Automatic Chinese question generation method and device based on domain terms and key sentences


Info

Publication number: CN113705208B
Application number: CN202111019721.3A
Authority: CN (China)
Prior art keywords: sentences, terms, domain, word, domain terms
Legal status: Active (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis)
Other languages: Chinese (zh)
Other versions: CN113705208A
Inventors: 赵军, 董勤伟, 查显光, 吴俊, 赵新冬, 戴威, 于聪聪
Current Assignee (the listed assignees may be inaccurate; Google has not performed a legal analysis): State Grid Jiangsu Electric Power Co Ltd; Electric Power Research Institute of State Grid Jiangsu Electric Power Co Ltd
Original Assignee: State Grid Jiangsu Electric Power Co Ltd; Electric Power Research Institute of State Grid Jiangsu Electric Power Co Ltd
Application filed by State Grid Jiangsu Electric Power Co Ltd and Electric Power Research Institute of State Grid Jiangsu Electric Power Co Ltd
Priority: CN202111019721.3A
Published as CN113705208A; granted and published as CN113705208B
Legal status: Active


Classifications

    • G06F40/20 Natural language analysis (under G06F40/00 Handling natural language data; G06F Electric digital data processing; G06 Computing; G Physics)
    • G06F40/253 Grammatical analysis; Style critique
    • G06F40/216 Parsing using statistical methods
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking


Abstract

The invention discloses a Chinese question generation method and device based on domain terms and key sentences. The method establishes a dependency syntax structure for the sentences of an input document, generates candidate domain terms according to dependency syntax rules, evaluates and ranks the candidates, and extracts a specified number of domain terms from the ranking result. It then represents each sentence by the TF-IDF values of its words, computes sentence importance with a T-TextRank algorithm, and extracts a specified number of key sentences from the importance ranking. Finally, it generates Chinese multiple-choice question stems, fill-in-the-blank question stems, and question-and-answer stems from the extracted domain terms and key sentences. The extracted domain terms and key sentences can greatly improve the importance of the generated questions, and the method has broad application prospects.

Description

Automatic Chinese question generation method and device based on domain terms and key sentences
Technical Field
The invention belongs to the technical field of information extraction, and particularly relates to an automatic Chinese question generation method based on domain terms and key sentences.
Background
In recent years, knowledge assessment and performance assessment have become very important to educational institutions and enterprises, and assessment in the form of question questionnaires is an effective strategy. However, writing questions manually requires substantial manpower and time, so research on automatic question generation aims to change this situation. Automatic question generation technology uses information technology to screen and extract important knowledge from the information in a document and to generate questions automatically, replacing the traditional mode of manually writing questions for a test question library.
Existing solutions focus only on sentence-pattern templates and grammatical dependency trees to generate questions, not on the domain terms contained in sentences. Examples include the automatic generation of Chinese factual questions using grammar rule templates proposed by Liu et al. in 2016 (LIU M, RUS V, LIU L. Automatic Chinese factual question generation [J]. IEEE Transactions on Learning Technologies, 2016, 10(2): 1-1.) and the question generation method of Khullar et al. in 2018, which uses relative pronouns and adverbs (KHULLAR P, RACHNA K, HASE M, et al. Automatic question generation using relative pronouns and adverbs [C] // Proceedings of ACL 2018, Student Research Workshop. 2018: 153-158.).
However, these automatic question generation methods all use question templates that select nouns in sentences to pose questions. Unlike English, Chinese has no natural separators between words, so word segmentation errors often occur in documents from specific domains, which degrades question generation. Traditional question generation splits domain vocabulary apart during segmentation (for example, terms from a human-resources policy), which both lowers the quality of the generated questions and leaves domain knowledge points insufficiently checked, even though questions about knowledge in these domains are the most valuable for staff assessment or student learning.
Disclosure of Invention
The invention aims to provide an automatic Chinese question generation method and device based on domain terms and key sentences; the extracted domain terms and key sentences can greatly improve the importance of the generated questions.
In order to achieve the above purpose, the technical scheme adopted by the invention is as follows:
the invention provides a Chinese question generation method based on domain terms and key sentences, which comprises the following steps:
extracting domain terms and key sentences in the document based on dependency syntax analysis;
Generating multi-type topics based on the extracted domain terms and key sentences;
wherein extracting domain terms in the document based on the dependency syntax analysis includes:
establishing a dependency syntax structure for sentences in an input document, and generating candidate domain terms according to the dependency syntax rules;
evaluating and ranking the generated candidate domain terms;
extracting a specified number of domain terms based on the ranking result;
Extracting key sentences in the document based on the dependency syntax analysis includes:
calculating TF-IDF values of words in the input document;
Calculating the similarity between sentences in the document based on the TF-IDF value;
Calculating importance of sentences based on the similarity between sentences and sorting;
and extracting a specified number of key sentences based on the importance ranking result of the sentences.
Further, the dependency syntax structure is built with any one of the following:
the Stanford dependency parser, the neural-network-based dependency parser in the Hanlp toolkit, or the beam-search dependency parser based on the ArcEager transition system.
Further, the dependency syntax rule is:
(dep)?+(amod|nn)*+(nsubj|dobj);
wherein ? indicates zero or one occurrence, * indicates zero or more occurrences, dep denotes a dependency relation, amod an adjectival modifier, nn a noun compound modifier, nsubj a nominal subject, and dobj a direct object.
Further, evaluating and sorting the generated candidate domain terms includes:
Filtering non-terms from the candidate domain terms based on part-of-speech filtering rules; a candidate domain term is deleted when its part of speech satisfies any one of the following rules:
a. the word ends with a numeral, preposition, conjunction, or locative word;
b. the word is not a noun;
c. the word contains separators or symbols;
Filtering the candidate domain terms based on grammar filtering rules; a candidate domain term is retained when it satisfies any one of the following rules:
d. noun + noun;
e. (adjective or noun) + noun;
calculating the score of the filtered candidate domain terms:
where s represents the score, f_word is the frequency of the current candidate domain term, f(i) is the product of the number of candidate domain terms with frequency i and the frequency i, C_1 is the number of candidate domain terms with frequency 1, C is the total number of extracted candidate domain terms, n is the maximum frequency, and a is a hyper-parameter;
the candidate domain terms are ranked by score from largest to smallest.
Further, the TF-IDF value of each word in the input document is calculated as:
word_p = (c_p / N) × log(m / e_p)
where word_p represents the TF-IDF value of word p, c_p is the number of occurrences of word p in the document, N is the total number of words in the document, m is the number of sentences in the document, and e_p is the number of sentences containing word p.
Further, the similarity between sentences in the document is calculated from the TF-IDF values by cosine similarity:
w_ij = Σ_p (word_ip × word_jp) / (sqrt(Σ_p word_ip²) × sqrt(Σ_p word_jp²))
where w_ij represents the similarity between sentence S_i and sentence S_j, word_ip represents the TF-IDF value of word p in sentence S_i, and word_jp represents the TF-IDF value of word p in sentence S_j.
Further, calculating and ranking the importance of sentences based on the similarity between sentences comprises:
representing sentences as nodes of a graph that is fully connected by bidirectional edges;
iteratively calculating the importance of each sentence with the T-TextRank algorithm until convergence:
WS(S_i) = (1 - d) + d × Σ_{S_j ∈ In(S_i)} [w_ji / Σ_{S_k ∈ Out(S_j)} w_jk] × WS(S_j)
where WS(S_i) represents the importance of sentence S_i, d is the damping coefficient, In(S_i) is the set of nodes pointing to node S_i, and Out(S_j) is the set of nodes pointed to by node S_j;
sorting the converged sentence importance values from largest to smallest.
Further, generating multiple types of questions based on the extracted domain terms and key sentences includes generating Chinese multiple-choice question stems, Chinese fill-in-the-blank question stems, and Chinese question-and-answer stems;
Generating a Chinese multiple-choice question stem comprises:
obtaining the extracted key sentence list, matching domain terms from the domain term library as key information against the key sentence list, selecting sentences that contain a term acting as a subject or object component as stems, and using the corresponding domain term content as the correct option of the multiple-choice question;
Generating multiple-choice distractors (interference items) based on at least one of the following strategies:
segmenting the domain terms and selecting domain terms with the same part of speech as the correct option as interference items;
selecting domain terms that share an affix with the correct option as interference items;
training a word2vec model to obtain word vectors of the domain terms, and selecting domain terms as interference items based on the cosine similarity of their word vectors to the correct option;
selecting domain terms from the domain term library whose occurrence frequency in the document is close to that of the correct option as interference items;
Generating a Chinese fill-in-the-blank question stem comprises:
matching domain terms from the domain term library as key information against the key sentence library, selecting sentences that contain a term acting as a subject or object component as stems, and replacing the corresponding domain term content with a horizontal line to generate the Chinese fill-in-the-blank stem;
Generating a Chinese question-and-answer stem comprises:
when a key sentence contains a domain term together with at least one feature word such as "refers to", "means", "is called", "is also known as", "is defined as", or "is abbreviated as", generating a term-interpretation question stem;
when a key sentence contains at least one causal connective such as "because", "due to", or "therefore", generating a question stem by replacing the part of the sentence expressing the cause with a question word.
The invention also provides a Chinese question generation device based on the domain terms and the key sentences, which comprises:
The extraction module is used for extracting domain terms and key sentences in the document based on dependency syntactic analysis;
And
The generation module is used for generating multi-type topics based on the extracted domain terms and key sentences;
The extraction module comprises a first extraction module and a second extraction module;
The first extraction module is configured to:
Establishing a dependency syntax structure for sentences in an input document, and generating candidate domain terms according to the dependency syntax rules;
evaluating and ranking the generated candidate domain terms;
extracting a specified number of domain terms based on the ranking result;
The second extraction module is configured to:
Calculating TF-IDF values of words in the input document;
Calculating the similarity between sentences in the document based on the TF-IDF value;
Calculating importance of sentences based on the similarity between sentences and sorting;
and extracting a specified number of key sentences based on the importance ranking result of the sentences.
Further, the first extraction module is specifically configured to,
Filtering non-terms from the candidate domain terms based on part-of-speech filtering rules; a candidate domain term is deleted when its part of speech satisfies any one of the following rules:
a. the word ends with a numeral, preposition, conjunction, or locative word;
b. the word is not a noun;
c. the word contains separators or symbols;
Filtering the candidate domain terms based on grammar filtering rules; a candidate domain term is retained when it satisfies any one of the following rules:
d. noun + noun;
e. (adjective or noun) + noun;
calculating the score of the filtered candidate domain terms:
where s represents the score, f_word is the frequency of the current candidate domain term, f(i) is the product of the number of candidate domain terms with frequency i and the frequency i, C_1 is the number of candidate domain terms with frequency 1, C is the total number of extracted candidate domain terms, n is the maximum frequency, and a is a hyper-parameter;
the candidate domain terms are ranked by score from largest to smallest.
Further, the second extraction module is specifically configured to,
The TF-IDF value of each word in the input document is calculated as:
word_p = (c_p / N) × log(m / e_p)
where word_p represents the TF-IDF value of word p, c_p is the number of occurrences of word p in the document, N is the total number of words in the document, m is the number of sentences in the document, and e_p is the number of sentences containing word p.
Further, the second extraction module is specifically configured to,
representing sentences as nodes of a graph that is fully connected by bidirectional edges;
iteratively calculating the importance of each sentence with the T-TextRank algorithm until convergence:
WS(S_i) = (1 - d) + d × Σ_{S_j ∈ In(S_i)} [w_ji / Σ_{S_k ∈ Out(S_j)} w_jk] × WS(S_j)
where WS(S_i) represents the importance of sentence S_i, d is the damping coefficient, In(S_i) is the set of nodes pointing to node S_i, Out(S_j) is the set of nodes pointed to by node S_j, w_ji is the similarity between sentence S_j and sentence S_i, and w_jk is the similarity between sentence S_j and sentence S_k;
Further, the generating module comprises a first generating module, a second generating module and a third generating module;
The first generation module is configured to:
obtaining the extracted key sentence list, matching domain terms from the domain term library as key information against the key sentence list, selecting sentences that contain a term acting as a subject or object component as stems, and using the corresponding domain term content as the correct option of the multiple-choice question;
Generating multiple-choice distractors (interference items) based on at least one of the following strategies:
segmenting the domain terms and selecting domain terms with the same part of speech as the correct option as interference items;
selecting domain terms that share an affix with the correct option as interference items;
training a word2vec model to obtain word vectors of the domain terms, and selecting domain terms as interference items based on the cosine similarity of their word vectors to the correct option;
selecting domain terms from the domain term library whose occurrence frequency in the document is close to that of the correct option as interference items;
The second generation module is configured to:
matching domain terms from the domain term library as key information against the key sentence library, selecting sentences that contain a term acting as a subject or object component as stems, and replacing the corresponding domain term content with a horizontal line to generate the Chinese fill-in-the-blank stem;
the third generation module is configured to,
when a key sentence contains a domain term together with at least one feature word such as "refers to", "means", "is called", "is also known as", "is defined as", or "is abbreviated as", generating a term-interpretation question stem;
when a key sentence contains at least one causal connective such as "because", "due to", or "therefore", generating a question stem by replacing the part of the sentence expressing the cause with a question word.
The beneficial effects achieved by the invention are as follows:
The invention extracts key sentences and domain terms based on dependency syntax information and automatically generates multiple question types from the domain terms. The core algorithm has good extensibility and can be applied to automatic question generation in specific fields; the extracted domain terms and key sentences greatly improve the importance of the generated questions, and the method has broad application prospects.
Drawings
FIG. 1 is a schematic flow diagram of a method for automatically generating Chinese questions based on domain terms and key sentences according to the invention;
FIG. 2 is a flow diagram of extracting domain terms based on dependency syntax analysis in one embodiment of the invention;
FIG. 3 is a flow diagram of evaluating candidate domain terms in one embodiment of the invention;
FIG. 4 is a flow chart of extracting key sentences based on the T-textRank algorithm according to one embodiment of the present invention;
FIG. 5 is a flow diagram of automatically generating multi-type Chinese questions based on extracted domain terms and key sentences in one embodiment of the invention.
Detailed Description
The invention is further described below. The following examples are only for more clearly illustrating the technical aspects of the present invention, and are not intended to limit the scope of the present invention.
The invention provides a Chinese question generation method based on domain terms and key sentences, which extracts key sentences and domain terms from dependency syntax information and automatically generates multiple question types based on the domain terms.
Dependency parsing (DP) is one of the key technologies of natural language processing; its main purpose is to reveal the syntactic structure of a sentence and determine the dependency relations between its words by analyzing the dependencies between language units. The dependency parsing tool may be any dependency parser capable of obtaining dependency relations between words, for example the Stanford dependency parser (Stanford Parser), the neural-network-based dependency parser in the Hanlp toolkit, or the beam-search dependency parser based on the ArcEager transition system.
The invention relates to a Chinese question generation method based on domain terms and key sentences, which comprises the following steps:
extracting domain terms and key sentences based on dependency syntax analysis;
a multi-type topic is generated based on the extracted domain terms and key sentences.
One embodiment of the present invention uses the neural-network-based dependency parser in the Hanlp toolkit as an example to generate Chinese questions based on domain terms and key sentences; the specific implementation process, shown in FIG. 1, includes:
step S1, extracting domain terms based on dependency syntax analysis.
In particular, as shown in fig. 2,
In step S101, the neural-network-based dependency parser in the Hanlp toolkit builds a dependency syntax structure for the input sentence.
In step S102, candidate domain terms are generated according to the following dependency syntax rule; Table 1 explains the dependency labels.
(dep)?+(amod|nn)*+(nsubj|dobj)
TABLE 1 Dependency label interpretation

Symbol/label    Meaning
?               Zero or one occurrence
*               Zero or more occurrences
dep             Dependency relation
amod            Adjectival modifier
nn              Noun compound modifier
nsubj           Nominal subject
dobj            Direct object
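As an illustration, the rule of step S102 can be applied by encoding each token's dependency label as a single character and matching an ordinary regular expression over the label sequence. This is a minimal sketch, not the patent's implementation: the example words, the "x" fallback label, and the single-character encoding are invented, and ? and * are given standard regex semantics.

```python
import re

# Minimal sketch (not the patent's code): encode each dependency label as one
# character and match the rule (dep)?(amod|nn)*(nsubj|dobj) as a regex.
LABEL_CODE = {"dep": "d", "amod": "a", "nn": "n", "nsubj": "s", "dobj": "o"}
RULE = re.compile(r"d?[an]*[so]")

def candidate_terms(tokens):
    """tokens: list of (word, dependency_label) pairs in sentence order."""
    codes = "".join(LABEL_CODE.get(label, "x") for _, label in tokens)
    terms = []
    for m in RULE.finditer(codes):
        # Chinese words are concatenated without spaces to form the candidate.
        terms.append("".join(w for w, _ in tokens[m.start():m.end()]))
    return terms

# Invented example: "高压" (amod) + "断路器" (nsubj) yields one candidate term.
print(candidate_terms([("高压", "amod"), ("断路器", "nsubj"), ("跳闸", "x")]))
```

A real implementation would read the labels directly from the dependency parser's output instead of the hand-written pairs above.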
Step S103: evaluate and rank the candidate domain terms using rules, word frequencies, and affixes.
In particular, as shown with reference to fig. 3,
Step S103-1, filtering non-term in candidate domain terms based on part-of-speech filtering rules.
The segmented words and their part-of-speech tags are obtained from the dependency syntax structure, and candidate words that cannot be terms are filtered by checking their parts of speech; for example, words containing personal pronouns are deleted. The non-term part-of-speech filtering rules are shown in Table 2; a candidate domain term is deleted when its part of speech satisfies any one of the rules.
TABLE 2 Part-of-speech filtering rules for non-terms

Rule 1    The word ends with a numeral, preposition, conjunction, or locative word
Rule 2    The word is not a noun
Rule 3    The word contains separators or symbols
Step S103-2: filter the candidate domain terms based on the grammar filtering rules shown in Table 3; a candidate domain term is retained when it satisfies any one rule.
TABLE 3 Grammar filtering rules

Rule 1    Noun + noun
Rule 2    (Adjective or noun) + noun
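The two filtering stages can be sketched as a single predicate over a candidate's part-of-speech sequence. The single-letter tags below are hypothetical (n=noun, a=adjective, m=numeral, p=preposition, c=conjunction, f=locative word, w=separator/symbol); a real implementation would use the tag set of the chosen parser.

```python
# Hedged sketch of the filtering in Tables 2 and 3 with hypothetical POS tags.
BAD_ENDINGS = {"m", "p", "c", "f"}  # Table 2, rule 1: forbidden final POS

def keep_candidate(pos_tags):
    """Return True if the candidate's POS sequence survives both tables."""
    if pos_tags[-1] in BAD_ENDINGS:      # Table 2, rule 1: bad word ending
        return False
    if "w" in pos_tags:                  # Table 2, rule 3: separator/symbol inside
        return False
    # Table 2 rule 2 and Table 3: the candidate must end in a noun and have
    # at least two words (noun+noun or (adjective|noun)+noun).
    if len(pos_tags) < 2 or pos_tags[-1] != "n":
        return False
    # Table 3: every preceding word must be a noun or an adjective.
    return all(t in ("n", "a") for t in pos_tags[:-1])

print(keep_candidate(["a", "n"]))  # adjective + noun → kept
print(keep_candidate(["n", "m"]))  # ends with a numeral → deleted
```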
Step S103-3, calculating the score of the candidate domain term through a multi-factor evaluator.
The multi-factor evaluator computes the score as shown in formula (1), considering the word-frequency statistics together with the prefix and suffix of the word group; the calculation distinguishes three cases: the candidate contains a hot-word term prefix, contains a non-term prefix, or contains neither.
where f_word is the frequency of the current word (candidate domain term), f(i) is the product of the number of words with frequency i and the frequency i, C_1 is the number of words with frequency 1, C is the total number of extracted candidate domain terms, n is the maximum frequency, and a is a hyper-parameter (experimentally set to 2).
Step S104, sorting from large to small according to the scores of the candidate domain terms, and extracting a specified number of domain terms.
Step S2: extract key sentences based on the T-TextRank algorithm.
In particular, as shown with reference to fig. 4,
Step S201: preprocess the input document, including sentence segmentation, word segmentation, and stop-word removal.
Step S202: compute the TF-IDF (term frequency-inverse document frequency) value of each word and represent each sentence as a feature vector.
Assuming the Chinese document is D and contains m sentences, D can be written as D = {S_1, S_2, …, S_m}; at the same time, each sentence can be represented by a feature word vector:
S_i = {word_i1, word_i2, …, word_iN},
where N is the number of words in the entire document and word_in is the TF-IDF value of word n in sentence S_i.
TF-IDF is calculated as shown in formula (2):
word_n = (c_n / N) × log(m / e_n)
where c_n is the number of occurrences of word n in the document, N is the total number of words in the document, and e_n is the number of sentences containing word n.
word_in then represents the TF-IDF value of word n in sentence S_i.
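The TF-IDF computation of step S202 can be sketched as follows, assuming the standard definition tf = c_n / N and idf = log(m / e_n); the patent's exact smoothing is not shown in the text, and the example sentences are invented.

```python
import math

# Sketch of formula (2) under the standard TF-IDF definition (assumption:
# tf = c_n / N, idf = log(m / e_n), no extra smoothing).
def tf_idf(word, doc_words, sentences):
    c_n = doc_words.count(word)              # occurrences of the word in the document
    N = len(doc_words)                       # total number of words in the document
    m = len(sentences)                       # number of sentences
    e_n = sum(word in s for s in sentences)  # sentences containing the word
    return (c_n / N) * math.log(m / e_n)

# Invented example document with three segmented sentences.
sentences = [["变压器", "运行"], ["变压器", "检修"], ["线路", "巡视"]]
doc_words = [w for s in sentences for w in s]
print(round(tf_idf("线路", doc_words, sentences), 4))  # → 0.1831
```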
In step S203, the similarity between sentences is calculated with cosine similarity, as shown in formula (3):
w_ij = Σ_n (word_in × word_jn) / (sqrt(Σ_n word_in²) × sqrt(Σ_n word_jn²))
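The cosine similarity of step S203 can be sketched directly over two sentence vectors whose components are TF-IDF values (the vectors below are illustrative):

```python
import math

# Sketch of formula (3): cosine similarity between two sentence vectors.
def cosine(v1, v2):
    dot = sum(a * b for a, b in zip(v1, v2))
    norm1 = math.sqrt(sum(a * a for a in v1))
    norm2 = math.sqrt(sum(b * b for b in v2))
    return dot / (norm1 * norm2) if norm1 and norm2 else 0.0

print(cosine([1.0, 0.0], [1.0, 0.0]))  # identical vectors → 1.0
print(cosine([1.0, 0.0], [0.0, 1.0]))  # orthogonal vectors → 0.0
```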
In step S204, sentence importance is ranked with the T-TextRank algorithm, calculated as shown in formula (4):
WS(S_i) = (1 - d) + d × Σ_{S_j ∈ In(S_i)} [w_ji / Σ_{S_k ∈ Out(S_j)} w_jk] × WS(S_j)
Each sentence is represented as a node, denoted S, and the bidirectional full connection between the sentences of the document forms a graph. The initial weight WS of each node is 1/m, and the initial weight of each edge is the sentence similarity w_ij. Here d is the damping coefficient, typically 0.85; In(S_i) denotes the set of nodes pointing to node S_i, and Out(S_j) denotes the set of nodes pointed to by node S_j. Formula (4) converges after several iterations.
Step S205: sort the converged WS values of the T-TextRank algorithm from largest to smallest and select the specified number of key sentences.
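Steps S204-S205 can be sketched as a plain power iteration over the similarity matrix; the matrix values below are invented, and a fixed iteration count stands in for a proper convergence check.

```python
# Sketch of the T-TextRank iteration of formula (4) on a fully connected
# sentence graph; w is a symmetric similarity matrix (invented values) and
# d is the damping coefficient of 0.85 from the text.
def textrank(w, d=0.85, iterations=100):
    m = len(w)
    ws = [1.0 / m] * m                  # initial node weight 1/m
    for _ in range(iterations):
        new_ws = []
        for i in range(m):
            total = 0.0
            for j in range(m):
                if j == i:
                    continue
                out_j = sum(w[j][k] for k in range(m) if k != j)
                if out_j:
                    total += w[j][i] / out_j * ws[j]
            new_ws.append((1 - d) + d * total)
        ws = new_ws
    return ws

w = [[0, 0.5, 0.1],
     [0.5, 0, 0.2],
     [0.1, 0.2, 0]]
scores = textrank(w)
print(max(range(3), key=scores.__getitem__))  # index of the top-ranked sentence → 1
```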
Note that the input of step S1 is processed in sentence units, while the input of step S2 is processed in text units.
And step S3, automatically generating multi-type Chinese questions based on the extracted domain terms and key sentences.
Specifically, as shown in FIG. 5, three types of Chinese questions are generated based on the extracted domain terms and key sentences.
S301: generate Chinese multiple-choice questions based on domain terms and key sentences. Specifically,
Step S301-1: generate the Chinese multiple-choice question stem. First, the extracted key sentence list is obtained; domain terms from the domain term library are matched against the key sentence list as key information; sentences that contain a term acting as a subject or object component are selected as stems, and the corresponding domain term content is used as the correct option.
Step S301-2: combine different linguistic features to generate multiple-choice interference items (distractors). The generation strategies are shown in Table 4.
TABLE 4 Interference item generation strategies

Strategy 1    Segment the domain terms and select domain terms with the same part of speech as the correct option
Strategy 2    Select domain terms that share an affix with the correct option
Strategy 3    Train a word2vec model to obtain word vectors and select domain terms by cosine similarity to the correct option
Strategy 4    Select domain terms from the library whose occurrence frequency in the document is close to the correct option's
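One of the interference-item strategies, picking domain terms whose document frequency is closest to the correct option's, can be sketched as follows; the term library and frequency counts are invented for illustration.

```python
# Sketch of the frequency-similarity distractor strategy: choose the k domain
# terms whose document frequency is closest to the correct option's frequency.
def frequency_distractors(correct, term_freq, k=3):
    target = term_freq[correct]
    others = [t for t in term_freq if t != correct]
    return sorted(others, key=lambda t: abs(term_freq[t] - target))[:k]

# Invented term library with document frequencies.
term_freq = {"变压器": 12, "断路器": 11, "互感器": 10, "避雷器": 4, "隔离开关": 3}
print(frequency_distractors("变压器", term_freq))  # → ['断路器', '互感器', '避雷器']
```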
S302, generating a Chinese gap-filling question stem based on the domain terms and the key sentences.
Specifically, domain terms from the domain term library are matched against the key sentence library as key information; sentences that contain a term acting as a subject or object component are selected as stems, and the corresponding domain term content is replaced with a horizontal line.
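The replacement step can be sketched in a few lines; the example sentence is invented, and a run of underscores stands in for the horizontal line.

```python
# Sketch of the fill-in-the-blank step: replace the matched domain term in a
# key sentence with a horizontal line to form the stem.
def make_cloze(sentence, term, blank="____"):
    return sentence.replace(term, blank) if term in sentence else None

print(make_cloze("变压器是利用电磁感应原理改变交流电压的装置。", "变压器"))
```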
S303: generate a Chinese question-and-answer stem based on the domain terms and key sentences.
When a key sentence contains a domain term together with a feature word such as "refers to", "means", "is called", "is also known as", "is defined as", or "is abbreviated as", a term-interpretation question stem is generated.
When a key sentence contains a causal connective such as "because", "due to", or "therefore", a question stem is generated by replacing the part of the sentence expressing the cause with a question word.
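A hedged sketch of the causal rule follows; the marker words ("因为"/"由于" for the cause, "所以" for the effect) and the question word "为什么" (why) are illustrative, since the patent's exact word lists and clause splitting are not fully shown.

```python
# Hedged sketch: when a cause marker and an effect marker both occur, drop the
# cause clause and prepend the question word to the effect clause.
CAUSE_MARKERS = ("因为", "由于")

def causal_question(sentence):
    for marker in CAUSE_MARKERS:
        if marker in sentence and "所以" in sentence:
            effect = sentence.split("所以", 1)[1].rstrip("。")
            return "为什么" + effect + "?"
    return None

print(causal_question("因为线路过载，所以断路器跳闸。"))  # → 为什么断路器跳闸?
```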
Another embodiment of the present invention provides a chinese question generation apparatus based on domain terms and key sentences, including:
The extraction module is used for extracting domain terms and key sentences in the document based on dependency syntactic analysis;
And
The generation module is used for generating multi-type topics based on the extracted domain terms and key sentences;
the extraction module comprises a first extraction module and a second extraction module;
In particular, the first extraction module is used for,
Establishing a dependency syntax structure for sentences in an input document, and generating candidate domain terms according to the dependency syntax rules;
evaluating and ranking the generated candidate domain terms;
extracting a specified number of domain terms based on the ranking result;
the second extraction module is used for extracting the data from the first extraction module,
Calculating TF-IDF values of words in the input document;
Calculating the similarity between sentences in the document based on the TF-IDF value;
Calculating importance of sentences based on the similarity between sentences and sorting;
and extracting a specified number of key sentences based on the importance ranking result of the sentences.
In the embodiment of the present invention, the first extraction module is specifically configured to,
Filtering non-terms from the candidate domain terms based on part-of-speech filtering rules; a candidate domain term is deleted when its part of speech satisfies any one of the following rules:
a. the word ends with a numeral, preposition, conjunction, or locative word;
b. the word is not a noun;
c. the word contains separators or symbols;
Filtering the candidate domain terms based on grammar filtering rules; a candidate domain term is retained when it satisfies any one of the following rules:
d. noun + noun;
e. (adjective or noun) + noun;
calculating the score of the filtered candidate domain terms:
where s represents the score, f_word is the frequency of the current candidate domain term, f(i) is the product of the number of candidate domain terms with frequency i and the frequency i, C_1 is the number of candidate domain terms with frequency 1, C is the total number of extracted candidate domain terms, n is the maximum frequency, and a is a hyper-parameter;
the candidate domain terms are ranked by score from largest to smallest.
In the embodiment of the present invention, the second extraction module is specifically configured to,
The TF-IDF value of each word in the input document is calculated as:
word_p = (c_p / N) × log(m / e_p)
where word_p represents the TF-IDF value of word p, c_p is the number of occurrences of word p in the document, N is the total number of words in the document, m is the number of sentences in the document, and e_p is the number of sentences containing word p.
In the embodiment of the present invention, the second extraction module is specifically configured to,
Representing sentences as nodes and connecting every pair of sentences with bidirectional edges, so that the sentences form a fully connected graph;
the importance of the sentence is iteratively calculated using the T-TextRank algorithm as follows until convergence:
Where WS(S_i) represents the importance of sentence S_i, d is the preset damping coefficient, In(S_i) represents all nodes pointing to node S_i, Out(S_j) represents all nodes pointed to by node S_j, w_ij represents the similarity of sentence S_i and sentence S_j, and w_jk represents the similarity of sentence S_j and sentence S_k;
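The iteration described above is the weighted, sentence-level TextRank recurrence. A minimal sketch, assuming the standard formulation WS(S_i) = (1 − d) + d · Σ_{S_j ∈ In(S_i)} [w_ji / Σ_{S_k ∈ Out(S_j)} w_jk] · WS(S_j) (the patent's T-TextRank variant may differ in details not reproduced here):

```python
def textrank(sim, d=0.85, tol=1e-6, max_iter=200):
    """Iterate sentence importance over a fully connected graph until
    convergence. `sim` is a symmetric n x n similarity matrix (w_ij).
    Returns the list of converged WS values, one per sentence."""
    n = len(sim)
    ws = [1.0] * n
    for _ in range(max_iter):
        new = []
        for i in range(n):
            acc = 0.0
            for j in range(n):
                if j == i:
                    continue
                # out-weight of node j: sum of its edge weights
                denom = sum(sim[j][k] for k in range(n) if k != j)
                if denom > 0:
                    acc += sim[j][i] / denom * ws[j]
            new.append((1 - d) + d * acc)
        if max(abs(a - b) for a, b in zip(new, ws)) < tol:
            return new
        ws = new
    return ws
```

The converged values would then be sorted in descending order to pick the top key sentences.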
In the embodiment of the invention, the generating module comprises a first generating module, a second generating module and a third generating module;
Wherein, the first generation module is configured to,
Obtaining the extracted key sentence list, matching within the key sentence list using the domain terms in the domain term library as key information, selecting sentences in which a term appears as the subject or object component as question stems, and taking the matched domain term as the correct option of the choice question;
Generating a choice question interference term based on at least one of the following strategies:
Performing word segmentation on the domain terms and selecting those with the same part of speech as the correct option as interference terms;
Selecting domain terms that share an affix with the correct option as interference terms;
obtaining a word vector of the domain term by training a word2vec model, and selecting the domain term as an interference item based on cosine similarity of the word vector of the domain term;
selecting domain terms with similar occurrence frequencies in documents from a domain term library as interference terms;
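The word-vector strategy above can be sketched as follows. The embeddings here are placeholders standing in for vectors obtained from a trained word2vec model:

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def pick_distractors(answer, vectors, k=3):
    """Rank the other domain terms by cosine similarity of their word
    vectors to the correct option and return the top-k as interference
    terms. `vectors` maps each domain term to its embedding."""
    scored = sorted(
        ((cosine(vectors[answer], vec), term)
         for term, vec in vectors.items() if term != answer),
        reverse=True)
    return [term for _, term in scored[:k]]
```

In practice the vectors would come from a word2vec model trained on the domain corpus; the other three strategies (same part of speech, shared affix, similar frequency) are simple filters over the domain term library.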
The second generation module is configured to,
Matching in the key sentence library using the domain terms in the domain term library as key information, selecting sentences in which a term appears as the subject or object component as question stems, and replacing the matched domain term with a horizontal-line blank to generate a Chinese gap-filling question stem;
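A minimal sketch of the gap-filling step; the blank marker and the simple substring match are illustrative:

```python
def make_cloze(sentence, term, blank="____"):
    """Turn a key sentence into a fill-in-the-blank stem by replacing
    the matched domain term with a horizontal-line blank. Returns
    (stem, answer), or None if the term does not occur in the sentence."""
    if term not in sentence:
        return None
    return sentence.replace(term, blank), term
```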
the third generation module is configured to,
When the key sentence contains a domain term together with at least one definitional feature word, such as "refers to", "means", "is a kind of", "is also known as", "is defined as", "is short for", "is also called", or "denotes", a term-interpretation question stem is generated;
When the key sentence contains at least one causal connective, such as "because", "therefore" or "so", a cause-and-effect question stem is generated by replacing the clause expressing the cause with a question word.
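The two rules above can be sketched as follows. The cue words are illustrative English stand-ins for the Chinese feature words and causal connectives actually listed in the patent, and the cause-clause replacement is deliberately naive:

```python
# Illustrative stand-ins; the patent's actual Chinese cue-word lists differ.
DEFINITION_CUES = ["refers to", "is defined as", "also known as", "is short for", "means"]
CAUSAL_CUES = ["because", "therefore"]

def make_qa_stem(sentence, term=None):
    """Generate a question-answering stem from one key sentence.

    Rule 1: a domain term plus a definitional feature word yields a
    term-interpretation stem ("What is X?").
    Rule 2: a causal connective yields a cause-and-effect stem, replacing
    the cause with a question word ("Why ...?").
    Matching is naive substring search; returns None if neither rule fires.
    """
    if term and term in sentence and any(c in sentence for c in DEFINITION_CUES):
        return f"What is {term}?"
    for cue in CAUSAL_CUES:
        if cue in sentence:
            effect = sentence.split(cue)[0].strip().rstrip(",.")
            if effect:
                return "Why " + effect[0].lower() + effect[1:] + "?"
    return None
```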
It should be noted that the apparatus embodiment corresponds to the method embodiment; the implementation manners of the method embodiment are applicable to the apparatus embodiment and achieve the same or similar technical effects, so a detailed description is omitted here.
It will be appreciated by those skilled in the art that embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
Finally, it should be noted that: the above embodiments are only for illustrating the technical aspects of the present invention and not for limiting the same, and although the present invention has been described in detail with reference to the above embodiments, it should be understood by those of ordinary skill in the art that: modifications and equivalents may be made to the specific embodiments of the invention without departing from the spirit and scope of the invention, which is intended to be covered by the claims.

Claims (10)

1. A Chinese question generation method based on domain terms and key sentences is characterized by comprising the following steps:
extracting domain terms and key sentences in the document based on dependency syntax analysis;
Generating multi-type topics based on the extracted domain terms and key sentences;
wherein extracting domain terms in the document based on the dependency syntax analysis includes:
establishing a dependency syntax structure for sentences in an input document, and generating candidate domain terms according to dependency syntax rules; wherein the dependency syntax structure is established using any one of the following: the Stanford dependency parser, the neural-network-based dependency parser in the HanLP toolkit, or a beam-search dependency parser based on the ArcEager transition system; the dependency syntax rule is:
(dep)?+(amod|nn)*+(nsubj|dobj);
wherein ? indicates zero or one occurrence, * indicates zero or more occurrences, + denotes concatenation, dep denotes a dependency relation, amod denotes an adjectival modifier, nn denotes a noun compound modifier, nsubj denotes a nominal subject, and dobj denotes a direct object;
evaluating and sequencing the generated candidate domain terms;
extracting a specified number of domain terms based on the ranking result;
Extracting key sentences in the document based on the dependency syntax analysis includes:
calculating TF-IDF values of words in the input document;
Calculating the similarity between sentences in the document based on the TF-IDF value;
Calculating importance of sentences based on the similarity between sentences and sorting;
Extracting a specified number of key sentences based on the importance ranking result of the sentences;
The generating multiple types of questions based on the extracted domain terms and key sentences includes generating a Chinese choice question stem, generating a Chinese gap-filling question stem, and generating a Chinese question-and-answer stem;
The generating the Chinese choice question stem comprises the following steps:
obtaining the extracted key sentence list, matching within the key sentence list using the domain terms in the domain term library as key information, selecting sentences in which a term appears as the subject or object component as question stems, and taking the matched domain term as the correct option of the choice question;
Generating a choice question interference term based on at least one of the following strategies:
Performing word segmentation on the domain terms and selecting those with the same part of speech as the correct option as interference terms;
Selecting domain terms that share an affix with the correct option as interference terms;
obtaining a word vector of the domain term by training a word2vec model, and selecting the domain term as an interference item based on cosine similarity of the word vector of the domain term;
selecting domain terms with similar occurrence frequencies in documents from a domain term library as interference terms;
the step of generating the Chinese gap-filling question stem comprises the following steps:
matching in the key sentence library using the domain terms in the domain term library as key information, selecting sentences in which a term appears as the subject or object component as question stems, and replacing the matched domain term with a horizontal-line blank to generate a Chinese gap-filling question stem;
the step of generating the Chinese question and answer stem comprises the following steps:
when the key sentence contains a domain term together with at least one definitional feature word, such as "refers to", "means", "is a kind of", "is also known as", "is defined as", "is short for", "is also called", or "denotes", a term-interpretation question stem is generated;
When the key sentence contains at least one causal connective, such as "because", "therefore" or "so", a cause-and-effect question stem is generated by replacing the clause expressing the cause with a question word.
2. The method for generating chinese questions based on domain terms and key sentences as claimed in claim 1, wherein evaluating and ranking the generated candidate domain terms comprises:
Filtering non-terms out of the candidate domain terms based on part-of-speech filtering rules, deleting a candidate when its part-of-speech sequence matches any one of the following rules:
a. the term ends with a numeral, a preposition, a conjunction, or a localizer;
b. the term is a non-noun;
c. the term contains separators or symbols;
Filtering the candidate domain terms based on grammar filtering rules, retaining a candidate when it matches any one of the following patterns:
d. noun + noun;
e. (adjective or noun) + noun;
calculating the score of the filtered candidate domain terms:
Where s represents the score, f_word is the frequency of the current candidate domain term, f(i) is the product of the number of candidate domain terms with frequency i and the frequency i, C_1 is the number of candidate domain terms with frequency 1, C is the total number of extracted candidate domain terms, n is the maximum frequency, and a is a hyper-parameter;
the candidate domain terms are ranked by score in descending order.
3. The method for generating Chinese questions based on domain terms and key sentences as claimed in claim 1, wherein the calculating TF-IDF values of words in the input document comprises:
Where word_p represents the TF-IDF value of word p, c_n is the number of occurrences of word p in the document, N is the total number of words in the document, m is the number of sentences in the document, and e_p is the number of sentences containing word p.
4. The method for generating chinese questions based on domain terms and key sentences as claimed in claim 3, wherein the calculating the similarity between sentences in the document based on TF-IDF values comprises:
Where w_ij represents the similarity between sentence S_i and sentence S_j, word_ip represents the TF-IDF value of word p in sentence S_i, and word_jp represents the TF-IDF value of word p in sentence S_j.
5. The method for generating Chinese questions based on domain terms and key sentences as claimed in claim 4, wherein the calculating the importance of sentences based on the similarity between sentences and ranking comprises:
Representing sentences as nodes and connecting every pair of sentences with bidirectional edges, so that the sentences form a fully connected graph;
the importance of the sentence is iteratively calculated using the T-TextRank algorithm as follows until convergence:
Where WS(S_i) represents the importance of sentence S_i, d is the preset damping coefficient, In(S_i) represents all nodes pointing to node S_i, and Out(S_j) represents all nodes pointed to by node S_j;
the converged sentence importance values are ranked in descending order.
6. A chinese question generation apparatus based on domain terms and key sentences, for implementing the chinese question generation method based on domain terms and key sentences as set forth in any one of claims 1 to 5, the apparatus comprising:
The extraction module is used for extracting domain terms and key sentences in the document based on dependency syntactic analysis;
And
The generation module is used for generating multi-type topics based on the extracted domain terms and key sentences;
The extraction module comprises a first extraction module and a second extraction module;
The first extraction module is configured to,
Establishing a dependency syntax structure for sentences in an input document, and generating candidate domain terms according to the dependency syntax rules;
evaluating and sequencing the generated candidate domain terms;
extracting a specified number of domain terms based on the ranking result;
The second extraction module is configured to,
Calculating TF-IDF values of words in the input document;
Calculating the similarity between sentences in the document based on the TF-IDF value;
Calculating importance of sentences based on the similarity between sentences and sorting;
and extracting a specified number of key sentences based on the importance ranking result of the sentences.
7. The apparatus for generating chinese questions based on domain terms and key sentences of claim 6, wherein the first extraction module is configured to,
Filtering non-terms out of the candidate domain terms based on part-of-speech filtering rules, deleting a candidate when its part-of-speech sequence matches any one of the following rules:
a. the term ends with a numeral, a preposition, a conjunction, or a localizer;
b. the term is a non-noun;
c. the term contains separators or symbols;
Filtering the candidate domain terms based on grammar filtering rules, retaining a candidate when it matches any one of the following patterns:
d. noun + noun;
e. (adjective or noun) + noun;
calculating the score of the filtered candidate domain terms:
Where s represents the score, f_word is the frequency of the current candidate domain term, f(i) is the product of the number of candidate domain terms with frequency i and the frequency i, C_1 is the number of candidate domain terms with frequency 1, C is the total number of extracted candidate domain terms, n is the maximum frequency, and a is a hyper-parameter;
the candidate domain terms are ranked by score in descending order.
8. The apparatus for generating chinese questions based on domain terms and key sentences of claim 6, wherein the second extraction module is configured to,
The TF-IDF values of the words in the input document are calculated as follows:
Where word_p represents the TF-IDF value of word p, c_n is the number of occurrences of word p in the document, N is the total number of words in the document, m is the number of sentences in the document, and e_p is the number of sentences containing word p.
9. The apparatus for generating chinese questions based on domain terms and key sentences of claim 8, wherein the second extraction module is configured to,
Representing sentences as nodes and connecting every pair of sentences with bidirectional edges, so that the sentences form a fully connected graph;
the importance of the sentence is iteratively calculated using the T-TextRank algorithm as follows until convergence:
Where WS(S_i) represents the importance of sentence S_i, d is the preset damping coefficient, In(S_i) represents all nodes pointing to node S_i, Out(S_j) represents all nodes pointed to by node S_j, w_ij represents the similarity of sentence S_i and sentence S_j, and w_jk represents the similarity of sentence S_j and sentence S_k;
the converged sentence importance values are ranked in descending order.
10. The apparatus for generating Chinese questions based on domain terms and key sentences as recited in claim 6, wherein the generation module comprises a first generation module, a second generation module and a third generation module;
the first generation module is configured to,
Obtaining the extracted key sentence list, matching within the key sentence list using the domain terms in the domain term library as key information, selecting sentences in which a term appears as the subject or object component as question stems, and taking the matched domain term as the correct option of the choice question;
Generating a choice question interference term based on at least one of the following strategies:
Performing word segmentation on the domain terms and selecting those with the same part of speech as the correct option as interference terms;
Selecting domain terms that share an affix with the correct option as interference terms;
obtaining a word vector of the domain term by training a word2vec model, and selecting the domain term as an interference item based on cosine similarity of the word vector of the domain term;
selecting domain terms with similar occurrence frequencies in documents from a domain term library as interference terms;
the second generation module is configured to generate, based on the first generation module,
Matching in the key sentence library using the domain terms in the domain term library as key information, selecting sentences in which a term appears as the subject or object component as question stems, and replacing the matched domain term with a horizontal-line blank to generate a Chinese gap-filling question stem;
the third generation module is configured to,
When the key sentence contains a domain term together with at least one definitional feature word, such as "refers to", "means", "is a kind of", "is also known as", "is defined as", "is short for", "is also called", or "denotes", a term-interpretation question stem is generated;
When the key sentence contains at least one causal connective, such as "because", "therefore" or "so", a cause-and-effect question stem is generated by replacing the clause expressing the cause with a question word.
CN202111019721.3A 2021-09-01 2021-09-01 Automatic Chinese problem generation method and device based on field terms and key sentences Active CN113705208B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111019721.3A CN113705208B (en) 2021-09-01 2021-09-01 Automatic Chinese problem generation method and device based on field terms and key sentences


Publications (2)

Publication Number Publication Date
CN113705208A CN113705208A (en) 2021-11-26
CN113705208B true CN113705208B (en) 2024-05-28

Family

ID=78658645


Country Status (1)

Country Link
CN (1) CN113705208B (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20170107282A (en) * 2016-03-15 2017-09-25 한국전자통신연구원 Apparatus and method for supporting decision making based on natural language understanding and question and answer
CN108363743A (en) * 2018-01-24 2018-08-03 清华大学深圳研究生院 A kind of intelligence questions generation method, device and computer readable storage medium
CN111930914A (en) * 2020-08-14 2020-11-13 工银科技有限公司 Question generation method and device, electronic equipment and computer-readable storage medium
CN112163405A (en) * 2020-09-08 2021-01-01 北京百度网讯科技有限公司 Question generation method and device
CN112686025A (en) * 2021-01-27 2021-04-20 浙江工商大学 Chinese choice question interference item generation method based on free text
CN113128206A (en) * 2021-04-26 2021-07-16 中国科学技术大学 Question generation method based on word importance weighting




Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant