CN113705208B - Automatic Chinese question generation method and device based on domain terms and key sentences


Info

Publication number: CN113705208B
Application number: CN202111019721.3A
Authority: CN (China)
Prior art keywords: sentences, terms, domain, word, domain terms
Legal status: Active (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis)
Other languages: Chinese (zh)
Other versions: CN113705208A
Inventors: 赵军, 董勤伟, 查显光, 吴俊, 赵新冬, 戴威, 于聪聪
Current Assignee (the listed assignees may be inaccurate; Google has not performed a legal analysis): State Grid Jiangsu Electric Power Co Ltd; Electric Power Research Institute of State Grid Jiangsu Electric Power Co Ltd
Original Assignee: State Grid Jiangsu Electric Power Co Ltd; Electric Power Research Institute of State Grid Jiangsu Electric Power Co Ltd
Application filed by State Grid Jiangsu Electric Power Co Ltd and Electric Power Research Institute of State Grid Jiangsu Electric Power Co Ltd
Priority: CN202111019721.3A
Published as CN113705208A; granted and published as CN113705208B
Legal status: Active


Classifications

    • G06F40/20 Natural language analysis (under G06F40/00 Handling natural language data; G06F Electric digital data processing; G06 Computing; G Physics)
    • G06F40/253 Grammatical analysis; Style critique
    • G06F40/216 Parsing using statistical methods
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking


Abstract

The invention discloses a Chinese question generation method and device based on domain terms and key sentences. The method establishes a dependency syntax structure for the sentences of an input document, generates candidate domain terms according to dependency syntax rules, evaluates and ranks the candidates, and extracts a specified number of domain terms from the ranking result. It then represents each sentence by the TF-IDF values of its words, computes sentence importance with a T-TextRank algorithm, and extracts a specified number of key sentences from the importance ranking. Finally, it generates Chinese multiple-choice question stems, fill-in-the-blank question stems, and question-and-answer stems from the extracted domain terms and key sentences. The extracted domain terms and key sentences can greatly improve the importance of the generated questions, and the method has broad application prospects.

Description

Automatic Chinese question generation method and device based on domain terms and key sentences
Technical Field
The invention belongs to the technical field of information extraction, and particularly relates to an automatic Chinese question generation method based on domain terms and key sentences.
Background
In recent years, knowledge assessment and performance assessment have become very important to educational institutions and enterprises, and assessment in the form of question questionnaires is an effective strategy. However, writing questions manually requires substantial manpower and time, so research on automatic question generation aims to change this situation. Automatic question generation technology uses information technology to screen and extract important knowledge from the information in a document and to generate questions automatically, replacing the traditional mode of manually writing questions for a test question library.
Existing solutions focus only on sentence-pattern templates and grammatical dependency trees to generate questions, not on the domain terms contained in sentences. Examples include the automatic generation of Chinese factual questions using grammar rule templates proposed by Liu et al. in 2016 (LIU M, RUS V, LIU L. Automatic Chinese factual question generation [J]. IEEE Transactions on Learning Technologies, 2016, 10(2): 1-1.) and the question generation method of Khullar et al. in 2018, which uses relative pronouns and adverbs (KHULLAR P, RACHNA K, HASE M, et al. Automatic question generation using relative pronouns and adverbs [C] // Proceedings of ACL 2018, Student Research Workshop. 2018: 153-158.).
However, these automatic question generation methods all use question templates that select nouns in sentences to pose questions. Unlike English, Chinese has no natural separators between words, so word segmentation errors often occur in documents from specific domains, which degrades question generation. Traditional question generation splits domain vocabulary apart during segmentation (for example, terms from a human-resources policy), which both lowers the quality of the generated questions and leaves domain knowledge points insufficiently checked, even though questions about knowledge in these domains are the most valuable for staff assessment or student learning.
Disclosure of Invention
The invention aims to provide an automatic Chinese question generation method and device based on domain terms and key sentences; the extracted domain terms and key sentences can greatly improve the importance of the generated questions.
In order to achieve the above purpose, the technical scheme adopted by the invention is as follows:
the invention provides a Chinese question generation method based on domain terms and key sentences, which comprises the following steps:
extracting domain terms and key sentences in the document based on dependency syntax analysis;
Generating multi-type topics based on the extracted domain terms and key sentences;
wherein extracting domain terms in the document based on the dependency syntax analysis includes:
establishing a dependency syntax structure for sentences in an input document, and generating candidate domain terms according to the dependency syntax rules;
evaluating and ranking the generated candidate domain terms;
extracting a specified number of domain terms based on the ranking result;
Extracting key sentences in the document based on the dependency syntax analysis includes:
calculating TF-IDF values of words in the input document;
Calculating the similarity between sentences in the document based on the TF-IDF value;
Calculating importance of sentences based on the similarity between sentences and sorting;
and extracting a specified number of key sentences based on the importance ranking result of the sentences.
Further, the dependency syntax structure is built with any one of the following:
the Stanford dependency parser, the neural-network-based dependency parser in the Hanlp toolkit, or the beam-search dependency parser based on the ArcEager transition system.
Further, the dependency syntax rule is:
(dep)?+(amod|nn)*+(nsubj|dobj);
wherein ? indicates zero or one occurrence, * indicates zero or more occurrences, dep denotes a dependency relation, amod an adjectival modifier, nn a noun compound modifier, nsubj a nominal subject, and dobj a direct object.
Further, evaluating and sorting the generated candidate domain terms includes:
Filtering non-terms from the candidate domain terms based on part-of-speech filtering rules; a candidate domain term is deleted when its part of speech satisfies any one of the following rules:
a. the word ends with a numeral, preposition, conjunction, or locative word;
b. the word is not a noun;
c. the word contains separators or symbols;
Filtering the candidate domain terms based on grammar filtering rules; a candidate domain term is retained when it satisfies any one of the following rules:
d. noun + noun;
e. (adjective or noun) + noun;
calculating the score of the filtered candidate domain terms:
where s represents the score, f_word is the frequency of the current candidate domain term, f(i) is the product of the number of candidate domain terms with frequency i and the frequency i, C_1 is the number of candidate domain terms with frequency 1, C is the total number of extracted candidate domain terms, n is the maximum frequency, and a is a hyper-parameter;
the candidate domain terms are ranked by score from largest to smallest.
Further, the TF-IDF value of each word in the input document is calculated as:
word_p = (c_p / N) × log(m / e_p)
where word_p represents the TF-IDF value of word p, c_p is the number of occurrences of word p in the document, N is the total number of words in the document, m is the number of sentences in the document, and e_p is the number of sentences containing word p.
Further, the similarity between sentences in the document is calculated from the TF-IDF values by cosine similarity:
w_ij = Σ_p (word_ip × word_jp) / (sqrt(Σ_p word_ip²) × sqrt(Σ_p word_jp²))
where w_ij represents the similarity between sentence S_i and sentence S_j, word_ip represents the TF-IDF value of word p in sentence S_i, and word_jp represents the TF-IDF value of word p in sentence S_j.
Further, calculating and ranking the importance of sentences based on the similarity between sentences comprises:
representing sentences as nodes of a graph that is fully connected by bidirectional edges;
iteratively calculating the importance of each sentence with the T-TextRank algorithm until convergence:
WS(S_i) = (1 - d) + d × Σ_{S_j ∈ In(S_i)} [w_ji / Σ_{S_k ∈ Out(S_j)} w_jk] × WS(S_j)
where WS(S_i) represents the importance of sentence S_i, d is the damping coefficient, In(S_i) is the set of nodes pointing to node S_i, and Out(S_j) is the set of nodes pointed to by node S_j;
sorting the converged sentence importance values from largest to smallest.
Further, generating multiple types of questions based on the extracted domain terms and key sentences includes generating Chinese multiple-choice question stems, Chinese fill-in-the-blank question stems, and Chinese question-and-answer stems;
Generating a Chinese multiple-choice question stem comprises:
obtaining the extracted key sentence list, matching domain terms from the domain term library as key information against the key sentence list, selecting sentences that contain a term acting as a subject or object component as stems, and using the corresponding domain term content as the correct option of the multiple-choice question;
Generating multiple-choice distractors (interference items) based on at least one of the following strategies:
segmenting the domain terms and selecting domain terms with the same part of speech as the correct option as interference items;
selecting domain terms that share an affix with the correct option as interference items;
training a word2vec model to obtain word vectors of the domain terms, and selecting domain terms as interference items based on the cosine similarity of their word vectors to the correct option;
selecting domain terms from the domain term library whose occurrence frequency in the document is close to that of the correct option as interference items;
Generating a Chinese fill-in-the-blank question stem comprises:
matching domain terms from the domain term library as key information against the key sentence library, selecting sentences that contain a term acting as a subject or object component as stems, and replacing the corresponding domain term content with a horizontal line to generate the Chinese fill-in-the-blank stem;
Generating a Chinese question-and-answer stem comprises:
when a key sentence contains a domain term together with at least one feature word such as "refers to", "means", "is called", "is also known as", "is defined as", or "is abbreviated as", generating a term-interpretation question stem;
when a key sentence contains at least one causal connective such as "because", "due to", or "therefore", generating a question stem by replacing the part of the sentence expressing the cause with a question word.
The invention also provides a Chinese question generation device based on the domain terms and the key sentences, which comprises:
The extraction module is used for extracting domain terms and key sentences in the document based on dependency syntactic analysis;
And
The generation module is used for generating multi-type topics based on the extracted domain terms and key sentences;
The extraction module comprises a first extraction module and a second extraction module;
The first extraction module is configured to:
Establishing a dependency syntax structure for sentences in an input document, and generating candidate domain terms according to the dependency syntax rules;
evaluating and ranking the generated candidate domain terms;
extracting a specified number of domain terms based on the ranking result;
The second extraction module is configured to:
Calculating TF-IDF values of words in the input document;
Calculating the similarity between sentences in the document based on the TF-IDF value;
Calculating importance of sentences based on the similarity between sentences and sorting;
and extracting a specified number of key sentences based on the importance ranking result of the sentences.
Further, the first extraction module is specifically configured to,
Filtering non-terms from the candidate domain terms based on part-of-speech filtering rules; a candidate domain term is deleted when its part of speech satisfies any one of the following rules:
a. the word ends with a numeral, preposition, conjunction, or locative word;
b. the word is not a noun;
c. the word contains separators or symbols;
Filtering the candidate domain terms based on grammar filtering rules; a candidate domain term is retained when it satisfies any one of the following rules:
d. noun + noun;
e. (adjective or noun) + noun;
calculating the score of the filtered candidate domain terms:
where s represents the score, f_word is the frequency of the current candidate domain term, f(i) is the product of the number of candidate domain terms with frequency i and the frequency i, C_1 is the number of candidate domain terms with frequency 1, C is the total number of extracted candidate domain terms, n is the maximum frequency, and a is a hyper-parameter;
the candidate domain terms are ranked by score from largest to smallest.
Further, the second extraction module is specifically configured to,
The TF-IDF value of each word in the input document is calculated as:
word_p = (c_p / N) × log(m / e_p)
where word_p represents the TF-IDF value of word p, c_p is the number of occurrences of word p in the document, N is the total number of words in the document, m is the number of sentences in the document, and e_p is the number of sentences containing word p.
Further, the second extraction module is specifically configured to,
representing sentences as nodes of a graph that is fully connected by bidirectional edges;
iteratively calculating the importance of each sentence with the T-TextRank algorithm until convergence:
WS(S_i) = (1 - d) + d × Σ_{S_j ∈ In(S_i)} [w_ji / Σ_{S_k ∈ Out(S_j)} w_jk] × WS(S_j)
where WS(S_i) represents the importance of sentence S_i, d is the damping coefficient, In(S_i) is the set of nodes pointing to node S_i, Out(S_j) is the set of nodes pointed to by node S_j, w_ji is the similarity between sentence S_j and sentence S_i, and w_jk is the similarity between sentence S_j and sentence S_k;
Further, the generating module comprises a first generating module, a second generating module and a third generating module;
The first generation module is configured to:
obtaining the extracted key sentence list, matching domain terms from the domain term library as key information against the key sentence list, selecting sentences that contain a term acting as a subject or object component as stems, and using the corresponding domain term content as the correct option of the multiple-choice question;
Generating multiple-choice distractors (interference items) based on at least one of the following strategies:
segmenting the domain terms and selecting domain terms with the same part of speech as the correct option as interference items;
selecting domain terms that share an affix with the correct option as interference items;
training a word2vec model to obtain word vectors of the domain terms, and selecting domain terms as interference items based on the cosine similarity of their word vectors to the correct option;
selecting domain terms from the domain term library whose occurrence frequency in the document is close to that of the correct option as interference items;
The second generation module is configured to:
matching domain terms from the domain term library as key information against the key sentence library, selecting sentences that contain a term acting as a subject or object component as stems, and replacing the corresponding domain term content with a horizontal line to generate the Chinese fill-in-the-blank stem;
the third generation module is configured to,
when a key sentence contains a domain term together with at least one feature word such as "refers to", "means", "is called", "is also known as", "is defined as", or "is abbreviated as", generating a term-interpretation question stem;
when a key sentence contains at least one causal connective such as "because", "due to", or "therefore", generating a question stem by replacing the part of the sentence expressing the cause with a question word.
The beneficial effects achieved by the invention are as follows:
The invention extracts key sentences and domain terms based on dependency syntax information and automatically generates multiple question types from the domain terms. The core algorithm has good extensibility and can be applied to automatic question generation in specific fields; the extracted domain terms and key sentences greatly improve the importance of the generated questions, and the method has broad application prospects.
Drawings
FIG. 1 is a schematic flow diagram of a method for automatically generating Chinese questions based on domain terms and key sentences according to the invention;
FIG. 2 is a flow diagram of extracting domain terms based on dependency syntax analysis in one embodiment of the invention;
FIG. 3 is a flow diagram of evaluating candidate domain terms in one embodiment of the invention;
FIG. 4 is a flow chart of extracting key sentences based on the T-textRank algorithm according to one embodiment of the present invention;
FIG. 5 is a flow diagram of automatically generating multi-type Chinese questions based on extracted domain terms and key sentences in one embodiment of the invention.
Detailed Description
The invention is further described below. The following examples are only for more clearly illustrating the technical aspects of the present invention, and are not intended to limit the scope of the present invention.
The invention provides a Chinese question generation method based on domain terms and key sentences, which extracts key sentences and domain terms from dependency syntax information and automatically generates multiple question types based on the domain terms.
Dependency parsing (DP) is one of the key technologies of natural language processing; its main purpose is to reveal the syntactic structure of a sentence and determine the dependency relations between its words by analyzing the dependencies between language units. The dependency parsing tool may be any dependency parser capable of obtaining dependency relations between words, for example the Stanford dependency parser (Stanford Parser), the neural-network-based dependency parser in the Hanlp toolkit, or the beam-search dependency parser based on the ArcEager transition system.
The invention relates to a Chinese question generation method based on domain terms and key sentences, which comprises the following steps:
extracting domain terms and key sentences based on dependency syntax analysis;
a multi-type topic is generated based on the extracted domain terms and key sentences.
One embodiment of the present invention uses the neural-network-based dependency parser in the Hanlp toolkit as an example to generate Chinese questions based on domain terms and key sentences; the specific implementation process, shown in FIG. 1, includes:
step S1, extracting domain terms based on dependency syntax analysis.
In particular, as shown in fig. 2,
In step S101, the neural-network-based dependency parser in the Hanlp toolkit builds a dependency syntax structure for the input sentence.
In step S102, candidate domain terms are generated according to the following dependency syntax rule; Table 1 explains the dependency labels.
(dep)?+(amod|nn)*+(nsubj|dobj)
TABLE 1 Dependency label interpretation

Symbol/label    Meaning
?               Zero or one occurrence
*               Zero or more occurrences
dep             Dependency relation
amod            Adjectival modifier
nn              Noun compound modifier
nsubj           Nominal subject
dobj            Direct object
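As an illustration, the rule of step S102 can be applied by encoding each token's dependency label as a single character and matching an ordinary regular expression over the label sequence. This is a minimal sketch, not the patent's implementation: the example words, the "x" fallback label, and the single-character encoding are invented, and ? and * are given standard regex semantics.

```python
import re

# Minimal sketch (not the patent's code): encode each dependency label as one
# character and match the rule (dep)?(amod|nn)*(nsubj|dobj) as a regex.
LABEL_CODE = {"dep": "d", "amod": "a", "nn": "n", "nsubj": "s", "dobj": "o"}
RULE = re.compile(r"d?[an]*[so]")

def candidate_terms(tokens):
    """tokens: list of (word, dependency_label) pairs in sentence order."""
    codes = "".join(LABEL_CODE.get(label, "x") for _, label in tokens)
    terms = []
    for m in RULE.finditer(codes):
        # Chinese words are concatenated without spaces to form the candidate.
        terms.append("".join(w for w, _ in tokens[m.start():m.end()]))
    return terms

# Invented example: "高压" (amod) + "断路器" (nsubj) yields one candidate term.
print(candidate_terms([("高压", "amod"), ("断路器", "nsubj"), ("跳闸", "x")]))
```

A real implementation would read the labels directly from the dependency parser's output instead of the hand-written pairs above.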
Step S103: evaluate and rank the candidate domain terms using rules, word frequencies, and affixes.
In particular, as shown with reference to fig. 3,
Step S103-1, filtering non-term in candidate domain terms based on part-of-speech filtering rules.
The segmented words and their part-of-speech tags are obtained from the dependency syntax structure, and candidate words that cannot be terms are filtered by checking their parts of speech; for example, words containing personal pronouns are deleted. The non-term part-of-speech filtering rules are shown in Table 2; a candidate domain term is deleted when its part of speech satisfies any one of the rules.
TABLE 2 Part-of-speech filtering rules for non-terms

Rule 1    The word ends with a numeral, preposition, conjunction, or locative word
Rule 2    The word is not a noun
Rule 3    The word contains separators or symbols
Step S103-2: filter the candidate domain terms based on the grammar filtering rules shown in Table 3; a candidate domain term is retained when it satisfies any one rule.
TABLE 3 Grammar filtering rules

Rule 1    Noun + noun
Rule 2    (Adjective or noun) + noun
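The two filtering stages can be sketched as a single predicate over a candidate's part-of-speech sequence. The single-letter tags below are hypothetical (n=noun, a=adjective, m=numeral, p=preposition, c=conjunction, f=locative word, w=separator/symbol); a real implementation would use the tag set of the chosen parser.

```python
# Hedged sketch of the filtering in Tables 2 and 3 with hypothetical POS tags.
BAD_ENDINGS = {"m", "p", "c", "f"}  # Table 2, rule 1: forbidden final POS

def keep_candidate(pos_tags):
    """Return True if the candidate's POS sequence survives both tables."""
    if pos_tags[-1] in BAD_ENDINGS:      # Table 2, rule 1: bad word ending
        return False
    if "w" in pos_tags:                  # Table 2, rule 3: separator/symbol inside
        return False
    # Table 2 rule 2 and Table 3: the candidate must end in a noun and have
    # at least two words (noun+noun or (adjective|noun)+noun).
    if len(pos_tags) < 2 or pos_tags[-1] != "n":
        return False
    # Table 3: every preceding word must be a noun or an adjective.
    return all(t in ("n", "a") for t in pos_tags[:-1])

print(keep_candidate(["a", "n"]))  # adjective + noun → kept
print(keep_candidate(["n", "m"]))  # ends with a numeral → deleted
```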
Step S103-3, calculating the score of the candidate domain term through a multi-factor evaluator.
The multi-factor evaluator computes the score as shown in formula (1), considering the word-frequency statistics together with the prefix and suffix of the word group; the calculation distinguishes three cases: the candidate contains a hot-word term prefix, contains a non-term prefix, or contains neither.
where f_word is the frequency of the current word (candidate domain term), f(i) is the product of the number of words with frequency i and the frequency i, C_1 is the number of words with frequency 1, C is the total number of extracted candidate domain terms, n is the maximum frequency, and a is a hyper-parameter (experimentally set to 2).
Step S104, sorting from large to small according to the scores of the candidate domain terms, and extracting a specified number of domain terms.
Step S2: extract key sentences based on the T-TextRank algorithm.
In particular, as shown with reference to fig. 4,
Step S201: preprocess the input document, including sentence segmentation, word segmentation, and stop-word removal.
Step S202: compute the TF-IDF (term frequency-inverse document frequency) value of each word and represent each sentence as a feature vector.
Assuming the Chinese document is D and contains m sentences, D can be written as D = {S_1, S_2, …, S_m}; at the same time, each sentence can be represented by a feature word vector:
S_i = {word_i1, word_i2, …, word_iN},
where N is the number of words in the entire document and word_in is the TF-IDF value of word n in sentence S_i.
TF-IDF is calculated as shown in formula (2):
word_n = (c_n / N) × log(m / e_n)
where c_n is the number of occurrences of word n in the document, N is the total number of words in the document, and e_n is the number of sentences containing word n.
word_in then represents the TF-IDF value of word n in sentence S_i.
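The TF-IDF computation of step S202 can be sketched as follows, assuming the standard definition tf = c_n / N and idf = log(m / e_n); the patent's exact smoothing is not shown in the text, and the example sentences are invented.

```python
import math

# Sketch of formula (2) under the standard TF-IDF definition (assumption:
# tf = c_n / N, idf = log(m / e_n), no extra smoothing).
def tf_idf(word, doc_words, sentences):
    c_n = doc_words.count(word)              # occurrences of the word in the document
    N = len(doc_words)                       # total number of words in the document
    m = len(sentences)                       # number of sentences
    e_n = sum(word in s for s in sentences)  # sentences containing the word
    return (c_n / N) * math.log(m / e_n)

# Invented example document with three segmented sentences.
sentences = [["变压器", "运行"], ["变压器", "检修"], ["线路", "巡视"]]
doc_words = [w for s in sentences for w in s]
print(round(tf_idf("线路", doc_words, sentences), 4))  # → 0.1831
```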
In step S203, the similarity between sentences is calculated with cosine similarity, as shown in formula (3):
w_ij = Σ_n (word_in × word_jn) / (sqrt(Σ_n word_in²) × sqrt(Σ_n word_jn²))
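The cosine similarity of step S203 can be sketched directly over two sentence vectors whose components are TF-IDF values (the vectors below are illustrative):

```python
import math

# Sketch of formula (3): cosine similarity between two sentence vectors.
def cosine(v1, v2):
    dot = sum(a * b for a, b in zip(v1, v2))
    norm1 = math.sqrt(sum(a * a for a in v1))
    norm2 = math.sqrt(sum(b * b for b in v2))
    return dot / (norm1 * norm2) if norm1 and norm2 else 0.0

print(cosine([1.0, 0.0], [1.0, 0.0]))  # identical vectors → 1.0
print(cosine([1.0, 0.0], [0.0, 1.0]))  # orthogonal vectors → 0.0
```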
In step S204, sentence importance is ranked with the T-TextRank algorithm, calculated as shown in formula (4):
WS(S_i) = (1 - d) + d × Σ_{S_j ∈ In(S_i)} [w_ji / Σ_{S_k ∈ Out(S_j)} w_jk] × WS(S_j)
Each sentence is represented as a node, denoted S, and the bidirectional full connection between the sentences of the document forms a graph. The initial weight WS of each node is 1/m, and the initial weight of each edge is the sentence similarity w_ij. Here d is the damping coefficient, typically 0.85; In(S_i) denotes the set of nodes pointing to node S_i, and Out(S_j) denotes the set of nodes pointed to by node S_j. Formula (4) converges after several iterations.
Step S205: sort the converged WS values of the T-TextRank algorithm from largest to smallest and select the specified number of key sentences.
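Steps S204-S205 can be sketched as a plain power iteration over the similarity matrix; the matrix values below are invented, and a fixed iteration count stands in for a proper convergence check.

```python
# Sketch of the T-TextRank iteration of formula (4) on a fully connected
# sentence graph; w is a symmetric similarity matrix (invented values) and
# d is the damping coefficient of 0.85 from the text.
def textrank(w, d=0.85, iterations=100):
    m = len(w)
    ws = [1.0 / m] * m                  # initial node weight 1/m
    for _ in range(iterations):
        new_ws = []
        for i in range(m):
            total = 0.0
            for j in range(m):
                if j == i:
                    continue
                out_j = sum(w[j][k] for k in range(m) if k != j)
                if out_j:
                    total += w[j][i] / out_j * ws[j]
            new_ws.append((1 - d) + d * total)
        ws = new_ws
    return ws

w = [[0, 0.5, 0.1],
     [0.5, 0, 0.2],
     [0.1, 0.2, 0]]
scores = textrank(w)
print(max(range(3), key=scores.__getitem__))  # index of the top-ranked sentence → 1
```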
Note that the input of step S1 is processed in sentence units, while the input of step S2 is processed in text units.
And step S3, automatically generating multi-type Chinese questions based on the extracted domain terms and key sentences.
Specifically, as shown in FIG. 5, three types of Chinese questions are generated based on the extracted domain terms and key sentences.
S301: generate Chinese multiple-choice questions based on domain terms and key sentences. Specifically,
Step S301-1: generate the Chinese multiple-choice question stem. First, the extracted key sentence list is obtained; domain terms from the domain term library are matched against the key sentence list as key information; sentences that contain a term acting as a subject or object component are selected as stems, and the corresponding domain term content is used as the correct option.
Step S301-2: combine different linguistic features to generate multiple-choice interference items (distractors). The generation strategies are shown in Table 4.
TABLE 4 Interference item generation strategies

Strategy 1    Segment the domain terms and select domain terms with the same part of speech as the correct option
Strategy 2    Select domain terms that share an affix with the correct option
Strategy 3    Train a word2vec model to obtain word vectors and select domain terms by cosine similarity to the correct option
Strategy 4    Select domain terms from the library whose occurrence frequency in the document is close to the correct option's
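One of the interference-item strategies, picking domain terms whose document frequency is closest to the correct option's, can be sketched as follows; the term library and frequency counts are invented for illustration.

```python
# Sketch of the frequency-similarity distractor strategy: choose the k domain
# terms whose document frequency is closest to the correct option's frequency.
def frequency_distractors(correct, term_freq, k=3):
    target = term_freq[correct]
    others = [t for t in term_freq if t != correct]
    return sorted(others, key=lambda t: abs(term_freq[t] - target))[:k]

# Invented term library with document frequencies.
term_freq = {"变压器": 12, "断路器": 11, "互感器": 10, "避雷器": 4, "隔离开关": 3}
print(frequency_distractors("变压器", term_freq))  # → ['断路器', '互感器', '避雷器']
```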
S302, generating a Chinese gap-filling question stem based on the domain terms and the key sentences.
Specifically, domain terms from the domain term library are matched against the key sentence library as key information; sentences that contain a term acting as a subject or object component are selected as stems, and the corresponding domain term content is replaced with a horizontal line.
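The replacement step can be sketched in a few lines; the example sentence is invented, and a run of underscores stands in for the horizontal line.

```python
# Sketch of the fill-in-the-blank step: replace the matched domain term in a
# key sentence with a horizontal line to form the stem.
def make_cloze(sentence, term, blank="____"):
    return sentence.replace(term, blank) if term in sentence else None

print(make_cloze("变压器是利用电磁感应原理改变交流电压的装置。", "变压器"))
```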
S303: generate a Chinese question-and-answer stem based on the domain terms and key sentences.
When a key sentence contains a domain term together with a feature word such as "refers to", "means", "is called", "is also known as", "is defined as", or "is abbreviated as", a term-interpretation question stem is generated.
When a key sentence contains a causal connective such as "because", "due to", or "therefore", a question stem is generated by replacing the part of the sentence expressing the cause with a question word.
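A hedged sketch of the causal rule follows; the marker words ("因为"/"由于" for the cause, "所以" for the effect) and the question word "为什么" (why) are illustrative, since the patent's exact word lists and clause splitting are not fully shown.

```python
# Hedged sketch: when a cause marker and an effect marker both occur, drop the
# cause clause and prepend the question word to the effect clause.
CAUSE_MARKERS = ("因为", "由于")

def causal_question(sentence):
    for marker in CAUSE_MARKERS:
        if marker in sentence and "所以" in sentence:
            effect = sentence.split("所以", 1)[1].rstrip("。")
            return "为什么" + effect + "?"
    return None

print(causal_question("因为线路过载，所以断路器跳闸。"))  # → 为什么断路器跳闸?
```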
Another embodiment of the present invention provides a chinese question generation apparatus based on domain terms and key sentences, including:
The extraction module is used for extracting domain terms and key sentences in the document based on dependency syntactic analysis;
And
The generation module is used for generating multi-type topics based on the extracted domain terms and key sentences;
the extraction module comprises a first extraction module and a second extraction module;
In particular, the first extraction module is used for,
Establishing a dependency syntax structure for sentences in an input document, and generating candidate domain terms according to the dependency syntax rules;
evaluating and ranking the generated candidate domain terms;
extracting a specified number of domain terms based on the ranking result;
the second extraction module is used for extracting the data from the first extraction module,
Calculating TF-IDF values of words in the input document;
Calculating the similarity between sentences in the document based on the TF-IDF value;
Calculating importance of sentences based on the similarity between sentences and sorting;
and extracting a specified number of key sentences based on the importance ranking result of the sentences.
In the embodiment of the present invention, the first extraction module is specifically configured to,
Filtering non-terms from the candidate domain terms based on part-of-speech filtering rules; a candidate domain term is deleted when its part of speech satisfies any one of the following rules:
a. the word ends with a numeral, preposition, conjunction, or locative word;
b. the word is not a noun;
c. the word contains separators or symbols;
Filtering the candidate domain terms based on grammar filtering rules; a candidate domain term is retained when it satisfies any one of the following rules:
d. noun + noun;
e. (adjective or noun) + noun;
calculating the score of the filtered candidate domain terms:
where s represents the score, f_word is the frequency of the current candidate domain term, f(i) is the product of the number of candidate domain terms with frequency i and the frequency i, C_1 is the number of candidate domain terms with frequency 1, C is the total number of extracted candidate domain terms, n is the maximum frequency, and a is a hyper-parameter;
the candidate domain terms are ranked by score from largest to smallest.
In the embodiment of the present invention, the second extraction module is specifically configured to,
The TF-IDF value of each word in the input document is calculated as:
word_p = (c_p / N) × log(m / e_p)
where word_p represents the TF-IDF value of word p, c_p is the number of occurrences of word p in the document, N is the total number of words in the document, m is the number of sentences in the document, and e_p is the number of sentences containing word p.
In the embodiment of the present invention, the second extraction module is specifically configured to,
Representing sentences as nodes and connecting every pair of sentences with bidirectional edges, so that the sentences form a fully connected graph;
the importance of the sentence is iteratively calculated using the T-TextRank algorithm as follows until convergence:
Where WS(S_i) represents the importance of sentence S_i, d is the preset damping coefficient, In(S_i) represents all nodes pointing to node S_i, Out(S_j) represents all nodes pointed to by node S_j, w_ij represents the similarity of sentence S_i and sentence S_j, and w_jk represents the similarity of sentence S_j and sentence S_k;
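The iteration described above is the weighted, sentence-level TextRank recurrence. A minimal sketch, assuming the standard formulation WS(S_i) = (1 − d) + d · Σ_{S_j ∈ In(S_i)} [w_ji / Σ_{S_k ∈ Out(S_j)} w_jk] · WS(S_j) (the patent's T-TextRank variant may differ in details not reproduced here):

```python
def textrank(sim, d=0.85, tol=1e-6, max_iter=200):
    """Iterate sentence importance over a fully connected graph until
    convergence. `sim` is a symmetric n x n similarity matrix (w_ij).
    Returns the list of converged WS values, one per sentence."""
    n = len(sim)
    ws = [1.0] * n
    for _ in range(max_iter):
        new = []
        for i in range(n):
            acc = 0.0
            for j in range(n):
                if j == i:
                    continue
                # out-weight of node j: sum of its edge weights
                denom = sum(sim[j][k] for k in range(n) if k != j)
                if denom > 0:
                    acc += sim[j][i] / denom * ws[j]
            new.append((1 - d) + d * acc)
        if max(abs(a - b) for a, b in zip(new, ws)) < tol:
            return new
        ws = new
    return ws
```

The converged values would then be sorted in descending order to pick the top key sentences.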
In the embodiment of the invention, the generating module comprises a first generating module, a second generating module and a third generating module;
Wherein, the first generation module is configured to,
Obtaining the extracted key sentence list, matching within the key sentence list using the domain terms in the domain term library as key information, selecting sentences in which a term appears as the subject or object component as question stems, and taking the matched domain term as the correct option of the choice question;
Generating a choice question interference term based on at least one of the following strategies:
Performing word segmentation on the domain terms and selecting those with the same part of speech as the correct option as interference terms;
Selecting domain terms that share an affix with the correct option as interference terms;
obtaining a word vector of the domain term by training a word2vec model, and selecting the domain term as an interference item based on cosine similarity of the word vector of the domain term;
selecting domain terms with similar occurrence frequencies in documents from a domain term library as interference terms;
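The word-vector strategy above can be sketched as follows. The embeddings here are placeholders standing in for vectors obtained from a trained word2vec model:

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def pick_distractors(answer, vectors, k=3):
    """Rank the other domain terms by cosine similarity of their word
    vectors to the correct option and return the top-k as interference
    terms. `vectors` maps each domain term to its embedding."""
    scored = sorted(
        ((cosine(vectors[answer], vec), term)
         for term, vec in vectors.items() if term != answer),
        reverse=True)
    return [term for _, term in scored[:k]]
```

In practice the vectors would come from a word2vec model trained on the domain corpus; the other three strategies (same part of speech, shared affix, similar frequency) are simple filters over the domain term library.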
The second generation module is configured to,
Matching in the key sentence library using the domain terms in the domain term library as key information, selecting sentences in which a term appears as the subject or object component as question stems, and replacing the matched domain term with a horizontal-line blank to generate a Chinese gap-filling question stem;
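A minimal sketch of the gap-filling step; the blank marker and the simple substring match are illustrative:

```python
def make_cloze(sentence, term, blank="____"):
    """Turn a key sentence into a fill-in-the-blank stem by replacing
    the matched domain term with a horizontal-line blank. Returns
    (stem, answer), or None if the term does not occur in the sentence."""
    if term not in sentence:
        return None
    return sentence.replace(term, blank), term
```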
the third generation module is configured to,
When the key sentence contains a domain term together with at least one definitional feature word, such as "refers to", "means", "is a kind of", "is also known as", "is defined as", "is short for", "is also called", or "denotes", a term-interpretation question stem is generated;
When the key sentence contains at least one causal connective, such as "because", "therefore" or "so", a cause-and-effect question stem is generated by replacing the clause expressing the cause with a question word.
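The two rules above can be sketched as follows. The cue words are illustrative English stand-ins for the Chinese feature words and causal connectives actually listed in the patent, and the cause-clause replacement is deliberately naive:

```python
# Illustrative stand-ins; the patent's actual Chinese cue-word lists differ.
DEFINITION_CUES = ["refers to", "is defined as", "also known as", "is short for", "means"]
CAUSAL_CUES = ["because", "therefore"]

def make_qa_stem(sentence, term=None):
    """Generate a question-answering stem from one key sentence.

    Rule 1: a domain term plus a definitional feature word yields a
    term-interpretation stem ("What is X?").
    Rule 2: a causal connective yields a cause-and-effect stem, replacing
    the cause with a question word ("Why ...?").
    Matching is naive substring search; returns None if neither rule fires.
    """
    if term and term in sentence and any(c in sentence for c in DEFINITION_CUES):
        return f"What is {term}?"
    for cue in CAUSAL_CUES:
        if cue in sentence:
            effect = sentence.split(cue)[0].strip().rstrip(",.")
            if effect:
                return "Why " + effect[0].lower() + effect[1:] + "?"
    return None
```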
It should be noted that the apparatus embodiment corresponds to the method embodiment; the implementation manners of the method embodiment are applicable to the apparatus embodiment and achieve the same or similar technical effects, so a detailed description is omitted here.
It will be appreciated by those skilled in the art that embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
Finally, it should be noted that: the above embodiments are only for illustrating the technical aspects of the present invention and not for limiting the same, and although the present invention has been described in detail with reference to the above embodiments, it should be understood by those of ordinary skill in the art that: modifications and equivalents may be made to the specific embodiments of the invention without departing from the spirit and scope of the invention, which is intended to be covered by the claims.

Claims (10)

1. A Chinese question generation method based on domain terms and key sentences is characterized by comprising the following steps:
extracting domain terms and key sentences in the document based on dependency syntax analysis;
Generating multi-type topics based on the extracted domain terms and key sentences;
wherein extracting domain terms in the document based on the dependency syntax analysis includes:
establishing a dependency syntax structure for sentences in an input document, and generating candidate domain terms according to dependency syntax rules; wherein the dependency syntax structure is established using any one of the following: the Stanford dependency parser, the neural-network-based dependency parser in the HanLP toolkit, or a beam-search dependency parser based on the ArcEager transition system; the dependency syntax rule is:
(dep)?+(amod|nn)*+(nsubj|dobj);
wherein ? indicates zero or one occurrence, * indicates zero or more occurrences, + denotes concatenation, dep denotes a dependency relation, amod denotes an adjectival modifier, nn denotes a noun compound modifier, nsubj denotes a nominal subject, and dobj denotes a direct object;
evaluating and sequencing the generated candidate domain terms;
extracting a specified number of domain terms based on the ranking result;
Extracting key sentences in the document based on the dependency syntax analysis includes:
calculating TF-IDF values of words in the input document;
Calculating the similarity between sentences in the document based on the TF-IDF value;
Calculating importance of sentences based on the similarity between sentences and sorting;
Extracting a specified number of key sentences based on the importance ranking result of the sentences;
The generating multiple types of questions based on the extracted domain terms and key sentences includes generating a Chinese choice question stem, generating a Chinese gap-filling question stem, and generating a Chinese question-and-answer stem;
The generating the Chinese choice question stem comprises the following steps:
obtaining the extracted key sentence list, matching within the key sentence list using the domain terms in the domain term library as key information, selecting sentences in which a term appears as the subject or object component as question stems, and taking the matched domain term as the correct option of the choice question;
Generating a choice question interference term based on at least one of the following strategies:
Performing word segmentation on the domain terms and selecting those with the same part of speech as the correct option as interference terms;
Selecting domain terms that share an affix with the correct option as interference terms;
obtaining a word vector of the domain term by training a word2vec model, and selecting the domain term as an interference item based on cosine similarity of the word vector of the domain term;
selecting domain terms with similar occurrence frequencies in documents from a domain term library as interference terms;
the step of generating the Chinese gap-filling question stem comprises the following steps:
matching in the key sentence library using the domain terms in the domain term library as key information, selecting sentences in which a term appears as the subject or object component as question stems, and replacing the matched domain term with a horizontal-line blank to generate a Chinese gap-filling question stem;
the step of generating the Chinese question and answer stem comprises the following steps:
when the key sentence contains a domain term together with at least one definitional feature word, such as "refers to", "means", "is a kind of", "is also known as", "is defined as", "is short for", "is also called", or "denotes", a term-interpretation question stem is generated;
When the key sentence contains at least one causal connective, such as "because", "therefore" or "so", a cause-and-effect question stem is generated by replacing the clause expressing the cause with a question word.
2. The method for generating chinese questions based on domain terms and key sentences as claimed in claim 1, wherein evaluating and ranking the generated candidate domain terms comprises:
Filtering non-terms out of the candidate domain terms based on part-of-speech filtering rules, deleting a candidate when its part-of-speech sequence matches any one of the following rules:
a. the term ends with a numeral, a preposition, a conjunction, or a localizer;
b. the term is a non-noun;
c. the term contains separators or symbols;
Filtering the candidate domain terms based on grammar filtering rules, retaining a candidate when it matches any one of the following patterns:
d. noun + noun;
e. (adjective or noun) + noun;
calculating the score of the filtered candidate domain terms:
Where s represents the score, f_word is the frequency of the current candidate domain term, f(i) is the product of the number of candidate domain terms with frequency i and the frequency i, C_1 is the number of candidate domain terms with frequency 1, C is the total number of extracted candidate domain terms, n is the maximum frequency, and a is a hyper-parameter;
the candidate domain terms are ranked by score in descending order.
3. The method for generating Chinese questions based on domain terms and key sentences as claimed in claim 1, wherein the calculating TF-IDF values of words in the input document comprises:
Where word_p represents the TF-IDF value of word p, c_n is the number of occurrences of word p in the document, N is the total number of words in the document, m is the number of sentences in the document, and e_p is the number of sentences containing word p.
4. The method for generating chinese questions based on domain terms and key sentences as claimed in claim 3, wherein the calculating the similarity between sentences in the document based on TF-IDF values comprises:
Where w_ij represents the similarity between sentence S_i and sentence S_j, word_ip represents the TF-IDF value of word p in sentence S_i, and word_jp represents the TF-IDF value of word p in sentence S_j.
5. The method for generating Chinese questions based on domain terms and key sentences as claimed in claim 4, wherein the calculating the importance of sentences based on the similarity between sentences and ranking comprises:
Representing sentences as nodes and connecting every pair of sentences with bidirectional edges, so that the sentences form a fully connected graph;
the importance of the sentence is iteratively calculated using the T-TextRank algorithm as follows until convergence:
Where WS(S_i) represents the importance of sentence S_i, d is the preset damping coefficient, In(S_i) represents all nodes pointing to node S_i, and Out(S_j) represents all nodes pointed to by node S_j;
the converged sentence importance values are ranked in descending order.
6. A chinese question generation apparatus based on domain terms and key sentences, for implementing the chinese question generation method based on domain terms and key sentences as set forth in any one of claims 1 to 5, the apparatus comprising:
The extraction module is used for extracting domain terms and key sentences in the document based on dependency syntactic analysis;
And
The generation module is used for generating multi-type topics based on the extracted domain terms and key sentences;
The extraction module comprises a first extraction module and a second extraction module;
The first extraction module is configured to,
Establishing a dependency syntax structure for sentences in an input document, and generating candidate domain terms according to the dependency syntax rules;
evaluating and sequencing the generated candidate domain terms;
extracting a specified number of domain terms based on the ranking result;
The second extraction module is configured to,
Calculating TF-IDF values of words in the input document;
Calculating the similarity between sentences in the document based on the TF-IDF value;
Calculating importance of sentences based on the similarity between sentences and sorting;
and extracting a specified number of key sentences based on the importance ranking result of the sentences.
7. The apparatus for generating chinese questions based on domain terms and key sentences of claim 6, wherein the first extraction module is configured to,
Filtering non-terms out of the candidate domain terms based on part-of-speech filtering rules, deleting a candidate when its part-of-speech sequence matches any one of the following rules:
a. the term ends with a numeral, a preposition, a conjunction, or a localizer;
b. the term is a non-noun;
c. the term contains separators or symbols;
Filtering the candidate domain terms based on grammar filtering rules, retaining a candidate when it matches any one of the following patterns:
d. noun + noun;
e. (adjective or noun) + noun;
calculating the score of the filtered candidate domain terms:
Where s represents the score, f_word is the frequency of the current candidate domain term, f(i) is the product of the number of candidate domain terms with frequency i and the frequency i, C_1 is the number of candidate domain terms with frequency 1, C is the total number of extracted candidate domain terms, n is the maximum frequency, and a is a hyper-parameter;
the candidate domain terms are ranked by score in descending order.
8. The apparatus for generating chinese questions based on domain terms and key sentences of claim 6, wherein the second extraction module is configured to,
The TF-IDF values of the words in the input document are calculated as follows:
Where word_p represents the TF-IDF value of word p, c_n is the number of occurrences of word p in the document, N is the total number of words in the document, m is the number of sentences in the document, and e_p is the number of sentences containing word p.
9. The apparatus for generating chinese questions based on domain terms and key sentences of claim 8, wherein the second extraction module is configured to,
Representing sentences as nodes and connecting every pair of sentences with bidirectional edges, so that the sentences form a fully connected graph;
the importance of the sentence is iteratively calculated using the T-TextRank algorithm as follows until convergence:
Where WS(S_i) represents the importance of sentence S_i, d is the preset damping coefficient, In(S_i) represents all nodes pointing to node S_i, Out(S_j) represents all nodes pointed to by node S_j, w_ij represents the similarity of sentence S_i and sentence S_j, and w_jk represents the similarity of sentence S_j and sentence S_k;
the converged sentence importance values are ranked in descending order.
10. The apparatus for generating Chinese questions based on domain terms and key sentences as recited in claim 6, wherein the generation module comprises a first generation module, a second generation module and a third generation module;
the first generation module is configured to,
Obtaining the extracted key sentence list, matching within the key sentence list using the domain terms in the domain term library as key information, selecting sentences in which a term appears as the subject or object component as question stems, and taking the matched domain term as the correct option of the choice question;
Generating a choice question interference term based on at least one of the following strategies:
Performing word segmentation on the domain terms and selecting those with the same part of speech as the correct option as interference terms;
Selecting domain terms that share an affix with the correct option as interference terms;
obtaining a word vector of the domain term by training a word2vec model, and selecting the domain term as an interference item based on cosine similarity of the word vector of the domain term;
selecting domain terms with similar occurrence frequencies in documents from a domain term library as interference terms;
the second generation module is configured to generate, based on the first generation module,
Matching in the key sentence library using the domain terms in the domain term library as key information, selecting sentences in which a term appears as the subject or object component as question stems, and replacing the matched domain term with a horizontal-line blank to generate a Chinese gap-filling question stem;
the third generation module is configured to,
When the key sentence contains a domain term together with at least one definitional feature word, such as "refers to", "means", "is a kind of", "is also known as", "is defined as", "is short for", "is also called", or "denotes", a term-interpretation question stem is generated;
When the key sentence contains at least one causal connective, such as "because", "therefore" or "so", a cause-and-effect question stem is generated by replacing the clause expressing the cause with a question word.
CN202111019721.3A 2021-09-01 2021-09-01 Automatic Chinese problem generation method and device based on field terms and key sentences Active CN113705208B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111019721.3A CN113705208B (en) 2021-09-01 2021-09-01 Automatic Chinese problem generation method and device based on field terms and key sentences


Publications (2)

Publication Number Publication Date
CN113705208A CN113705208A (en) 2021-11-26
CN113705208B true CN113705208B (en) 2024-05-28

Family

ID=78658645


Country Status (1)

Country Link
CN (1) CN113705208B (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20170107282A (en) * 2016-03-15 2017-09-25 한국전자통신연구원 Apparatus and method for supporting decision making based on natural language understanding and question and answer
CN108363743A (en) * 2018-01-24 2018-08-03 清华大学深圳研究生院 A kind of intelligence questions generation method, device and computer readable storage medium
CN111930914A (en) * 2020-08-14 2020-11-13 工银科技有限公司 Question generation method and device, electronic equipment and computer-readable storage medium
CN112163405A (en) * 2020-09-08 2021-01-01 北京百度网讯科技有限公司 Question generation method and device
CN112686025A (en) * 2021-01-27 2021-04-20 浙江工商大学 Chinese choice question interference item generation method based on free text
CN113128206A (en) * 2021-04-26 2021-07-16 中国科学技术大学 Question generation method based on word importance weighting




Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant