CN111898343A - Similar topic identification method and system based on phrase structure tree - Google Patents


Info

Publication number
CN111898343A
Authority
CN
China
Prior art keywords
phrase structure
structure tree
information
tree
question
Prior art date
Legal status
Granted
Application number
CN202010765054.2A
Other languages
Chinese (zh)
Other versions
CN111898343B (en)
Inventor
陈鹏鹤
卢宇
余胜泉
刘杰飞
Current Assignee
Beijing Normal University
Original Assignee
Beijing Normal University
Application filed by Beijing Normal University
Priority to CN202010765054.2A
Publication of CN111898343A
Application granted
Publication of CN111898343B

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/12 Use of codes for handling textual entities
    • G06F40/14 Tree-structured documents
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/903 Querying
    • G06F16/90335 Query processing
    • G06F16/90344 Query processing by using string matching techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/22 Matching criteria, e.g. proximity measures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/253 Grammatical analysis; Style critique
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis
    • G PHYSICS
    • G09 EDUCATION; CRYPTOGRAPHY; DISPLAY; ADVERTISING; SEALS
    • G09B EDUCATIONAL OR DEMONSTRATION APPLIANCES; APPLIANCES FOR TEACHING, OR COMMUNICATING WITH, THE BLIND, DEAF OR MUTE; MODELS; PLANETARIA; GLOBES; MAPS; DIAGRAMS
    • G09B7/00 Electrically-operated teaching apparatus or devices working with questions and answers
    • G09B7/02 Electrically-operated teaching apparatus or devices working with questions and answers of the type wherein the student is expected to construct an answer to the question which is presented or wherein the machine gives an answer to the question presented by a student
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The invention provides a similar topic identification method and system based on a phrase structure tree. The method comprises the following steps: S1, performing text preprocessing on the input topics; S2, constructing a phrase structure tree from the topic information; and S3, pruning the phrase structure tree, traversing it, and judging whether the topics are similar according to the tree structure information of the phrase structure tree and the content information of its leaf nodes. The method is aimed mainly at comparing and identifying similar topics (exercise and examination questions) in primary and secondary school question banks: a phrase structure tree is constructed for each topic to be compared, and the similarity of the topics is then evaluated by comparing the phrase structure trees, thereby reducing redundancy in the question bank.

Description

Similar topic identification method and system based on phrase structure tree
Technical Field
The invention relates to the technical field of education, in particular to a similar topic identification method and system based on a phrase structure tree.
Background
Question data are an important component of educational resources: the exercises that students use daily and the examination questions used for testing during learning and teaching all belong to question data. With the development of computer and internet technology, question data in primary and secondary education are now largely stored electronically. Question data help students deepen their learning and understanding of knowledge, and help teachers grasp in time how well students have mastered the material and how their learning is progressing, so that gaps can be found and filled and learning efficiency improved.
Building a multidisciplinary question bank for primary and secondary schools makes the question data easier to update and manage, and also reduces teachers' workload. As the question data in the bank are continuously updated and expanded, two or more identical or similar topics can appear in the bank. On the one hand, identical or similar topics make the question bank redundant and bloated, consuming more storage and computing resources; on the other hand, they reduce the efficiency of retrieving and using the question bank data.
It is therefore necessary to screen the topics in the question bank and remove identical or similar ones. In the task of identifying similar topics, evaluating and calculating the similarity of two topics is the most important step. Current topic similarity calculation methods mainly treat the topics to be compared as two continuous character strings. One approach evaluates similarity through a distance measure on the strings, for example by representing the text as vectors and calculating the cosine of the angle or the Euclidean distance between the two vectors; another approach reduces the dimensionality of the text, for example by generating a SimHash value (a fingerprint) for each string and evaluating the similarity of the two strings from their SimHash values.
Note that all of the above methods treat a topic as a single character string, whereas in practice a complete topic often contains different forms of expression, such as ordinary text and formula expressions. If the whole topic is simply processed as a character string, the similarity of topics cannot be evaluated accurately. In addition, some topics consist of the same characters but have different sentence structures, so they convey different information and are actually different topics, for example "the reciprocal of the opposite number of -3" and "the opposite number of the reciprocal of -3". A method that determines more accurately whether topics are the same is therefore needed. The phrase structure tree is a structure that represents the key positions and key information in a sentence well.
Disclosure of Invention
To address the above problems, the invention provides a similar topic identification method and system based on a phrase structure tree. Text preprocessing is performed on the topic data, and the knowledge point information and formula information involved in the topics are analyzed; a phrase structure tree is then constructed from the topic information, the constructed tree is pruned and traversed level by level, and the structure information of the trees and the content information of the leaf nodes are compared, thereby comparing the similarity of two topics.
According to one aspect of the invention, a similar topic identification method based on a phrase structure tree is provided, comprising the following steps:
S1, performing text preprocessing on the input topics;
S2, constructing a phrase structure tree from the topic information;
S3, pruning the phrase structure tree, traversing the phrase structure tree, and judging the similarity of the topics according to the tree structure and the leaf node content of the phrase structure tree.
2. The method according to claim 1, wherein in step S1, performing text preprocessing on the input topics comprises:
S11, performing unified encoding, word segmentation, stop-word removal, and removal of useless and illegal characters to obtain a word sequence;
S12, analyzing and identifying the knowledge point information involved in the topic according to the keywords in the topic;
S13, extracting the formula expression information in the topic by means of regular expressions.
3. The method according to claim 2, wherein in step S2, constructing the phrase structure tree from the topic information comprises:
S21, performing lexical analysis on the word sequence;
S22, performing syntactic analysis on the word sequence;
S23, constructing the phrase structure tree from the results of the lexical analysis and the syntactic analysis.
4. The method according to claim 1, wherein in step S3, the pruning comprises:
S31, pruning parenthetical expressions;
S32, pruning words without practical meaning.
5. The method according to claim 4, wherein judging the similarity of the topics in step S3 comprises:
S33, comparing the structures of the phrase structure trees of the topics; if the tree structure information differs, judging that the topics are different; otherwise, proceeding to step S34;
S34, comparing whether the content information of the phrase structure trees is the same; if not, judging that the topics are different; otherwise, judging that the topics are the same.
6. The method according to claim 5, wherein in step S34, comparing the content information of the phrase structure trees comprises:
comparing whether the knowledge point information involved in the topics is the same, and if not, judging that the topics are different;
comparing whether the formula expressions contained in the phrase structure trees are the same, and if not, judging that the topics are different;
setting different weight values for different parts of speech and calculating the similarity of the two phrases; if the similarity is greater than a set threshold, judging that the topics are the same; otherwise, judging that the topics are different.
7. The method according to claim 6, wherein the similarity is calculated by the formula:
Figure BDA0002614212030000031
where w_i is the weight of the part of speech corresponding to the i-th word in the leaf nodes of the phrase structure tree, and c_i is the comparison result for the i-th word segment of the two phrase structure trees: c_i = 1 if the i-th word segments are the same, otherwise c_i = 0.
8. A similar topic identification system based on a phrase structure tree, characterized by comprising a topic text preprocessing module, a phrase structure tree building module, and a topic judgment module, wherein:
the topic text preprocessing module is used for reading the topic information to be compared and the topic information of the question bank, performing the corresponding text preprocessing on the topic text, analyzing the knowledge point information and formula expression information in the topic, and finally passing the topic information to the phrase structure tree building module;
the phrase structure tree building module is used for performing lexical analysis and syntactic analysis on the topic according to the topic information obtained by the topic text preprocessing module, building a phrase structure tree in combination with the knowledge point information and formula expression information in the topic, and passing the phrase structure tree to the topic judgment module;
the topic judgment module is used for pruning the phrase structure trees of the topics to be compared, traversing the phrase structure trees level by level, judging the similarity of the topics according to the tree structure information of the phrase structure trees and the topic content information, and processing the topics accordingly.
9. The system according to claim 8, wherein in the topic text preprocessing module, preprocessing the topic text comprises:
performing unified encoding, word segmentation, stop-word removal, and removal of useless and illegal characters to obtain a word sequence;
analyzing and identifying the knowledge point information involved in the topic according to the keywords in the topic;
extracting the formula expression information in the topic by means of regular expressions.
10. The system according to claim 8, wherein the topic judgment module judging the similarity of the topics according to the tree structure information of the phrase structure trees and the topic content information comprises:
comparing whether the knowledge point information involved in the topics is the same, and if not, judging that the topics are different;
comparing whether the formula expressions contained in the phrase structure trees are the same, and if not, judging that the topics are different;
setting different weight values for different parts of speech and calculating the similarity between the phrases; if the similarity is greater than a set threshold, judging that the topics are the same; otherwise, judging that the topics are different.
The beneficial effects of the invention are as follows:
(1) For topic representation in the similar-topic comparison process, the invention uses a phrase structure tree to analyze the structure of a topic, achieving a fine-grained structural representation of the topic description.
(2) For similar-topic comparison, the invention prunes the phrase structure tree representation to extract its main part and then compares the phrase structure trees, so that topics are compared at the level of topic structure.
(3) For similar-topic comparison, on top of the phrase structure tree comparison, the invention compares the knowledge point information, formula information, and specific text information of the topics at fine granularity, improving the accuracy of the similarity judgment.
Drawings
FIG. 1 is a flow chart illustrating a similar topic identification method based on a phrase structure tree according to an embodiment of the present invention;
FIG. 2 is a schematic flow chart of a topic text preprocessing method according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of a process for constructing a phrase structure tree for topics according to an embodiment of the present invention;
FIG. 4 is a diagram of a phrase structure tree;
FIG. 5 is a diagram of a phrase structure tree;
FIG. 6 is a diagram of a phrase structure tree;
FIG. 7 is a flow chart illustrating topic similarity determination according to one embodiment of the present invention;
FIG. 8 is a diagram of a phrase structure tree;
FIG. 9 is a diagram of a phrase structure tree;
FIG. 10 is a diagram of a phrase structure tree;
FIG. 11 is a diagram of a phrase structure tree;
FIG. 12 is a diagram illustrating a structure of a similar topic identification system based on a phrase structure tree according to an embodiment of the present invention.
Detailed Description
To make the objects, technical solutions, and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments are described in detail below with reference to the accompanying drawings. The embodiments described are some, but not all, of the embodiments of the present invention. All other embodiments obtained by a person skilled in the art on the basis of these embodiments without inventive effort fall within the scope of protection of the present invention.
A phrase structure tree presents the result of phrase structure analysis of a sentence as a tree: for each input sentence, the analysis is completed by constructing a phrase tree, which shows both the grammatical relations and the hierarchy of the sentence. The phrase structure of a sentence can be read quickly from the tree; for example, the node label NP indicates that a part is a noun phrase. When two phrases share the same nearest parent node, they are called same-level phrases. The phrase structure tree can also be used to analyze coordinate structures, clause structures, and the like.
The invention is described in detail below with reference to the figures and specific embodiments. According to one aspect of the present invention, a similar topic identification method based on a phrase structure tree is provided which, as shown in FIG. 1, includes the following steps:
S1, performing text preprocessing on the input topics;
S2, constructing a phrase structure tree from the topic information;
S3, pruning the phrase structure tree, traversing the phrase structure tree, and judging the similarity of the topics according to the tree structure of the phrase structure tree and the similarity of the leaf node contents.
In step S1, the presentation form of a topic often differs depending on how it is stored and where it is used. To meet different display requirements, topics may be encoded in different ways, such as GB2312, GBK, or UTF-8. Uniform text preprocessing therefore needs to be performed on the topics to be compared, so that the subsequent similarity comparison can be carried out and its accuracy improved.
As shown in FIG. 2, text preprocessing mainly includes the following operations:
(1) Analyzing the knowledge point information involved in the topic according to the keywords in the topic.
(2) Identifying the formula expression information in the topic by means of regular expressions.
(3) Unified encoding:
  - unify the topic encoding format to UTF-8;
  - normalize characters: for example, full-width characters such as "４" and "ａ" may appear in a topic and are normalized to "4" and "a";
  - convert the various kinds of spaces to Chinese spaces and the various kinds of punctuation marks to Chinese punctuation marks, for example the English "?" to the Chinese "？";
  - convert the English letters in the topic to lower case.
(4) Word segmentation:
  - the Chinese text of the topic is first segmented into words;
  - after segmentation, the topic content is converted into a sequence of words separated by spaces.
(5) Stop-word removal:
  - to improve the accuracy of the similarity comparison, words that are of little importance to the topic are removed, using a common stop-word list.
(6) Removal of useless and illegal symbols:
  - remove empty brackets, including brackets that contain only one or more spaces, such as "()" and "( )";
  - remove redundant or unmatched symbols at the end of the topic description, such as a stray "?", "(", "[", or "{";
  - remove meaningless option labels when the description contains only such labels, such as "A, B, CD", "A, B, C, D", and "A, B, C";
  - remove line feeds, tabs, underscores, and illegal characters such as "□", "\xa0", "\xc2", "\x0b", "\x0c", "\x0d", and "\x0f";
  - remove characters outside the character set, such as emoji and other garbled characters that cannot be displayed normally.
These operations need not be performed in the order listed; those skilled in the art can arrange the specific execution steps as required.
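A minimal Python sketch of part of this preprocessing is given below for illustration. It covers only decoding, full-width normalization, lower-casing, and symbol cleanup; word segmentation, stop-word removal, and conversion to Chinese punctuation are not shown. The function names, the regular expressions, and the default GBK input encoding are assumptions made for this sketch and are not specified by the source.

```python
import re

# Assumed cleanup rules; the text lists the operations but not their implementation.
EMPTY_BRACKETS = re.compile(r"[(（]\s*[)）]")          # "()" or "( )", ASCII or Chinese
TRAILING_JUNK = re.compile(r"[(\[{（【?？\s]+$")       # stray symbols at the end
CONTROL_CHARS = re.compile(r"[\x00-\x1f\x7f\xa0□_]")  # line feeds, tabs, underscores, etc.


def to_halfwidth_alnum(text):
    """Map full-width digits and Latin letters to their ASCII forms."""
    out = []
    for ch in text:
        code = ord(ch)
        if (0xFF10 <= code <= 0xFF19 or 0xFF21 <= code <= 0xFF3A
                or 0xFF41 <= code <= 0xFF5A):
            ch = chr(code - 0xFEE0)  # fixed offset between the two blocks
        out.append(ch)
    return "".join(out)


def preprocess(raw, encoding="gbk"):
    """Decode bytes if needed (Python str is Unicode), then clean the text."""
    text = raw.decode(encoding) if isinstance(raw, bytes) else raw
    text = to_halfwidth_alnum(text).lower()
    text = CONTROL_CHARS.sub("", text)
    text = EMPTY_BRACKETS.sub("", text)
    text = TRAILING_JUNK.sub("", text)
    return text
```

Removing a trailing "?" is one possible reading of "redundant symbols at the end"; a stricter variant could strip only repeated or unmatched symbols.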
A knowledge point is a general term for a specific item of knowledge, in particular knowledge covered in textbooks or examinations. For example, the problem of calculating the length of the hypotenuse of a given right triangle belongs to the knowledge point "Pythagorean theorem". The keywords in the topic information are matched against a predefined knowledge point library to obtain the knowledge point information involved in the topic.
The knowledge point library contains each knowledge point that appears in primary and secondary school, together with its descriptive keywords (the keywords describe the specific knowledge under a knowledge point; for example, the knowledge point "trigonometric functions" includes keywords such as "arbitrary angle", "radian measure", "sine", "cosine", and "tangent"). The structure of the knowledge point library and example entries are shown in Table 1.
Table 1. Knowledge point library structure and examples
Figure BDA0002614212030000081
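The keyword matching described above can be sketched as a simple dictionary lookup. This is an illustrative sketch only: the library contents below are hypothetical stand-ins for Table 1 (only the "trigonometric functions" keywords and the "equation" keyword are mentioned in the text), and substring matching is an assumed matching rule.

```python
# Hypothetical knowledge point library: knowledge point name -> descriptive keywords.
KNOWLEDGE_POINTS = {
    "trigonometric functions": {"arbitrary angle", "radian", "sine", "cosine", "tangent"},
    "functions and equations": {"equation"},
    "Pythagorean theorem": {"right triangle", "hypotenuse"},
}


def match_knowledge_points(text, library=KNOWLEDGE_POINTS):
    """Return the sorted names of knowledge points whose keywords occur in the text."""
    text = text.lower()
    return sorted(
        name
        for name, keywords in library.items()
        if any(keyword in text for keyword in keywords)
    )
```

In practice the match would run on the segmented word sequence rather than on raw substrings, which avoids accidental hits inside longer words.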
In one embodiment, the input topic is as follows:
Topic: The opposite number of the root of the equation x^2+6*x+9=0 is ( )?
Through text preprocessing, English brackets are uniformly converted into Chinese brackets, spaces are removed, and empty brackets are removed.
Analyzing the topic information yields the formula expression of the topic: x^2+6*x+9=0
Matching the topic information against the knowledge point library shows that the keyword "equation" in the topic belongs to the knowledge point "functions and equations".
The topic information consists of the word sequence, knowledge point information, and formula information generated by the above processing.
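As an illustration of regular-expression formula extraction, the sketch below pulls out runs of operands joined by arithmetic operators. The pattern is an assumption made for this example (the source does not give its regular expressions), it expects lower-cased input, and it would also match hyphenated English words, so a production pattern would need to be stricter.

```python
import re

# Assumed pattern: optional leading operand, then one or more "operator + operand" groups.
FORMULA = re.compile(r"[0-9a-z()]*(?:[+\-*/^=][0-9a-z()]+)+")


def extract_formulas(text):
    """Return every formula-like substring found in the (lower-cased) text."""
    return FORMULA.findall(text)
```

For the embodiment above, `extract_formulas("the opposite number of the root of the equation x^2+6*x+9=0 is")` yields the single expression `x^2+6*x+9=0`.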
In step S2, as shown in FIG. 3, a phrase structure tree is constructed from the topic information, mainly through the following steps.
On the basis of word segmentation, lexical analysis and syntactic analysis are performed on the word sequence in the topic information, and the phrase structure tree is then constructed: each tree node represents a relation between words, and the leaf nodes hold the word segments of the topic, so that the topic information is converted into a phrase structure tree representation.
In computer science, a phrase structure tree is a data structure that expresses the syntactic structure of a sentence. This idea is applied here to the processing of topic information: the topic information is built into a tree representation in which the leaf nodes correspond to the words of the input sentence and the intermediate nodes carry the labels of phrase constituents. For example, NP indicates a noun phrase and VP indicates a verb phrase. Constructing the phrase structure tree mainly involves lexical analysis and syntactic analysis.
Lexical analysis is the process of matching lexical rules against the input character string: the text to be analyzed is scanned character by character from left to right, and the part of speech of each word in the segmentation result is analyzed and classified into one of 12 categories. The part-of-speech categories are nouns, verbs, adjectives, numerals, measure words, pronouns, adverbs, prepositions, conjunctions, auxiliary words, interjections, and onomatopoeia.
Syntactic analysis takes the segmented word stream as input and identifies whether the word sequence given by lexical analysis forms a sentence that conforms to the grammar rules. Modern Chinese grammar has many sentence structures, such as the subject-predicate structure and the verb-object structure; syntactic analysis mainly analyzes the sentence structure information in the topic information.
The phrase structure tree makes the relations between the parts of a sentence clear. The relations represented by the nodes of the phrase structure tree and their meanings are shown in Table 2:
Table 2. Node labels in the phrase structure tree and their meanings
Figure BDA0002614212030000091
Figure BDA0002614212030000101
For example, for the sentence "The lovely classmates sit on the speeding high-speed train", the constructed phrase structure tree is shown in FIG. 4. It is stored as the following bracketed representation:
[S [VP [CP [ADJP lovely] [NP classmates]] [VV sit]] [NP [CP [VP speeding] [NN high-speed train]] [LC on]]]
As another example, for the topic information "The opposite number of the root of the equation x^2+6*x+9=0 is", the constructed phrase structure tree is shown in FIG. 5, and its stored representation is:
[S [NP [NN equation] [NR x^2+6*x+9=0] [DNP of]] [NP [NN root] [DNP of] [NN opposite number]] [VV is]]
The formula information is: x^2+6*x+9=0
The related knowledge point information is: functions and equations.
As a further example, for the topic information "The opposite number of the reciprocal of -3 is", the constructed phrase structure tree is shown in FIG. 6, and its stored representation is:
[S [NP [NR -3] [DNP of]] [NP [NN reciprocal] [DNP of] [NN opposite number]] [VV is]]
The formula information is: -3
The related knowledge point information is: functions and equations.
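To make the stored bracketed representation concrete, the following sketch parses such a string into nested Python lists of the form `[label, child, ...]`, where a child is either another list or a word string. The representation and the function names `parse_tree` and `leaves` are assumptions chosen for illustration; the source does not describe how the stored form is read back.

```python
import re


def parse_tree(text):
    """Parse '[S [NP [NN root]] [VV is]]' into ['S', ['NP', ['NN', 'root']], ['VV', 'is']]."""
    tokens = re.findall(r"\[|\]|[^\[\]\s]+", text)
    pos = 0

    def node():
        nonlocal pos
        if tokens[pos] != "[":
            raise ValueError("expected '['")
        label = tokens[pos + 1]
        pos += 2
        items = [label]
        while tokens[pos] != "]":
            if tokens[pos] == "[":
                items.append(node())
            else:
                # Join consecutive word tokens so multi-word leaves stay together.
                words = []
                while tokens[pos] not in ("[", "]"):
                    words.append(tokens[pos])
                    pos += 1
                items.append(" ".join(words))
        pos += 1  # consume the closing ']'
        return items

    tree = node()
    if pos != len(tokens):
        raise ValueError("trailing tokens after the root node")
    return tree


def leaves(tree):
    """Return (part-of-speech label, word) pairs in left-to-right order."""
    out = []
    for child in tree[1:]:
        if isinstance(child, list):
            out.extend(leaves(child))
        else:
            out.append((tree[0], child))
    return out
```

Unbalanced input raises an `IndexError` or `ValueError` in this sketch rather than being repaired.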
In step S3, the phrase structure tree is pruned and traversed, and the similarity of the topics is judged according to the tree structure information of the phrase structure tree and the topic content information.
As shown in FIG. 7, judging topic similarity on the basis of the phrase structure tree mainly involves two stages: first, pruning the phrase structure trees, traversing them, and comparing their tree structure information; second, comparing the content information of the phrase structure trees, which covers the knowledge point information involved in the topics, the formula information, and the specific topic content. The specific steps are as follows:
(1) Pruning the phrase structure tree:
Pruning the phrase structure tree includes pruning parenthetical expressions and pruning words without practical meaning, such as modal particles, onomatopoeia, and punctuation nodes. A parenthetical expression is an independent element of the sentence, and removing it simplifies the sentence. Words without practical meaning carry little or no semantic information, and removing them does not affect the meaning expressed by the sentence.
The part labeled PRN in the phrase structure tree is a parenthetical expression; it is pruned by deleting the node and all of its child nodes and then merging the remaining parts. The part labeled Y is a modal particle, the part labeled O is onomatopoeia, and the part labeled PU is a punctuation node; these parts are pruned in the same way. The phrase structure tree before pruning is shown in FIG. 8, and its stored representation is:
[S [NP [NN Xiaoming] [VP [VV have] [QP [CD three] [M]] [NN rabbit] [PU]]] [VP [D and] [VV get] [QP [CD two] [M]] [PU]] [VP [P in total] [QP [CD several] [M]] [PU ?]]]
The phrase structure tree after pruning is shown in FIG. 9, and its stored representation is:
[S [NP [NN Xiaoming] [VP [VV have] [QP [CD three] [M]] [NN rabbit]]] [VP [VV get] [QP [CD two] [M]]] [VP [P in total] [QP [CD several] [M]]]]
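The pruning rule can be sketched as a recursive filter over a nested-list tree, where each node is `[label, child, ...]` and a child is either another list or a word string. That representation is an assumption made for this example. The sketch removes only the four labels named in the text (PRN, Y, O, PU); the example above also drops the `[D and]` node, which this sketch does not attempt to reproduce.

```python
# Labels named in the text: parenthetical, modal particle, onomatopoeia, punctuation.
PRUNE_LABELS = {"PRN", "Y", "O", "PU"}


def prune(tree, labels=PRUNE_LABELS):
    """Return a copy of the tree without nodes whose label is in `labels`.

    A pruned node is dropped together with all of its children; the
    remaining siblings stay in their original order.
    """
    label, *children = tree
    kept = [label]
    for child in children:
        if isinstance(child, list):
            if child[0] in labels:
                continue  # drop the whole subtree
            kept.append(prune(child, labels))
        else:
            kept.append(child)  # leaf word
    return kept
```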
(2) Traversing the phrase structure tree:
The phrase structure tree is traversed level by level (breadth-first). The algorithm is as follows:
initialize a queue Q and add the root node S of the phrase structure tree to the queue;
while the queue Q is not empty:
    take the head node out of the queue Q;
    visit the node's value;
    for each child of the node, if the child exists and is not a leaf node, add it to the queue.
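The level-order traversal above can be written directly with a double-ended queue. The sketch assumes a nested-list tree in which each node is `[label, child, ...]` and leaf words are plain strings; this representation and the function name are illustrative assumptions.

```python
from collections import deque


def level_order_labels(tree):
    """Visit non-leaf nodes breadth-first and return their labels in visiting order."""
    queue = deque([tree])            # initialize the queue with the root node
    visited = []
    while queue:
        node = queue.popleft()       # take out the head node
        visited.append(node[0])      # "visit" the node: record its label
        for child in node[1:]:
            if isinstance(child, list):  # skip leaf words, enqueue inner nodes
                queue.append(child)
    return visited
```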
(3) Comparing the tree structure information of the phrase structure trees:
The tree structure information of the phrase structure trees of the two questions is compared. If the tree structures differ, the questions are judged to be different; otherwise, the content information of the phrase structure trees is compared next.
The specific comparison process is as follows:
The two phrase structure trees to be compared, T1 and T2, are traversed level by level. First, two queues P and Q are initialized, and the root nodes S1 and S2 of the two trees are added to P and Q respectively. The head nodes of the two queues, namely S1 and S2, are then taken out and compared. If the contents of S1 and S2 are the same and the contents of their subtree nodes C1 and C2 are also the same, the subtree nodes C1 and C2 are added to queues P and Q respectively. Otherwise, the structures of the two phrase structure trees are directly judged to be different.
After each round of comparison, the queues P and Q are checked. If neither queue is empty, the head nodes are again taken out and compared in the same way. If one queue is empty and the other is not, the structures of the two phrase structure trees are judged to be different. If both queues are empty, the structure comparison is finished.
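The two-queue comparison described above can be sketched as follows. This is an illustrative sketch under assumptions, not the patented implementation: the `Node` class is hypothetical, only node labels are compared (words are ignored at this stage), and all child nodes are enqueued so that the two queues advance in lockstep.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical tree node holding a label and its child nodes."""
    label: str
    children: list = field(default_factory=list)


def same_structure(t1, t2):
    """Compare two trees level by level using queues P and Q."""
    p, q = deque([t1]), deque([t2])
    while p and q:
        n1, n2 = p.popleft(), q.popleft()
        labels1 = [c.label for c in n1.children]
        labels2 = [c.label for c in n2.children]
        # The node contents and the contents of the subtree nodes must agree.
        if n1.label != n2.label or labels1 != labels2:
            return False
        p.extend(n1.children)
        q.extend(n2.children)
    # One queue empty and the other not also means the structures differ.
    return not p and not q


# Trees loosely modeled on a pair whose first "NP" subtrees differ.
t1 = Node("S", [Node("NP", [Node("NN"), Node("NR"), Node("DNP")]),
                Node("NP", [Node("NN"), Node("NN")]),
                Node("VV")])
t2 = Node("S", [Node("NP", [Node("NR"), Node("DNP")]),
                Node("NP", [Node("NN"), Node("NN")]),
                Node("VV")])
differs = not same_structure(t1, t2)
```

Returning as soon as a mismatch is found mirrors the "directly judged to be different" step, so later levels of the trees are never visited once the structures diverge.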
(4) Comparing the question content information of the phrase structure trees:
The question content information of the phrase structure trees is compared as follows:
First, the knowledge point information of the two questions is compared; if it differs, the questions are judged to be different. If the knowledge point information is the same, the formula expressions contained in the phrase structure trees are compared; if they differ, the questions are judged to be different. If the formula expression information is the same, the specific content of the questions is compared. In this content comparison, a weight is set for each part-of-speech category and the similarity of the two phrases is calculated; if the similarity is greater than a set threshold, the questions are judged to be the same, otherwise they are judged to be different. The similarity score is calculated as follows:
Figure BDA0002614212030000121
where w_i is the weight of the part of speech of the i-th word in the leaf nodes of the phrase structure tree, and c_i is the comparison result for the i-th word of the two phrase structure trees: c_i = 1 if the i-th words are the same, and c_i = 0 otherwise.
For example, compare the question "The Four Great Inventions made an outstanding contribution to the world" with the question "Yao Ming made a great contribution to the sports world". The nouns are {Four Great Inventions, Yao Ming, world, sports world, contribution}, the verbs are {to, give, make}, and the adjectives are {outstanding, great}. The word-by-word alignment is as follows:
Figure BDA0002614212030000131
Part of speech: noun, verb, noun, verb, adjective, noun
In one embodiment, the weights of nouns, verbs, and adjectives are set to 0.2, 0.3, and 0.1 respectively, and the threshold is set to 0.8. Then
Figure BDA0002614212030000132
Part of speech: noun, verb, noun, verb, adjective, noun
Figure BDA0002614212030000133
Therefore, score = (0.2 × 0 + 0.3 × 0 + 0.2 × 0 + 0.3 × 1 + 0.1 × 0 + 0.2 × 1)/(0.2 × 1+0.3 × 1+0.1 + 1+0.2 × 1) = 0.4167, which is smaller than the set threshold of 0.8. The contents of the two phrases are therefore judged to be different, that is, the questions are not the same.
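A weighted comparison of this kind can be sketched as follows. The sketch assumes that the score is the sum of w_i × c_i divided by the sum of all w_i; that normalization is an assumption, because the formula itself survives here only as a figure reference, so the value this code produces need not equal the figure quoted in the text. The function name, the English word lists, and the position-by-position alignment are likewise illustrative.

```python
def similarity_score(words1, words2, pos_tags, weights):
    """Weighted share of aligned positions whose words match (assumed formula)."""
    matched = total = 0.0
    for a, b, tag in zip(words1, words2, pos_tags):
        w = weights[tag]          # w_i: weight of the part of speech
        total += w
        if a == b:                # c_i = 1 when the i-th words are the same
            matched += w
    return matched / total if total else 0.0


weights = {"noun": 0.2, "verb": 0.3, "adjective": 0.1}
pos_tags = ["noun", "verb", "noun", "verb", "adjective", "noun"]
q1 = ["Four Great Inventions", "to", "world", "make", "outstanding", "contribution"]
q2 = ["Yao Ming", "give", "sports world", "make", "great", "contribution"]

score = similarity_score(q1, q2, pos_tags, weights)
is_same = score > 0.8             # threshold from the embodiment above
```

Only the fourth and sixth positions match, so the score stays well below the threshold of 0.8 and `is_same` is `False`, which agrees with the conclusion of the example.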
In another example, the two topics to be compared are as follows:
Topic 1: The inverse of the root of the equation x^2 + 6x + 9 = 0 is
Topic 2: The inverse of the reciprocal of -3 is
The constructed phrase structure tree is represented as:
Topic 1: [ S [ NP [ NN equation ] [ NR x^2+6x+9=0 ] [ DNP ] ] [ NP [ NN root ] [ NN inverse of DNP ] ] [ VV is ] ]
Topic 2: [ S [ NP [ NR -3 ] [ DNP ] ] [ NP [ NN reciprocal ] [ NN inverse of DNP ] ] [ VV is ] ]
The phrase structure trees are pruned first, and no part that can be pruned is found. The structures of the phrase structure trees are then compared by level-order traversal. Two queues P and Q are initialized, and the root nodes S1 and S2 of the two trees are added to P and Q respectively. Queues P and Q are now not empty, so the head nodes S1 and S2 are taken out. The content of both nodes is "S", both have three subtree nodes "NP", "NP", and "VV", and the contents of the subtree nodes are the same. The subtree nodes "NP", "NP", and "VV" are therefore added to queues P and Q respectively. Queues P and Q are again not empty, so the head node "NP" of each queue is taken out. The subtree of topic 1 has three nodes, "NN", "NR", and "DNP", while the subtree of topic 2 has only two nodes, "NR" and "DNP". The structures of the two phrase structure trees are therefore judged to be different, and the two topics are judged not to be similar.
In another example, the two topics to be compared are as follows:
Topic 3: The author of the Western Notes praised the spirit of resistance
Topic 4: The author of the Western Notes told what story
The tree structures are shown in Figs. 10 and 11. The constructed phrase structure trees are represented as:
Topic 3: [ S [ NP [ NP [ NN Western Notes ] [ DNP ] ] [ NN author ] ] [ VP [ VV praise ] [ AS ] ] [ NP [ NN resistance ] [ NN spirit ] ] ]
Topic 4: [ S [ NP [ NP [ NN Western Notes ] [ DNP ] ] [ NN author ] ] [ VP [ VV tell ] [ AS ] ] [ NP [ PN what ] [ NN story ] ] ]
The phrase structure trees are pruned first, and no part that can be pruned is found. The structures of the phrase structure trees are then compared by level-order traversal. Two queues P and Q are initialized, and the root nodes S1 and S2 of the two trees are added to P and Q respectively. Queues P and Q are now not empty, so the head nodes S1 and S2 are taken out. The content of both nodes is "S", both have three subtree nodes "NP", "VP", and "NP", and the contents of the subtree nodes are the same. The subtree nodes "NP", "VP", and "NP" are therefore added to the queues. Queues P and Q are again not empty, so the head node "NP" of each queue is taken out; the subtree of topic 3 is found to be the same as that of topic 4, and the subtree nodes "NN" and "DNP" are added to the queues. Next, the head node "VP" of queues P and Q is taken out and its child nodes are compared; in both trees they are "VV" and "AS". Next, the head node "NP" of queues P and Q is taken out and its child nodes are compared: the child nodes of topic 3 are "NN" and "NN", while the child nodes of topic 4 are "PN" and "NN", which differ. The structures of the two phrase structure trees are therefore judged to be different, and the two topics are judged to be different.
According to another aspect of the present invention, a similar topic identification system based on a phrase structure tree is provided, which includes a question text preprocessing module, a phrase structure tree building module, and a question judging module, as shown in Fig. 12.
The question text preprocessing module reads the question to be compared and the questions in the question bank, performs the corresponding text preprocessing on the question text, parses out the knowledge point information, formula expression information, and question information, and passes them to the phrase structure tree building module; the specific procedure is as described above.
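A preprocessing step of this kind might look like the following sketch. Everything specific in it is an assumption made for illustration: the stop-word list, the keyword-to-knowledge-point table, and the formula pattern are hypothetical, Unicode NFKC normalization stands in for the unified encoding step, and whitespace splitting stands in for word segmentation, which for Chinese text would require a dedicated segmenter.

```python
import re
import unicodedata

STOP_WORDS = {"the", "of", "is", "a"}                    # hypothetical list
KNOWLEDGE_KEYWORDS = {                                   # hypothetical table
    "equation": "quadratic equations",
    "reciprocal": "reciprocals",
}
# Hypothetical pattern: a run of math characters containing "=".
FORMULA_RE = re.compile(r"[0-9A-Za-z^+\-*/().]+=[0-9A-Za-z^+\-*/().]+")


def preprocess(text):
    """Return the word sequence, knowledge points, and formulas of a question."""
    text = unicodedata.normalize("NFKC", text)           # unify character forms
    formulas = FORMULA_RE.findall(text)                  # formula expressions
    prose = FORMULA_RE.sub(" ", text)
    prose = re.sub(r"[^\w\s]", " ", prose)               # drop stray characters
    words = [w for w in prose.lower().split() if w not in STOP_WORDS]
    points = sorted({KNOWLEDGE_KEYWORDS[w] for w in words
                     if w in KNOWLEDGE_KEYWORDS})
    return {"words": words, "knowledge_points": points, "formulas": formulas}


result = preprocess("The root of the equation x^2+6x+9=0 is?")
```

The formula is extracted before punctuation is stripped so that operators such as `^` and `+` are preserved inside the formula expression.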
The phrase structure tree building module performs lexical analysis and syntactic analysis on the question according to the question information obtained by the question text preprocessing module, builds a phrase structure tree in combination with the knowledge point information and formula expression information in the question, and passes the phrase structure tree to the question judging module; the specific procedure is as described above.
The question judging module first prunes the phrase structure trees of the questions to be compared, then traverses the trees level by level, judges how similar the questions are according to the tree structure information and the question content information of the phrase structure trees, and processes the questions accordingly.
In the question judging module, the phrase structure tree is pruned and traversed, the tree structure information of the phrase structure trees is compared, and then the content information of the phrase structure trees is compared, which covers the knowledge point information, the formula information, and the specific question content. The specific steps are as follows:
(1) Pruning the phrase structure tree:
Pruning the phrase structure tree includes pruning parentheticals and pruning words without substantive meaning, such as modal particles, onomatopoeia, and punctuation nodes. A parenthetical is an independent element of the sentence, and removing it simplifies the sentence. Words without substantive meaning carry little or no semantic information, so removing them does not affect the meaning expressed by the sentence.
The part labeled PRN in the phrase structure tree is a parenthetical; it is pruned by deleting all of its child nodes and then merging the remaining parts together. The part labeled Y is a modal particle, the part labeled O is onomatopoeia, and the part labeled PU is a punctuation node; these parts are pruned in the same way, by deleting all of their child nodes and then merging the remaining parts together.
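The pruning rule above can be sketched as a recursive filter over the tree. This is a minimal sketch under assumptions: the `Node` class and the sample tree are hypothetical, a pruned node is dropped together with its whole subtree, and "merging the remaining parts" is taken to mean keeping the surviving siblings in their original order.

```python
from dataclasses import dataclass, field

PRUNE_LABELS = {"PRN", "Y", "O", "PU"}   # labels named in the text above


@dataclass
class Node:
    """Hypothetical tree node: label, optional word, and child nodes."""
    label: str
    word: str = ""
    children: list = field(default_factory=list)


def prune(node):
    """Return a copy of the tree without PRN, Y, O, and PU subtrees."""
    kept = [prune(c) for c in node.children if c.label not in PRUNE_LABELS]
    return Node(node.label, node.word, kept)


tree = Node("S", children=[
    Node("NP", children=[Node("NN", "Xiaoming")]),
    Node("PRN", children=[Node("VV", "say")]),
    Node("VP", children=[Node("VV", "have"), Node("PU", ",")]),
    Node("PU", "?"),
])
pruned = prune(tree)
```

Building a new tree rather than editing in place keeps the original tree available, which is convenient when the unpruned form is still needed for display or debugging.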
(2) Traversing the phrase structure tree:
The phrase structure tree is traversed level by level (level-order traversal). The algorithm is as follows:
initialize a queue Q and add the root node S of the phrase structure tree to the queue;
while the queue Q is not empty:
    take out the head node of the queue Q;
    access the value of that node;
    for each child node of that node, if the child node is not a leaf node, add it to the queue.
(3) Comparing the tree structure information of the phrase structure trees:
The tree structure information of the phrase structure trees of the two questions is compared. If the tree structures differ, the questions are judged to be different; otherwise, the content information of the phrase structure trees is compared next.
The specific comparison process is as follows:
The two phrase structure trees to be compared, T1 and T2, are traversed level by level. First, two queues P and Q are initialized, and the root nodes S1 and S2 of the two trees are added to P and Q respectively. The head nodes of the two queues, namely S1 and S2, are then taken out and compared. If the contents of S1 and S2 are the same and the contents of their subtree nodes C1 and C2 are also the same, the subtree nodes C1 and C2 are added to queues P and Q respectively. Otherwise, the structures of the two phrase structure trees are directly judged to be different.
After each round of comparison, the queues P and Q are checked. If neither queue is empty, the head nodes are again taken out and compared in the same way. If one queue is empty and the other is not, the structures of the two phrase structure trees are judged to be different. If both queues are empty, the structure comparison is finished.
(4) Comparing the question content information of the phrase structure trees:
The question content information of the phrase structure trees is compared as follows:
First, the knowledge point information of the two questions is compared; if it differs, the questions are judged to be different. If the knowledge point information is the same, the formula expressions contained in the phrase structure trees are compared; if they differ, the questions are judged to be different. If the formula expression information is the same, the specific content of the questions is compared. In this content comparison, a weight is set for each part-of-speech category and the similarity of the two phrases is calculated; if the similarity is greater than a set threshold, the questions are judged to be the same, otherwise they are judged to be different. The similarity score is calculated as follows:
Figure BDA0002614212030000171
where w_i is the weight of the part of speech of the i-th word in the leaf nodes of the phrase structure tree, and c_i is the comparison result for the i-th word of the two phrase structure trees: c_i = 1 if the i-th words are the same, and c_i = 0 otherwise.
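The order of the three content checks can be sketched as a short decision cascade. The dictionary layout, the function names, and the placeholder score are assumptions made for illustration; in particular, `match_ratio` is an unweighted stand-in for the similarity score, not the weighted formula of the figure.

```python
def same_question(q1, q2, score_fn, threshold=0.8):
    """Apply the content checks in order and stop at the first that fails."""
    if set(q1["knowledge_points"]) != set(q2["knowledge_points"]):
        return False                       # different knowledge points
    if q1["formulas"] != q2["formulas"]:
        return False                       # different formula expressions
    return score_fn(q1["words"], q2["words"]) > threshold


def match_ratio(words1, words2):
    """Placeholder score: share of positions holding the same word."""
    if not words1 or len(words1) != len(words2):
        return 0.0
    return sum(a == b for a, b in zip(words1, words2)) / len(words1)


a = {"knowledge_points": ["reciprocals"], "formulas": [],
     "words": ["reciprocal", "of", "-3"]}
b = {"knowledge_points": ["reciprocals"], "formulas": [],
     "words": ["reciprocal", "of", "-3"]}
c = {"knowledge_points": ["quadratic equations"], "formulas": ["x^2+6x+9=0"],
     "words": ["root", "of", "equation"]}
d = {"knowledge_points": ["reciprocals"], "formulas": ["1/x=y"],
     "words": ["reciprocal", "of", "-3"]}

results = (same_question(a, b, match_ratio),
           same_question(a, c, match_ratio),
           same_question(a, d, match_ratio))
```

Placing the cheap equality checks first means the word-level score is computed only for question pairs that already agree on knowledge points and formulas.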
With the above method or system, the questions in a question bank can be compared one by one, so that identical or highly similar questions are deleted, which reduces the redundancy of the question bank and improves its quality.
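Removing redundant questions from a bank can be sketched as a simple pass that keeps a question only if it does not match any question already kept. The function name and the `is_same` predicate are hypothetical; the predicate stands in for the full structure and content comparison, and the case-insensitive string comparison in the example is only a placeholder.

```python
def deduplicate(bank, is_same):
    """Keep the first question of every group that is_same treats as equal."""
    kept = []
    for question in bank:
        if not any(is_same(question, other) for other in kept):
            kept.append(question)
    return kept


bank = [
    "The reciprocal of -3 is",
    "the reciprocal of -3 is",
    "The root of the equation is",
]
unique = deduplicate(bank, lambda a, b: a.lower() == b.lower())
```

This pass compares each question with every question kept so far, so its cost grows quadratically with the size of the bank.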
Technical contents not described in detail in the present invention belong to techniques well known to those skilled in the art.
Although illustrative embodiments of the present invention have been described above to help those skilled in the art understand the invention, it should be understood that the invention is not limited to the scope of these embodiments. To those skilled in the art, various changes are apparent as long as they fall within the spirit and scope of the invention as defined by the appended claims, and all inventions that make use of the inventive concept are protected.

Claims (10)

1. A similar topic identification method based on a phrase structure tree is characterized by comprising the following steps:
S1, performing text preprocessing on the input questions;
S2, constructing a phrase structure tree from the question information;
S3, pruning the phrase structure tree, traversing the phrase structure tree, and judging the similarity of the questions according to the tree structure and the leaf node content of the phrase structure tree.
2. The method according to claim 1, wherein in step S1, performing text preprocessing on the input questions comprises:
S11, performing unified encoding, word segmentation, stop word removal, and removal of useless and illegal characters to obtain a word sequence;
S12, analyzing and identifying the knowledge point information involved in the question according to the keywords in the question;
S13, parsing the formula expression information in the question by means of regular expressions.
3. The method according to claim 2, wherein in step S2, constructing the phrase structure tree from the question information comprises:
S21, performing lexical analysis on the word sequence;
S22, performing syntactic analysis on the word sequence;
S23, constructing the phrase structure tree according to the results of the lexical analysis and the syntactic analysis.
4. The method according to claim 1, wherein in step S3, the pruning comprises:
S31, pruning parentheticals;
S32, pruning words without substantive meaning.
5. The method according to claim 4, wherein in step S3, judging the similarity of the questions comprises:
S33, comparing the structures of the phrase structure trees of the questions; if the tree structure information of the phrase structure trees is different, judging that the questions are different, and otherwise proceeding to step S34;
S34, comparing whether the content information of the phrase structure trees is the same; if not, judging that the questions are different, and otherwise judging that the questions are the same.
6. The method according to claim 5, wherein in step S34, comparing the content information of the phrase structure trees comprises:
comparing whether the knowledge point information of the questions is the same, and if not, judging that the questions are different;
comparing whether the formula expressions contained in the phrase structure trees are the same, and if not, judging that the questions are different;
setting a weight for each part-of-speech category, calculating the similarity of the two phrases, and judging that the questions are the same if the similarity is greater than a set threshold, and otherwise judging that the questions are different.
7. The method according to claim 6, wherein the similarity is calculated by the formula:
Figure FDA0002614212020000021
where w_i is the weight of the part of speech of the i-th word in the leaf nodes of the phrase structure tree, and c_i is the comparison result for the i-th word of the two phrase structure trees: c_i = 1 if the i-th words are the same, and c_i = 0 otherwise.
8. A similar topic identification system based on a phrase structure tree, characterized by comprising a question text preprocessing module, a phrase structure tree building module, and a question judging module, wherein:
the question text preprocessing module is used for reading the question to be compared and the questions in a question bank, performing the corresponding text preprocessing on the question text, parsing out the knowledge point information and formula expression information in the question, and finally passing the question information to the phrase structure tree building module;
the phrase structure tree building module is used for performing lexical analysis and syntactic analysis on the question according to the question information obtained by the question text preprocessing module, building a phrase structure tree in combination with the knowledge point information and formula expression information in the question, and passing the phrase structure tree to the question judging module;
the question judging module is used for pruning the phrase structure trees of the questions to be compared, then traversing the phrase structure trees level by level, judging the similarity of the questions according to the tree structure information and the question content information of the phrase structure trees, and processing the questions accordingly.
9. The system according to claim 8, wherein in the question text preprocessing module, preprocessing the question text comprises:
performing unified encoding, word segmentation, stop word removal, and removal of useless and illegal characters to obtain a word sequence;
analyzing and identifying the knowledge point information involved in the question according to the keywords in the question;
parsing the formula expression information in the question by means of regular expressions.
10. The system according to claim 8, wherein the question judging module judging the similarity of the questions according to the tree structure information and the question content information of the phrase structure trees comprises:
comparing whether the knowledge point information of the questions is the same, and if not, judging that the questions are different;
comparing whether the formula expressions contained in the phrase structure trees are the same, and if not, judging that the questions are different;
setting a weight for each part of speech, calculating the similarity between the phrases, and judging that the questions are the same if the similarity is greater than a set threshold, and otherwise judging that the questions are different.
CN202010765054.2A 2020-08-03 2020-08-03 Similar topic identification method and system based on phrase structure tree Active CN111898343B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010765054.2A CN111898343B (en) 2020-08-03 2020-08-03 Similar topic identification method and system based on phrase structure tree

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010765054.2A CN111898343B (en) 2020-08-03 2020-08-03 Similar topic identification method and system based on phrase structure tree

Publications (2)

Publication Number Publication Date
CN111898343A true CN111898343A (en) 2020-11-06
CN111898343B CN111898343B (en) 2023-07-14

Family

ID=73184054

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010765054.2A Active CN111898343B (en) 2020-08-03 2020-08-03 Similar topic identification method and system based on phrase structure tree

Country Status (1)

Country Link
CN (1) CN111898343B (en)

Citations (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2006190101A (en) * 2005-01-06 2006-07-20 Csk Holdings Corp Natural language analysis device, method and program
US20070073745A1 (en) * 2005-09-23 2007-03-29 Applied Linguistics, Llc Similarity metric for semantic profiling
JP2007286721A (en) * 2006-04-13 2007-11-01 Nippon Hoso Kyokai <Nhk> Similarity evaluation device and program
EP2439684A2 (en) * 2010-10-06 2012-04-11 The Chancellor, Masters and Scholars of the University of Cambridge Automated assessment of examination scripts
CN105335528A (en) * 2015-12-01 2016-02-17 中国计量学院 Customized product similarity judgment method based on product structure
WO2016179938A1 (en) * 2015-05-14 2016-11-17 百度在线网络技术(北京)有限公司 Method and device for question recommendation
CN106651696A (en) * 2016-11-16 2017-05-10 福建天泉教育科技有限公司 Approximate question push method and system
CN107818082A (en) * 2017-09-25 2018-03-20 沈阳航空航天大学 With reference to the semantic role recognition methods of phrase structure tree
CN108334493A (en) * 2018-01-07 2018-07-27 深圳前海易维教育科技有限公司 A kind of topic knowledge point extraction method based on neural network
CN108345468A (en) * 2018-01-29 2018-07-31 华侨大学 Programming language code duplicate checking method based on tree and sequence similarity
CN108509484A (en) * 2018-01-31 2018-09-07 腾讯科技(深圳)有限公司 Grader is built and intelligent answer method, apparatus, terminal and readable storage medium storing program for executing
CN109872162A (en) * 2018-11-21 2019-06-11 阿里巴巴集团控股有限公司 A kind of air control classifying identification method and system handling customer complaint information
CN109947836A (en) * 2019-03-21 2019-06-28 江西风向标教育科技有限公司 English paper structural method and device
CN110222678A (en) * 2019-04-30 2019-09-10 宜春宜联科技有限公司 A kind of item analysis method, system, readable storage medium storing program for executing and electronic equipment
CN110853422A (en) * 2018-08-01 2020-02-28 世学(深圳)科技有限公司 Immersive language learning system and learning method thereof
CN111241239A (en) * 2020-01-07 2020-06-05 科大讯飞股份有限公司 Method for detecting repeated questions, related device and readable storage medium
CN111259632A (en) * 2020-02-10 2020-06-09 暗物智能科技(广州)有限公司 Semantic alignment-based tree structure mathematical application problem solving method and system

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
ARTUUR LEEUWENBERG等: "Exploring Pattern Structures of Syntactic Trees for Relation Extraction", 《LECT NOTES ARTIF INT》, pages 1 - 17 *
杨风玲: "基于语义角色分析的句子相似度的研究", 《中国优秀硕士学位论文全文数据库 信息科技辑》, pages 138 - 549 *

Also Published As

Publication number Publication date
CN111898343B (en) 2023-07-14

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant