CN111898343A

CN111898343A - Similar topic identification method and system based on phrase structure tree

Info

Publication number: CN111898343A
Application number: CN202010765054.2A
Authority: CN
Inventors: 陈鹏鹤; 卢宇; 余胜泉; 刘杰飞
Original assignee: Beijing Normal University
Current assignee: Beijing Normal University
Priority date: 2020-08-03
Filing date: 2020-08-03
Publication date: 2020-11-06
Anticipated expiration: 2040-08-03
Also published as: CN111898343B

Abstract

The invention provides a similar topic identification method and system based on a phrase structure tree. The method comprises the following steps: S1, performing text preprocessing on the input questions; S2, constructing a phrase structure tree from the question information; and S3, pruning the phrase structure tree, traversing it, and judging whether the questions are similar according to the tree structure information of the phrase structure tree and the content information of its leaf nodes. The method mainly addresses the comparison and identification of similar questions in primary and secondary school question banks: a phrase structure tree is constructed for each question to be compared, and the similarity of the questions is then evaluated by comparing the phrase structure trees, thereby reducing redundancy in the question bank.

Description

Similar topic identification method and system based on phrase structure tree

Technical Field

The invention relates to the technical field of education, in particular to a similar topic identification method and system based on a phrase structure tree.

Background

Question data is an important component of educational resources: the exercises that students use daily and the examination questions used for testing in learning and teaching all belong to question data. With the development of computer and Internet technology, question data in primary and secondary education is now largely stored electronically. Question data helps students deepen their understanding of knowledge during learning, and helps teachers keep track of students' mastery and learning progress in time, so that students can find and fill gaps in their knowledge and learn more efficiently.

Building a multidisciplinary question bank for primary and secondary schools makes the question data easier to update and manage, and also reduces teachers' workload. As the question data in the bank is continuously updated and expanded, two or more identical or similar questions may appear in it. On the one hand, identical or similar questions make the question bank redundant and bloated, consuming more storage and computing resources; on the other hand, they reduce the efficiency of retrieving and using the question bank data.

It is therefore necessary to screen the question bank and remove identical or similar questions. In the task of identifying similar questions, evaluating and computing the similarity of two questions is the most important step. Current question similarity methods mainly treat the questions to be compared as two continuous character strings. One approach evaluates similarity by a distance measure between the strings, for example by representing the characters as vectors and computing the cosine of the angle or the Euclidean distance between the two vectors; another approach reduces the dimensionality of the text, for example by generating a SimHash value (a fingerprint) for each string and evaluating the similarity of the two strings from their SimHash values.

Note that the above methods all treat a question as a single character string, whereas in practice a complete question often contains different forms of expression, such as ordinary text and formulas. If the whole question is simply processed as a character string, its similarity to another question cannot be evaluated accurately. In addition, some questions contain the same characters but have different sentence structures, so they express different information and are in fact different questions, for example "the opposite number of the reciprocal of -3" and "the reciprocal of the opposite number of -3". A method that determines more accurately whether questions are the same is therefore needed. The phrase structure tree is a structure that represents the key positions and key information in a sentence well.

Disclosure of Invention

To address the above problems, the invention provides a similar topic identification method and system based on a phrase structure tree. Text preprocessing is performed on the question data, and the knowledge point information and formula information related to each question are parsed; a phrase structure tree is then constructed from the question information, the constructed tree is pruned and traversed level by level, and the structure information of the trees and the content information of their leaf nodes are compared, thereby comparing the similarity between two questions.

According to one aspect of the invention, a similar topic identification method based on a phrase structure tree is provided, which comprises the following steps:

S1, performing text preprocessing on the input questions;

S2, constructing a phrase structure tree from the question information;

and S3, pruning the phrase structure tree, traversing the phrase structure tree, and judging the similarity of the questions according to the tree structure and the leaf node content of the phrase structure tree.

2. The method according to claim 1, wherein in step S1, performing text preprocessing on the input questions comprises:

S11, performing unified encoding, word segmentation, stop-word removal, and removal of useless and illegal characters to obtain a word sequence;

S12, analyzing and identifying the knowledge point information related to the question according to the keywords in the question;

and S13, parsing the formula expression information in the question by means of regular expressions.

3. The method according to claim 2, wherein in step S2, constructing the phrase structure tree from the question information comprises:

S21, performing lexical analysis on the word sequence;

S22, performing syntactic analysis on the word sequence;

and S23, constructing the phrase structure tree according to the results of the lexical analysis and the syntactic analysis.

4. The method according to claim 1, wherein in step S3, the pruning comprises:

S31, pruning parenthetical expressions;

and S32, pruning words with no practical meaning.

5. The method according to claim 4, wherein judging the similarity of the questions in step S3 comprises:

S33, comparing the structures of the phrase structure trees of the questions; if the tree structure information of the phrase structure trees differs, judging that the questions are different; otherwise, proceeding to step S34;

and S34, comparing whether the content information of the phrase structure trees is the same; if not, judging that the questions are different; otherwise, judging that the questions are the same.

6. The method according to claim 5, wherein in step S34, comparing the content information of the phrase structure trees comprises:

comparing whether the knowledge point information related to the questions is the same, and if not, judging that the questions are different;

comparing whether the formula expressions contained in the phrase structure trees are the same, and if not, judging that the questions are different;

setting different weight values for the parts of speech and calculating the similarity of the two phrases; if the similarity is greater than a set threshold, judging that the questions are the same; otherwise, judging that the questions are different.

7. The method according to claim 6, wherein the similarity is calculated by the formula:

where w_i is the weight of the part of speech corresponding to the i-th word in the leaf nodes of the phrase structure tree, and c_i is the comparison result for the i-th segmented word of the two phrase structure trees: c_i = 1 if the i-th words are the same, and c_i = 0 otherwise.

8. A similar topic identification system based on a phrase structure tree, characterized by comprising a question text preprocessing module, a phrase structure tree building module and a question judging module, wherein:

the question text preprocessing module is used for reading the question information to be compared and the question information of the question bank, performing the corresponding text preprocessing on the question text, parsing the knowledge point information and formula expression information in the question, and finally passing the question information to the phrase structure tree building module;

the phrase structure tree building module is used for performing lexical analysis and syntactic analysis on the question according to the question information obtained by the question text preprocessing module, building a phrase structure tree in combination with the knowledge point information and formula expression information in the question, and passing the phrase structure tree to the question judging module;

the question judging module is used for pruning the phrase structure trees of the questions to be compared, then traversing the phrase structure trees level by level, judging the similarity of the questions according to the tree structure information of the phrase structure trees and the question content information, and processing the questions accordingly.

9. The system according to claim 8, wherein in the question text preprocessing module, preprocessing the question text comprises:

performing unified encoding, word segmentation, stop-word removal, and removal of useless and illegal characters to obtain a word sequence;

analyzing and identifying the knowledge point information related to the question according to the keywords in the question;

and parsing the formula expression information in the question by means of regular expressions.

10. The system according to claim 8, wherein the question judging module judging the similarity of the questions according to the tree structure information of the phrase structure tree and the question content information comprises:

setting different weight values for the parts of speech and calculating the similarity between phrases; if the similarity is greater than a set threshold, judging that the questions are the same; otherwise, judging that the questions are different.

The invention has the beneficial effects that:

(1) For question representation in the comparison of similar questions, the invention uses a phrase structure tree to analyze the structure of a question, achieving a fine-grained structural representation of the question description.

(2) For the comparison of similar questions, the invention prunes the phrase structure tree representation to extract its main part, and then compares the phrase structure trees, so that questions are compared at the level of question structure.

(3) For the comparison of similar questions, on top of the phrase structure tree comparison the invention compares, at fine granularity, the knowledge point information, formula information and specific text information contained in the questions, which improves the accuracy of the similarity judgment.

Drawings

FIG. 1 is a flow chart illustrating a similar topic identification method based on a phrase structure tree according to an embodiment of the present invention;

FIG. 2 is a schematic flow chart of a topic text preprocessing method according to an embodiment of the present invention;

FIG. 3 is a schematic diagram of a process for constructing a phrase structure tree for topics according to an embodiment of the present invention;

FIG. 4 is a diagram of a phrase structure tree;

FIG. 5 is a diagram of a phrase structure tree;

FIG. 6 is a diagram of a phrase structure tree;

FIG. 7 is a flow chart illustrating topic similarity determination according to one embodiment of the present invention;

FIG. 8 is a diagram of a phrase structure tree;

FIG. 9 is a diagram of a phrase structure tree;

FIG. 10 is a diagram of a phrase structure tree;

FIG. 11 is a diagram of a phrase structure tree;

FIG. 12 is a diagram illustrating a structure of a similar topic identification system based on a phrase structure tree according to an embodiment of the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments are described in detail below with reference to the accompanying drawings. The embodiments described below are some, but not all, of the embodiments of the present invention. All other embodiments obtained by a person skilled in the art on the basis of these embodiments without inventive effort fall within the scope of protection of the present invention.

A phrase structure tree presents the result of the phrase structure analysis of a sentence as a tree: for each input sentence, the analysis is completed by constructing a phrase tree, which shows both the grammatical relationships and the hierarchy of the sentence. The phrase structure of a sentence can be read off quickly from the tree; for example, the node label NP indicates that the part is a noun phrase. In a phrase structure tree, two phrases whose nearest parent node is the same node are called same-level phrases. The phrase structure tree can also be used to analyze parallel structures, clause structures and the like in a sentence.

The invention is described in detail below with reference to the figures and the detailed description. According to an aspect of the present invention, a similar topic identification method based on a phrase structure tree is provided, as shown in fig. 1, including the following steps:

S1, performing text preprocessing on the input questions;

S2, constructing a phrase structure tree from the question information;

and S3, pruning the phrase structure tree, traversing the phrase structure tree, and judging the similarity of the questions according to the tree structure of the phrase structure tree and the similarity of the leaf node contents.

In step S1, the presentation form of a question often differs depending on how it is stored and the environment in which it is used. To meet different display requirements, questions may be encoded in different ways, such as GB2312, GBK or UTF-8. Uniform text preprocessing therefore needs to be performed on the questions to be compared, so that they can subsequently be compared for similarity and the accuracy of the comparison is improved.

As shown in fig. 2, the text preprocessing operation mainly includes the following operations:

(1) Analyzing the knowledge point information related to the question according to the keywords in the question.

(2) Identifying the formula expression information in the question by means of regular expressions.

(3) Unified encoding:

unifying the encoding format of the question to UTF-8;

normalizing characters: for example, full-width characters such as "４" and "ａ" may appear in a question and are normalized to "4" and "a";

converting all types of spaces to Chinese spaces and all types of punctuation marks to Chinese punctuation marks, for example converting the English "?" to the Chinese "？";

converting all English letters in the question to lower case;

(4) Word segmentation:

the Chinese text of the question content first needs to be segmented into words;

after segmentation, the question content is converted into a sequence of words separated by spaces;

(5) Removing stop words:

to improve the accuracy of the question similarity comparison, words that are of little importance to the question can be removed. A common stop-word list is used here.

(6) Removing useless and illegal symbols:

removing empty brackets in the question, including brackets that contain only one or more spaces, such as "（）" and "（ ）";

removing redundant or unmatched symbols at the end of the question description, such as "?", "(", "[" and "{";

removing meaningless fragments in the question description that consist only of option labels, such as "A, B, CD", "A, B, C, D" and "A, B, C";

removing line feeds, tab characters, underlines and illegal characters such as "□", "\xa0", "\xc2", "\x0b", "\x0c", "\x0d" and "\x0f";

removing characters outside the character set, such as emoji and other garbled characters that cannot be displayed normally.

These operations need not be performed in the order listed, and those skilled in the art can arrange the specific execution steps as required.
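As a minimal sketch of part of the preprocessing described above (character normalization, lower-casing, punctuation conversion, and removal of illegal characters and empty brackets), the following Python function may help. The function names, the punctuation table and the exact set of characters removed are assumptions made for illustration; word segmentation and stop-word removal are omitted because they depend on an external segmenter and stop-word list.

```python
import re

# Hypothetical punctuation table; the text above only gives "?" -> "？" as an example.
PUNCT_MAP = str.maketrans({"?": "？", "(": "（", ")": "）"})

def to_halfwidth_alnum(text):
    """Map full-width digits and Latin letters to their ASCII counterparts."""
    out = []
    for ch in text:
        code = ord(ch)
        # Full-width forms sit at a fixed offset (0xFEE0) above ASCII.
        if (0xFF10 <= code <= 0xFF19 or 0xFF21 <= code <= 0xFF3A
                or 0xFF41 <= code <= 0xFF5A):
            ch = chr(code - 0xFEE0)
        out.append(ch)
    return "".join(out)

def preprocess(text):
    text = to_halfwidth_alnum(text)      # "４" -> "4", "ａ" -> "a"
    text = text.lower()                  # English letters to lower case
    text = text.translate(PUNCT_MAP)     # English punctuation to Chinese
    # Remove line feeds, tabs, underlines and the illegal characters listed above.
    text = re.sub(r"[\xa0\xc2\x0b\x0c\x0d\x0f\n\t_□]", "", text)
    # Remove empty brackets, including brackets that contain only spaces.
    text = re.sub(r"（\s*）", "", text)
    return text.strip()

print(preprocess("方程Ａ的根是(  )?\n"))
```

The steps are applied in one fixed order here only for simplicity; as noted above, the order is not prescribed.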

A knowledge point is a general term for a unit of knowledge, in particular knowledge covered in textbooks or examinations. For example, a question that asks for the length of the hypotenuse of a given right-angled triangle belongs to the knowledge point "Pythagorean theorem". The keywords in the question information are matched against a predefined knowledge point library to obtain the knowledge point information related to the question.

The knowledge point library contains each knowledge point that appears in primary and secondary school, together with related descriptive keywords that describe the specific knowledge under that knowledge point; for example, the knowledge point "trigonometric functions" includes keywords such as "arbitrary angle", "radian measure", "sine", "cosine" and "tangent". The structure of the knowledge point library and example entries are shown in Table 1.

Table 1 Knowledge point library structure and examples

In one embodiment, the input question is as follows:

Question: The opposite number of the root of the equation x²+6x+9=0 is ( )?

Text preprocessing uniformly converts English brackets to Chinese brackets, removes the spaces, and removes the empty brackets.

Parsing the question information yields the formula expression of the question: x^2+6*x+9=0

Matching the question information against the knowledge point library shows that the keyword "equation" in the question belongs to the knowledge point "functions and equations".

The question information consists of the word sequence, the knowledge point information and the formula information produced by the above processing.
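The formula extraction and keyword matching in this embodiment could be sketched as follows. This is an illustration under stated assumptions: the regular expression, the two library entries and the Chinese question string (a back-translation of the example above) are hypothetical and are not taken from Table 1 or from the original filing.

```python
import re

# Hypothetical knowledge point library: knowledge point -> descriptive keywords.
KNOWLEDGE_POINTS = {
    "函数与方程": ["方程", "函数"],                        # functions and equations
    "三角函数": ["任意角", "弧度制", "正弦", "余弦", "正切"],  # trigonometric functions
}

# Assumed pattern: a run of ASCII letters, digits and operator characters.
FORMULA_PATTERN = re.compile(r"[0-9a-zA-Z^*/+\-=().]+")

def extract_formulas(question):
    """Return every formula-like substring found in the question text."""
    return FORMULA_PATTERN.findall(question)

def match_knowledge_points(question):
    """Return the knowledge points whose keywords occur in the question."""
    return [point for point, keywords in KNOWLEDGE_POINTS.items()
            if any(keyword in question for keyword in keywords)]

question = "方程x^2+6*x+9=0的根的相反数是"
print(extract_formulas(question))        # ['x^2+6*x+9=0']
print(match_knowledge_points(question))  # ['函数与方程']
```

A production pattern would need to cover more notation (fractions, roots, Greek letters); the character class here is only wide enough for the two examples used in this description.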

In step S2, as shown in FIG. 3, a phrase structure tree is constructed from the question information as follows.

On the basis of word segmentation, lexical analysis and syntactic analysis are performed on the word sequence in the question information, and the phrase structure tree is then constructed. Each tree node represents a relationship between words, and the leaf nodes hold the segmented words of the question, so that the question information is converted into a phrase structure tree representation.

In computer science, a phrase structure tree is a data structure that expresses the syntactic structure of a sentence. This idea is applied here to question information: the question information is built into a tree representation in which the leaf nodes correspond to the words of the input sentence and the intermediate nodes are labels of phrase constituents. For example, NP indicates a noun phrase and VP indicates a verb phrase. Constructing the phrase structure tree mainly involves lexical analysis and syntactic analysis.

Lexical analysis is the process of matching lexical rules against the input character string: the text to be analyzed is scanned character by character from left to right, and the parts of speech in the segmentation result are analyzed and classified into 12 word classes to determine the lexical rules. The part-of-speech classes are nouns, verbs, adjectives, numerals, quantifiers, pronouns, adverbs, prepositions, conjunctions, auxiliary words, interjections and onomatopoeia.

Syntactic analysis takes the segmented word stream as input and determines whether the word sequence given by lexical analysis forms a sentence that conforms to the grammar rules. Modern Chinese grammar has several sentence structures, such as the subject-predicate structure and the verb-object structure, and syntactic analysis mainly analyzes the sentence structure information in the question information.

The phrase structure tree makes the relationships between the parts of a sentence clear. The relationships represented by the nodes in the phrase structure tree and their meanings are shown in Table 2:

Table 2 Node labels in the phrase structure tree and their meanings

For example, for the sentence "lovely classmates sit on the speeding high-speed rail", the constructed phrase structure tree is shown in FIG. 4. In bracket notation it is expressed as follows:

[ S [ VP [ CP [ ADJP lovely ] [ NP classmates ] ] [ VV sit ] ] [ NP [ CP [ VP [ speeding ] [ NN high-speed rail ] ] [ LC ] ] ]

For another example, for the question "The opposite number of the root of the equation x²+6x+9=0 is", the constructed phrase structure tree is shown in FIG. 5 and is stored as:

[ S [ NP [ NN equation ] [ NR x^2+6*x+9=0 ] [ DNP of ] ] [ NP [ NN root ] [ DNP of ] [ NN opposite number ] ] [ VV is ] ]

The formula information is: x^2+6*x+9=0

The related knowledge point information is: functions and equations.

In another example, for the question "The opposite number of the reciprocal of -3 is", the constructed phrase structure tree is shown in FIG. 6 and is stored as:

[ S [ NP [ NR -3 ] [ DNP of ] ] [ NP [ NN reciprocal ] [ DNP of ] [ NN opposite number ] ] [ VV is ] ]

The formula information is: -3

The related knowledge point information is: functions and equations.
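For readers who want to experiment with the stored bracket representation, the following sketch parses such a string into nested Python lists of the form [label, child, ...], where a leaf word is a plain string. The parser and the nested-list layout are assumptions made for illustration; the description does not specify how the stored representation is read back.

```python
import re

def parse_tree(text):
    """Parse '[S [NP [NR -3] [DNP of]] ...]' into nested lists [label, child, ...]."""
    tokens = re.findall(r"\[|\]|[^\[\]\s]+", text)
    pos = 0

    def parse_node():
        nonlocal pos
        assert tokens[pos] == "[", "a node must start with '['"
        pos += 1
        label = tokens[pos]
        pos += 1
        children, words = [], []
        while tokens[pos] != "]":
            if tokens[pos] == "[":
                children.append(parse_node())
            else:
                words.append(tokens[pos])   # part of the leaf word
                pos += 1
        pos += 1                            # consume the closing ']'
        if words:
            children.append(" ".join(words))
        return [label] + children

    return parse_node()

tree = parse_tree("[ S [ NP [ NR -3 ] [ DNP of ] ] "
                  "[ NP [ NN reciprocal ] [ DNP of ] [ NN opposite number ] ] "
                  "[ VV is ] ]")
print(tree)
```

The sketch assumes balanced brackets and raises an IndexError on truncated input rather than attempting recovery.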

In step S3, the phrase structure tree is pruned and traversed, and the similarity of the questions is judged according to the tree structure information of the phrase structure tree and the question content information.

As shown in FIG. 7, judging question similarity based on the phrase structure tree mainly comprises two operations: first, the phrase structure tree is pruned and traversed and the tree structure information of the phrase structure trees is compared; then the content information of the phrase structure trees is compared, including the knowledge point information related to the questions, the formula information and the specific question content. The specific steps are as follows:

(1) Pruning the phrase structure tree:

Pruning the phrase structure tree includes pruning parenthetical expressions and pruning words with no practical meaning, such as modal particles, onomatopoeia and punctuation nodes. A parenthetical expression is an independent element of a sentence, and removing it simplifies the sentence. Words with no practical meaning carry little or no semantic information, and removing them does not affect the meaning expressed by the sentence.

The part labelled PRN in the phrase structure tree is a parenthetical expression; it is pruned by deleting all of its child nodes and then merging the remaining parts. The part labelled Y is a modal particle, the part labelled O is onomatopoeia, and the part labelled PU is a punctuation node; these parts are likewise pruned by deleting all of their child nodes and merging the remaining parts. The phrase structure tree before pruning is shown in FIG. 8 and is stored as:

[ NP [ NN Xiaoming ] [ VP [ VV have ] [ QP [ CD three ] [ M ] ] [ NN rabbit ] [ PU ] ] ] ] [ VP [ D and ] [ VV get ] [ QP [ CD two ] [ M ] ] [ PU ] ] ] [ VP [ P is shared ] [ QP [ CD several ] [ M ] ] [ PU ]? ]]]

The phrase structure tree after pruning is shown in FIG. 9 and is stored as:

[ S [ NP [ NN Xiaoming ] [ VP [ VV have ] [ QP [ CD three ] [ M ] ] [ NN Rabbit ] ] ] [ VP [ VV get ] [ QP [ CD two ] [ M ] ] ] [ VP [ P common ] [ QP [ CD several ] [ M ] ] ] ] ]
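The pruning step can be sketched as a recursive filter over a tree stored as nested lists [label, child, ...]. The nested-list layout, the function name and the small example tree are hypothetical; only the pruned labels PRN, Y, O and PU come from the description above.

```python
# Labels pruned according to the description: parenthetical expressions (PRN),
# modal particles (Y), onomatopoeia (O) and punctuation nodes (PU).
PRUNE_LABELS = {"PRN", "Y", "O", "PU"}

def prune(node):
    """Return a copy of the tree without any subtree whose label is pruned."""
    if isinstance(node, str):          # a leaf word is kept unchanged
        return node
    label, children = node[0], node[1:]
    kept = [prune(child) for child in children
            if isinstance(child, str) or child[0] not in PRUNE_LABELS]
    return [label] + kept

# Simplified, hypothetical tree (not the exact tree of FIG. 8).
tree = ["S",
        ["NP", ["NN", "Xiaoming"]],
        ["VP", ["VV", "has"], ["QP", ["CD", "three"]], ["NN", "rabbits"], ["PU", ","]],
        ["PU", "?"]]
print(prune(tree))
```

Because a pruned node is dropped together with its children, the remaining siblings stay attached to the same parent, which corresponds to merging the remaining parts.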

(2) Traversing the phrase structure tree:

The phrase structure tree is traversed level by level (level-order traversal). The algorithm is as follows:

initialize a queue Q and add the root node S of the phrase structure tree to the queue;

while the queue Q is not empty:

take the head node out of the queue Q;

visit the value of the node;

for each child node of the node, if the child node is not empty and is not a leaf node, add the child node to the queue.
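The level-order traversal above can be sketched in Python with a double-ended queue. The nested-list tree layout [label, child, ...], in which leaf words are plain strings, is an assumption made for illustration.

```python
from collections import deque

def level_order_labels(tree):
    """Visit the labelled nodes of the tree level by level and return their labels."""
    order = []
    queue = deque([tree])                 # initialize Q with the root node S
    while queue:                          # while Q is not empty
        node = queue.popleft()            # take out the head node
        order.append(node[0])             # visit the node value (its label)
        for child in node[1:]:
            if isinstance(child, list):   # leaf words (strings) are not enqueued
                queue.append(child)
    return order

tree = ["S",
        ["NP", ["NN", "equation"], ["NR", "x^2+6*x+9=0"], ["DNP", "of"]],
        ["NP", ["NN", "root"], ["DNP", "of"], ["NN", "opposite number"]],
        ["VV", "is"]]
print(level_order_labels(tree))
# ['S', 'NP', 'NP', 'VV', 'NN', 'NR', 'DNP', 'NN', 'DNP', 'NN']
```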

(3) Comparing the tree structure information of the phrase structure tree:

The tree structure information of the phrase structure trees of the questions is compared. If it differs, the questions are judged to be different; otherwise, the content information of the phrase structure trees is compared next.

The specific comparison process is as follows:

The two phrase structure trees T₁ and T₂ to be compared are traversed level by level. Two queues P and Q are initialized, and the root nodes S₁ and S₂ of the two trees are added to P and Q respectively. The head nodes of the two queues, namely S₁ and S₂, are then taken out and compared. If the contents of S₁ and S₂ are the same and the contents of their child nodes C₁ and C₂ are all the same, the child nodes C₁ and C₂ are added to the queues P and Q respectively. Otherwise, the structures of the two phrase structure trees are directly judged to be different.

After each round of comparison, the two queues P and Q are checked. If neither queue is empty, the head nodes are again taken out and compared. If one queue is empty and the other is not, the structures of the two phrase structure trees are judged to be different. If both queues are empty, the structure comparison of the phrase structure trees is complete.
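A sketch of the two-queue structure comparison follows. It assumes trees stored as nested lists [label, child, ...] with leaf words as plain strings, and it compares only node labels, which is one reading of "content" of a node in the structure comparison; the function names are hypothetical.

```python
from collections import deque

def child_nodes(node):
    """Labelled children of a node; leaf words (plain strings) are skipped."""
    return [child for child in node[1:] if isinstance(child, list)]

def same_structure(t1, t2):
    """Compare two phrase structure trees level by level using queues P and Q."""
    p, q = deque([t1]), deque([t2])
    while p and q:
        n1, n2 = p.popleft(), q.popleft()
        kids1, kids2 = child_nodes(n1), child_nodes(n2)
        # The node labels and the labels of all child nodes must agree.
        if n1[0] != n2[0] or [k[0] for k in kids1] != [k[0] for k in kids2]:
            return False
        p.extend(kids1)
        q.extend(kids2)
    # If one queue is empty and the other is not, the structures differ.
    return not p and not q

q1 = ["S",
      ["NP", ["NN", "equation"], ["NR", "x^2+6*x+9=0"], ["DNP", "of"]],
      ["NP", ["NN", "root"], ["DNP", "of"], ["NN", "opposite number"]],
      ["VV", "is"]]
q2 = ["S",
      ["NP", ["NR", "-3"], ["DNP", "of"]],
      ["NP", ["NN", "reciprocal"], ["DNP", "of"], ["NN", "opposite number"]],
      ["VV", "is"]]
print(same_structure(q1, q2))  # False: the first NP has three children versus two
```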

(4) Comparing the title content information of the phrase structure tree:

The question content information of the phrase structure trees is compared as follows:

First, the knowledge point information related to the two questions is compared; if it differs, the questions are judged to be different. If the knowledge point information is the same, the formula expressions contained in the phrase structure trees are compared next; if they differ, the questions are judged to be different. If the formula expression information is also the same, the specific content of the questions is compared. For the content comparison, different weight values are set for the part-of-speech classes and the similarity of the two phrases is calculated; if the similarity is greater than a set threshold, the questions are judged to be the same, and otherwise they are judged to be different. The similarity score is calculated as follows:

For example, to compare the question "The Four Great Inventions made outstanding contributions to the world" with the question "Yao Ming made great contributions to the sports world", the nouns are {Four Great Inventions, Yao Ming, world, sports world, contributions}, the verbs are {to, give, make}, and the adjectives are {outstanding, great}. Specifically:

part of speech: verb noun, verb noun, adjective noun

In one embodiment, the weights of nouns, verbs and adjectives are set to 0.2, 0.3 and 0.1 respectively, and the threshold is set to 0.8; then

Part of speech: verb noun, verb noun, adjective noun

Therefore, score = (0.2 × 0+0.3 × 0+0.2 × 0+0.3 × 1+0.1 × 0+0.2 × 1)/(0.2 × 1+0.3 × 1+0.1 + 1+0.2 × 1) = 0.4167, which is smaller than the set threshold of 0.8, and thus the contents of the two phrases are determined to be different, i.e., the questions are not the same.
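The three-stage content comparison can be sketched as follows. Since the similarity formula itself is not reproduced in this text, the sketch assumes a normalized weighted match, score = Σ w_i·c_i / Σ w_i, which agrees with the definitions of w_i and c_i but remains an assumption. The dictionary layout, function names and example words are hypothetical; only the weights 0.2, 0.3 and 0.1 and the threshold 0.8 come from the embodiment.

```python
# Part-of-speech weights and threshold from the embodiment above.
WEIGHTS = {"noun": 0.2, "verb": 0.3, "adjective": 0.1}
THRESHOLD = 0.8

def similarity(words1, words2):
    """Assumed score: sum(w_i * c_i) / sum(w_i) over position-aligned words.

    Each argument is a list of (word, part_of_speech) pairs; the two lists are
    expected to have equal length because the tree structures already matched.
    """
    matched = sum(WEIGHTS[pos] * (w1 == w2)
                  for (w1, pos), (w2, _) in zip(words1, words2))
    total = sum(WEIGHTS[pos] for _, pos in words1)
    return matched / total

def same_question(q1, q2, threshold=THRESHOLD):
    """Compare knowledge points, then formulas, then weighted word similarity."""
    if q1["knowledge_points"] != q2["knowledge_points"]:
        return False
    if q1["formulas"] != q2["formulas"]:
        return False
    return similarity(q1["words"], q2["words"]) > threshold

# Hypothetical example questions that differ in a single adjective.
a = [("Xiaoming", "noun"), ("has", "verb"), ("red", "adjective"), ("apples", "noun")]
b = [("Xiaoming", "noun"), ("has", "verb"), ("green", "adjective"), ("apples", "noun")]
q1 = {"knowledge_points": ["counting"], "formulas": [], "words": a}
q2 = {"knowledge_points": ["counting"], "formulas": [], "words": b}
print(similarity(a, b), same_question(q1, q2))
```

Ordering the checks from cheapest to most expensive means most non-matching pairs are rejected before any word-level comparison is performed.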

In another example, the two questions to be compared are as follows:

Question 1: The opposite number of the root of the equation x²+6x+9=0 is

Question 2: The opposite number of the reciprocal of -3 is

The constructed phrase structure trees are represented as:

Question 1: [ S [ NP [ NN equation ] [ NR x^2+6*x+9=0 ] [ DNP of ] ] [ NP [ NN root ] [ DNP of ] [ NN opposite number ] ] [ VV is ] ]

Question 2: [ S [ NP [ NR -3 ] [ DNP of ] ] [ NP [ NN reciprocal ] [ DNP of ] [ NN opposite number ] ] [ VV is ] ]

The phrase structure trees are pruned first, and no prunable part is found. The structures of the trees are then compared by level-order traversal. Two queues P and Q are initialized, and the root nodes S₁ and S₂ of the two trees are added to P and Q respectively. The queues P and Q are not empty, so the head nodes S₁ and S₂ are taken out. The content of both nodes is "S", both have three child nodes "NP", "NP" and "VV", and the contents of the child nodes are the same. The child nodes "NP", "NP" and "VV" are therefore added to queues P and Q respectively. The queues are still not empty, so the head node "NP" of each queue is taken out. The subtree of question 1 has three nodes, "NN", "NR" and "DNP", whereas the subtree of question 2 has only two nodes, "NR" and "DNP". The structures of the two phrase structure trees are therefore judged to be different, and the two questions are judged not to be similar.

In another example, the two questions to be compared are as follows:

Question 3: The author of Journey to the West praised the spirit of resistance

Question 4: What story did the author of Journey to the West tell

The tree structures are shown in FIG. 10 and FIG. 11. The constructed phrase structure trees are as follows:

Question 3: [ S [ NP [ NP [ NN Journey to the West ] [ DNP of ] ] [ NN author ] ] [ VP [ VV praised ] [ AS ] ] [ NP [ NN resistance ] [ NN spirit ] ] ]

Question 4: [ S [ NP [ NP [ NN Journey to the West ] [ DNP of ] ] [ NN author ] ] [ VP [ VV told ] [ AS ] ] [ NP [ PN what ] [ NN story ] ] ]

The phrase structure trees are pruned first, and no prunable part is found. The structures of the trees are then compared by level-order traversal. Two queues P and Q are initialized, and the root nodes S₁ and S₂ of the two trees are added to P and Q respectively. The queues P and Q are not empty, so the head nodes S₁ and S₂ are taken out. The content of both nodes is "S", both have three child nodes "NP", "VP" and "NP", and the contents of the child nodes are the same. The child nodes "NP", "VP" and "NP" are therefore added to the queues respectively. The queues are still not empty, so the head node "NP" of each queue is taken out; the subtree of question 3 is found to be the same as that of question 4, and the child nodes "NN" and "DNP" are added to the queues. The head node "VP" of each queue is taken out next and its child nodes are compared; in both trees they are "VV" and "AS". The head node "NP" of each queue is then taken out and its child nodes are compared: the child nodes of question 3 are all "NN", whereas the child nodes of question 4 are "PN" and "NN". Since these differ, the structures of the two phrase structure trees are judged to be different, and the two questions are judged to be different.

According to another aspect of the present invention, a similar topic identification system based on a phrase structure tree is provided, which includes a question text preprocessing module, a phrase structure tree building module and a question judging module, as shown in Fig. 12.

The question text preprocessing module is used for reading the information of the questions to be compared and of the questions in the question bank, performing the corresponding text preprocessing on the question text, parsing out the knowledge point information, formula expression information and question information in the question, and transmitting them to the phrase structure tree building module; the specific procedures are as described above.

The phrase structure tree building module is used for performing lexical analysis and syntactic analysis on the question according to the question information acquired by the question text preprocessing module, building a phrase structure tree by combining the knowledge point information and formula expression information in the question, and transmitting the phrase structure tree to the question judging module; the specific procedures are as described above.

The question judging module first prunes the phrase structure trees of the questions to be compared, then traverses the phrase structure trees level by level, judges the similarity of the questions according to the tree structure information of the phrase structure trees and the question content information, and processes the questions accordingly.

In the question judging module, the phrase structure tree is pruned and traversed, the tree structure information of the phrase structure trees is compared, and then their content information is compared, including the knowledge point information and formula information related to the question and the specific question content information. The specific steps are as follows:

(1) pruning the phrase structure tree:

The part labeled PRN in the phrase structure tree is a parenthetical. The parenthetical is pruned: all of its child nodes are deleted and the remaining parts are merged together. The part labeled Y is a modal particle, the part labeled O is an onomatopoeic word, and the part labeled PU is punctuation. These parts are pruned in the same way: all of their child nodes are deleted and the remaining parts are merged together.
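The pruning step can be illustrated with a short sketch. It assumes a simple `Node` class (a hypothetical representation, not one given in the text) and removes every subtree whose label is PRN, Y, O or PU, leaving the remaining siblings attached to their parent. The example tree is invented for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Labels that the description says are pruned.
PRUNE_LABELS = {"PRN", "Y", "O", "PU"}


@dataclass
class Node:
    label: str
    word: Optional[str] = None
    children: List["Node"] = field(default_factory=list)


def prune(node: Node) -> Optional[Node]:
    """Return a pruned copy of the tree, or None if the node itself is pruned.

    A node that loses all of its children is kept as an empty node; whether
    such nodes should also be removed is not specified in the text.
    """
    if node.label in PRUNE_LABELS:
        return None
    kept = [p for p in (prune(child) for child in node.children) if p is not None]
    return Node(node.label, node.word, kept)


# Invented example: a sentence with a parenthetical and final punctuation.
tree = Node("S", children=[
    Node("NP", children=[Node("NN", "triangle")]),
    Node("PRN", children=[Node("PU", "("), Node("NN", "hint"), Node("PU", ")")]),
    Node("VP", children=[Node("VV", "solve")]),
    Node("PU", "."),
])

pruned = prune(tree)
print([child.label for child in pruned.children])
```

Returning a copy rather than editing in place keeps the original tree available, which is a design choice of this sketch rather than something the description requires.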

(2) Traversing the phrase structure tree:

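initializing a queue Q and adding the root node of the phrase structure tree to the queue Q;

while queue Q is not empty:

    taking out the head node element of the queue Q;

    accessing the node value;

    adding the child nodes of the node to the tail of the queue Q;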
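The queue-based traversal can be sketched as follows. This is an illustrative version only: the tree is represented as nested `(label, children)` tuples, which is an assumption of the sketch, and the example shape mirrors the labels of the topic 3 tree.

```python
from collections import deque


def level_order(tree):
    """Yield node labels in level order; a tree is a (label, [children]) tuple."""
    queue = deque([tree])
    while queue:
        label, children = queue.popleft()   # take out the head node of the queue
        yield label                         # "access" the node value
        queue.extend(children)              # children wait behind the current level


# Shape of the topic 3 tree, labels only.
TOPIC_3_SHAPE = ("S", [
    ("NP", [("NP", [("NN", []), ("DNP", [])]), ("NN", [])]),
    ("VP", [("VV", []), ("AS", [])]),
    ("NP", [("NN", []), ("NN", [])]),
])

print(list(level_order(TOPIC_3_SHAPE)))
```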

(3) Comparing the tree structure information of the phrase structure tree:

the specific comparison process is as follows:
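The worked examples compare two trees by running two queues P and Q in step and checking the child labels of each pair of head nodes. A minimal sketch of that idea is given below; the function name, the tuple representation of a tree, and the shape of the returned value are assumptions of this sketch and are not taken from the original listing.

```python
from collections import deque


def first_structural_difference(tree_a, tree_b):
    """Compare two (label, [children]) trees level by level.

    Returns None when the structures match, otherwise a tuple
    (parent_label, child_labels_a, child_labels_b) for the first mismatch.
    """
    if tree_a[0] != tree_b[0]:
        return ("<root>", [tree_a[0]], [tree_b[0]])
    p, q = deque([tree_a]), deque([tree_b])
    while p and q:
        label, kids_a = p.popleft()
        _, kids_b = q.popleft()
        labels_a = [kid[0] for kid in kids_a]
        labels_b = [kid[0] for kid in kids_b]
        if labels_a != labels_b:
            return (label, labels_a, labels_b)
        # Child labels match, so both queues stay aligned node for node.
        p.extend(kids_a)
        q.extend(kids_b)
    return None


SUBJECT = ("NP", [("NP", [("NN", []), ("DNP", [])]), ("NN", [])])
VERB = ("VP", [("VV", []), ("AS", [])])
TOPIC_3_SHAPE = ("S", [SUBJECT, VERB, ("NP", [("NN", []), ("NN", [])])])
TOPIC_4_SHAPE = ("S", [SUBJECT, VERB, ("NP", [("PN", []), ("NN", [])])])

print(first_structural_difference(TOPIC_3_SHAPE, TOPIC_4_SHAPE))
print(first_structural_difference(TOPIC_3_SHAPE, TOPIC_3_SHAPE))
```

For the topic 3 and topic 4 shapes, the mismatch is reported at the final "NP", after the first "NP" and the "VP" have been checked, which follows the order of the walkthrough.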

(4) Comparing the question content information of the phrase structure tree:
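The claims describe the content comparison in terms of a part-of-speech weight w_i and a per-word comparison result c_i, but the similarity formula itself does not appear in this text. The sketch below therefore assumes a normalized weighted sum, sum(w_i * c_i) / sum(w_i); the normalization, the weight values, the default weight and the threshold are all assumptions made for illustration, as are the example leaf sequences.

```python
def weighted_similarity(leaves_a, leaves_b, weights, default_weight=1.0):
    """Similarity of two leaf sequences, each a list of (pos_tag, word) pairs.

    Assumes both trees already passed the structure comparison, so the
    sequences have the same length and the same tags position by position.
    """
    if len(leaves_a) != len(leaves_b):
        raise ValueError("leaf sequences must have the same length")
    total = 0.0
    matched = 0.0
    for (tag, word_a), (_, word_b) in zip(leaves_a, leaves_b):
        w = weights.get(tag, default_weight)   # w_i: weight of the part of speech
        c = 1 if word_a == word_b else 0       # c_i: 1 if the i-th words are the same
        total += w
        matched += w * c
    return matched / total if total else 1.0


# Hypothetical weights: content words count more than function words.
WEIGHTS = {"NN": 2.0, "VV": 1.0, "DNP": 1.0}
THRESHOLD = 0.8  # hypothetical threshold

a = [("NN", "triangle"), ("DNP", "of"), ("NN", "area"), ("VV", "compute")]
b = [("NN", "triangle"), ("DNP", "of"), ("NN", "perimeter"), ("VV", "compute")]

score = weighted_similarity(a, b, WEIGHTS)
print(score, "same" if score > THRESHOLD else "different")
```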

With the above method or system, the questions in the question bank can be compared one by one, so that identical or highly similar questions are deleted, which reduces the redundancy of the question bank and improves its quality.
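The one-by-one comparison of a question bank can be sketched as a simple filtering loop. In this illustration, `is_similar` stands in for the whole tree-based judgment and is passed in as a parameter; the toy predicate used in the example (case and whitespace normalization) is a placeholder only and is not the method of the invention.

```python
def deduplicate(questions, is_similar):
    """Keep each question only if it is not similar to one already kept."""
    kept = []
    for question in questions:
        if not any(is_similar(question, earlier) for earlier in kept):
            kept.append(question)
    return kept


def toy_is_similar(a, b):
    """Placeholder predicate: equal after lowercasing and collapsing spaces."""
    return " ".join(a.lower().split()) == " ".join(b.lower().split())


bank = ["Compute the area", "compute  the area", "Compute the perimeter"]
print(deduplicate(bank, toy_is_similar))
```

Each question is compared against the questions already kept, so the number of comparisons grows quadratically with the size of the bank in the worst case.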

Technical contents not described in detail in the present invention belong to the well-known techniques of those skilled in the art.

Although illustrative embodiments of the present invention have been described above to help those skilled in the art understand the invention, it should be understood that the invention is not limited to the scope of these embodiments. Various changes will be apparent to those skilled in the art, and as long as they fall within the spirit and scope of the present invention as defined and determined by the appended claims, all inventions and creations that make use of the inventive concept are protected.

Claims

1. A similar topic identification method based on a phrase structure tree is characterized by comprising the following steps:

S1, performing text preprocessing on input questions;

S2, constructing a phrase structure tree for the question information;

S21, performing lexical analysis on the word sequences;

S22, performing syntactic analysis on the word sequences;

4. The method according to claim 1, wherein in the step S3, the step of pruning comprises:

S31, pruning parenthetical insertions;

and S32, pruning words without practical meaning.

S33, comparing the structures of the phrase structure trees of the questions; if the tree structure information of the phrase structure trees is different, judging that the questions are different, and otherwise proceeding to step S34;

and S34, comparing whether the content information of the phrase structure trees is the same; if not, judging that the questions are different, and otherwise judging that the questions are the same.

setting different weight values for the part of speech categories, calculating the similarity of the two phrases, judging that the titles are the same if the similarity is greater than a set threshold, and otherwise, judging that the titles are different.

wherein w_i is the weight of the part of speech corresponding to the i-th word in the leaf nodes of the phrase structure tree, and c_i is the comparison result of the i-th segmented word of the two phrase structure trees: c_i = 1 if the i-th segmented words are the same, and c_i = 0 otherwise.

the phrase structure tree building module is used for performing lexical analysis and syntactic analysis on the questions according to the question information acquired by the question text preprocessing module, building a phrase structure tree by combining knowledge point information and formula expression information in the questions and transmitting the phrase structure tree to the question judging module;

and the question judging module is used for pruning the phrase structure tree according to the phrase structure tree information of the questions to be compared, then traversing the phrase structure tree hierarchically, judging the similarity of the questions according to the tree structure information of the phrase structure tree and the question content information and carrying out corresponding processing on the questions.