CN111898343B

CN111898343B - Similar topic identification method and system based on phrase structure tree

Info

Publication number: CN111898343B
Application number: CN202010765054.2A
Authority: CN
Inventors: 陈鹏鹤; 卢宇; 余胜泉; 刘杰飞
Original assignee: Beijing Normal University
Current assignee: Beijing Normal University
Priority date: 2020-08-03
Filing date: 2020-08-03
Publication date: 2023-07-14
Anticipated expiration: 2040-08-03
Also published as: CN111898343A

Abstract

The invention provides a similar topic identification method and system based on a phrase structure tree. The method comprises the following steps: S1, performing text preprocessing on an input question; S2, constructing a phrase structure tree from the question information; S3, pruning the phrase structure tree, traversing it, and judging the similarity of the questions according to the tree structure information and the leaf node content information of the phrase structure tree. The method mainly addresses the comparison and identification of similar questions in primary and secondary school education: it constructs a phrase structure tree for each question to be compared and evaluates the similarity of the questions by comparing their phrase structure trees, thereby reducing redundancy in the question bank.

Description

Similar topic identification method and system based on phrase structure tree

Technical Field

The invention relates to the technical field of education, in particular to a similar topic identification method and system based on phrase structure tree.

Background

Question data is an important component of educational resources: the exercises students work on day to day and the examination questions used for testing both belong to it. With the development of computer and internet technology, electronic storage of question data in primary and secondary school education has largely been achieved. Question data helps students deepen their learning and understanding of knowledge, and it helps teachers keep track of students' mastery of knowledge and learning progress in a timely manner, help students identify and fill gaps, and improve learning efficiency.

Building a multidisciplinary question bank for primary and secondary schools makes it easier to update and manage question data across disciplines and also reduces teachers' workload. As the question data in the bank is continuously updated and expanded, two or more questions may turn out to be identical or similar. On the one hand, identical or similar questions make the question bank redundant and bloated, consuming more storage and computing resources; on the other hand, they reduce the efficiency of retrieving and using the question bank data.

It is therefore necessary to screen the questions in the question bank and remove identical or similar ones. In the task of identifying similar questions, evaluating and computing the similarity of two questions is one of the most important steps. Current methods for computing question similarity treat the questions to be compared as two continuous character strings. One approach evaluates similarity through a distance measure on the strings, for example by representing the characters as vectors and computing the cosine of the angle or the Euclidean distance between the two vectors. Another approach reduces the dimensionality of the text, for example by generating a SimHash value, i.e. a fingerprint, for each string and evaluating the similarity of the two strings from their fingerprints.

It should be noted that the above methods treat a question as a single string, whereas in practice a complete question often contains different kinds of expression, such as ordinary text and formulas. If the whole question is simply processed as a character string, its similarity to other questions cannot be estimated accurately. Moreover, some questions contain the same characters but have different sentence structures, so they convey different information and are in fact different questions, for example "the opposite number of the reciprocal of -3" and "the reciprocal of the opposite number of -3". A method that can determine more accurately whether two questions are the same is therefore needed. The phrase structure tree is a structure that represents the key positions and key information in a sentence well.

Disclosure of Invention

To address the above problems, the invention provides a similar topic identification method and system based on a phrase structure tree. Text preprocessing is performed on the question data, and the knowledge point information and formula information related to the question are parsed; a phrase structure tree is then constructed from the question information; the constructed tree is pruned and traversed level by level; and the structure information and leaf node content information of the trees are compared, thereby comparing the similarity of two questions.

According to one aspect of the present invention, a method for identifying similar topics based on a phrase structure tree is provided, comprising the steps of:

S1, performing text preprocessing on an input question;

S2, constructing a phrase structure tree from the question information;

S3, pruning the phrase structure tree, traversing the phrase structure tree, and judging the similarity of the questions according to the tree structure and the leaf node content of the phrase structure tree.

2. The method according to claim 1, wherein in step S1, performing text preprocessing on the input question includes:

S11, unifying the encoding, segmenting words, removing stop words, and removing useless and illegal characters, so as to obtain a word sequence;

S12, parsing and identifying the knowledge point information related to the question according to the keywords in the question;

S13, parsing the formula expression information in the question using regular expressions.

3. The method according to claim 2, wherein in step S2, constructing a phrase structure tree from the question information includes:

S21, performing lexical analysis on the word sequence;

S22, performing grammar analysis on the word sequence;

S23, constructing the phrase structure tree from the results of the lexical analysis and the grammar analysis.

4. The method according to claim 1, wherein in step S3, the pruning includes:

S31, pruning parentheticals (inserted expressions);

S32, pruning words with no substantive meaning.

5. The method according to claim 4, wherein in step S3, judging the similarity of the questions includes:

S33, comparing the structures of the phrase structure trees of the questions; if the tree structure information differs, judging that the questions are different; otherwise, proceeding to step S34;

S34, comparing whether the content information of the phrase structure trees is the same; if not, judging that the questions are different; otherwise, judging that the questions are the same.

6. The method according to claim 5, wherein in step S34, comparing the content information of the phrase structure trees includes:

comparing whether the knowledge point information related to the questions is the same, and if it is different, judging that the questions are different;

comparing whether the formula expressions contained in the phrase structure trees are the same, and if they are different, judging that the questions are different;

setting different weight values for the parts of speech and calculating the similarity of the two phrases; if the similarity is greater than a set threshold, judging that the questions are the same; otherwise, judging that the questions are different.

7. The method of claim 6, wherein the similarity is calculated by the formula:

\[ \mathrm{score} = \frac{\sum_{i} w_i c_i}{\sum_{i} w_i} \]

where w_i is the weight corresponding to the part of speech of the i-th segmented word in the leaf nodes of the phrase structure tree, and c_i is the result of comparing the i-th words of the two phrase structure trees: c_i = 1 if they are the same, otherwise c_i = 0.

8. A similar topic identification system based on a phrase structure tree is characterized by comprising a topic text preprocessing module, a phrase structure tree building module and a topic judging module, wherein:

the topic text preprocessing module is used for reading topic information to be compared and topic information of a topic library, carrying out corresponding text preprocessing on the topic text, analyzing knowledge point information and formula expression information in the topic, and finally transmitting the topic information to the phrase structure tree building module;

the phrase structure tree building module is used for carrying out lexical analysis and grammar analysis on the questions according to the question information acquired by the question text preprocessing module, and building a phrase structure tree by combining knowledge point information and formula expression information in the questions and transmitting the phrase structure tree to the question judging module;

the topic judging module is used for pruning the phrase structure tree according to the phrase structure tree information of the questions to be compared, traversing the phrase structure tree level by level, judging the similarity of the questions according to the tree structure information of the phrase structure tree and the question content information, and processing the questions accordingly.

9. The system of claim 8, wherein in the topic text preprocessing module, the method of preprocessing the question text comprises:

unified coding processing, word segmentation, stop word removal, useless and illegal character removal, and word sequence obtaining;

analyzing and identifying knowledge point information related in the questions according to the keywords in the questions;

and analyzing the formula expression information in the title according to the regular expression.

10. The system of claim 8, wherein the topic determination module determines similarity of topics based on tree structure information of a phrase structure tree and topic content information, the method comprising:

and setting different weight values for the parts of speech, calculating the similarity between phrases, and judging that the topics are the same if the similarity is larger than a set threshold value, otherwise, judging that the topics are different.

The beneficial effects of the invention are as follows:

(1) For question characterization in similar question comparison, the phrase structure tree is used to analyze the structure of each question, achieving a fine-grained structural representation of the question description.

(2) For similar question comparison, on the basis of the phrase structure tree representation, the invention prunes the phrase structure tree to extract its main part and then compares the trees, so that questions are compared at the level of their structure.

(3) For similar question comparison, on the basis of the phrase structure tree comparison, the invention further compares the knowledge point information, the formula information and the specific text of the questions at a fine granularity, improving the accuracy of the similarity judgment.

Drawings

FIG. 1 is a flow diagram of a method for identifying similar topics based on a phrase structure tree according to one embodiment of the present invention;

FIG. 2 is a flow diagram of a method for topic text preprocessing in accordance with one embodiment of the present invention;

FIG. 3 is a flow diagram of a topic construction phrase structure tree in accordance with one embodiment of the present invention;

FIG. 4 is a schematic diagram of a phrase structure tree;

FIG. 5 is a schematic diagram of a phrase structure tree;

FIG. 6 is a schematic diagram of a phrase structure tree;

FIG. 7 is a flow chart of a topic similarity determination in accordance with one embodiment of the present invention;

FIG. 8 is a schematic diagram of a phrase structure tree;

FIG. 9 is a schematic diagram of a phrase structure tree;

FIG. 10 is a schematic diagram of a phrase structure tree;

FIG. 11 is a schematic diagram of a phrase structure tree;

FIG. 12 is a schematic diagram of a similar topic identification system based on a phrase structure tree in accordance with an embodiment of the present invention.

Detailed Description

For the purpose of making the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions of the embodiments will be described clearly and completely below with reference to the accompanying drawings. The embodiments described below are some, but not all, embodiments of the present invention. All other embodiments obtained by those skilled in the art based on the embodiments of the invention without inventive effort fall within the scope of protection of the invention.

A phrase structure tree presents the result of the phrase structure analysis of a sentence as a tree; that is, each input sentence is analyzed by constructing its phrase tree. The phrase structure tree represents not only the grammatical relations within the sentence but also its hierarchy. From the tree, the phrase structure of the sentence can be read off quickly; for example, a node labeled NP indicates that the corresponding part is a noun phrase. When two phrases share the same nearest parent node, they are called same-level phrases. In addition, the phrase structure tree can capture parallel structures, clause structures and the like in a sentence.

The invention will be described in detail with reference to the drawings and detailed description. According to one aspect of the present invention, a similar topic identification method based on phrase structure tree is provided, as shown in fig. 1, comprising the following steps:

S1, performing text preprocessing on an input question;

S2, constructing a phrase structure tree based on the question information;

S3, pruning the phrase structure tree, traversing the phrase structure tree, and judging the similarity of the questions according to the tree structure of the phrase structure tree and the similarity of the leaf node contents.

In step S1, questions are often presented in different forms depending on how they are stored and the environment in which they are used. For example, depending on display requirements, questions may be encoded in GB2312, GBK, UTF-8, and so on. Uniform text preprocessing of the questions to be compared is therefore needed to facilitate the subsequent similarity comparison and improve its accuracy.

As shown in fig. 2, the text preprocessing operation mainly includes the following operations:

(1) The knowledge point information related to the question is parsed according to the keywords in the question.

(2) The formula expression information in the question information is identified using regular expressions.

(3) Unified encoding: the question encoding is unified to UTF-8;

characters are normalized: for example, full-width characters such as "４" and "ａ" may appear in the question and are normalized to "4" and "a";

the various types of spaces are converted into Chinese spaces and the various types of punctuation into Chinese punctuation, e.g., the English "?" is converted into the Chinese "？";

English letters in the question are uniformly converted to lowercase;

(4) Word segmentation:

the Chinese characters of the topic content need to be segmented firstly;

after the topic is segmented, the topic content is converted into a sequence of word representations separated by spaces;

(5) Removing stop words:

to improve the precision of the similarity comparison, words that are of little importance to the question may be removed. A common stop-word list is used here.

(6) Removing useless and illegal symbols:

removing empty brackets from the question, including brackets that contain only one or more spaces, such as "()", etc.;

removing redundant or unmatched symbols at the end of the question text, such as "=", "(", "[", "{", etc.;

removing meaningless content that consists only of option labels, such as "A, B, CD", "A, B C. D", "A, B, C", etc.;

removing line feeds, tabs, underscores and illegal characters such as "≡", "\xa0", "\x2", "\x0b", "\x0c", "\x0d", "\x0f", etc.;

removing garbled characters outside the character set and characters that cannot be displayed normally, such as emoji symbols.

The above operations have no fixed order, and a person skilled in the art can arrange the specific execution order as needed.
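A minimal Python sketch of a few of the normalization and cleanup operations listed above. The function names, the regular expressions and the exact set of characters removed are illustrative assumptions, not part of the patent; word segmentation, stop-word removal and punctuation conversion are omitted because they depend on an external segmenter and word list.

```python
import re

# Hypothetical patterns chosen for illustration only.
_CONTROL_CHARS = re.compile(r"[\n\t\x0b\x0c\r\x0f\xa0_]")   # line feeds, tabs, underscores, etc.
_EMPTY_BRACKETS = re.compile(r"[(（]\s*[)）]")               # "()", "( )", "（ ）"
_TRAILING_JUNK = re.compile(r"[=＝(（\[{]+$")                # dangling symbols at the end


def fullwidth_to_halfwidth_alnum(text):
    """Map full-width digits and Latin letters to their ASCII forms."""
    out = []
    for ch in text:
        code = ord(ch)
        is_digit = 0xFF10 <= code <= 0xFF19
        is_upper = 0xFF21 <= code <= 0xFF3A
        is_lower = 0xFF41 <= code <= 0xFF5A
        # Full-width forms sit at a fixed offset of 0xFEE0 from ASCII.
        out.append(chr(code - 0xFEE0) if (is_digit or is_upper or is_lower) else ch)
    return "".join(out)


def normalize_question(text):
    """Apply character normalization, lowercasing and symbol cleanup."""
    text = fullwidth_to_halfwidth_alnum(text)
    text = text.lower()
    text = _CONTROL_CHARS.sub("", text)
    text = _EMPTY_BRACKETS.sub("", text)
    text = _TRAILING_JUNK.sub("", text.strip())
    return text.strip()
```

Because the patent states that the operations have no fixed order, the order used in `normalize_question` is only one possible arrangement.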

A knowledge point is a general term for a particular piece of knowledge, in particular knowledge found in textbooks or examinations. For example, the question "given the lengths of the legs of a right triangle, find the length of its hypotenuse" belongs to the knowledge point "Pythagorean theorem". The knowledge point information related to a question is obtained by matching the keywords in the question information against a predefined knowledge point library.

The knowledge point library contains every knowledge point that appears in primary and secondary school, together with the keywords that describe each knowledge point (keywords describe the specific knowledge under that knowledge point; for example, the knowledge point "trigonometric functions" includes keywords such as "arbitrary angle", "radian", "sine", "cosine" and "tangent"). The structure of the knowledge point library and example entries are shown in Table 1.

Table 1. Knowledge point library structure and examples
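The keyword lookup described above can be sketched as a small dictionary match. This is an illustration in Python under stated assumptions: the in-memory dictionary `KNOWLEDGE_POINTS` and the function `match_knowledge_points` are hypothetical, and the entries are taken only from the examples mentioned in the text, not from the actual library.

```python
# Hypothetical in-memory stand-in for the knowledge point library:
# knowledge point name -> set of descriptive keywords.
KNOWLEDGE_POINTS = {
    "trigonometric functions": {"arbitrary angle", "radian", "sine", "cosine", "tangent"},
    "functions and equations": {"equation"},
}


def match_knowledge_points(words):
    """Return the knowledge points whose keywords occur in the word sequence."""
    tokens = set(words)
    return sorted(
        name for name, keywords in KNOWLEDGE_POINTS.items() if tokens & keywords
    )


# Example: a segmented question containing the keyword "equation".
print(match_knowledge_points(["equation", "root", "opposite number"]))
```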

In one embodiment, the input questions are as follows:

Question: The opposite number of the root of the equation x² + 6x + 9 = 0 is ( )?

Through text preprocessing, English brackets are uniformly converted into Chinese brackets, spaces are removed, and empty brackets are removed.

The question information is parsed, and the formula expression obtained for the question is: x^2+6*x+9=0

Matching the question information against the knowledge point library shows that the keyword "equation" in the question corresponds to the knowledge point "functions and equations".

The question information consists of the word sequence, the knowledge point information and the formula information produced by the above processing.
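As a rough illustration of identifying formula expressions with a regular expression, the Python sketch below pulls runs of formula-like characters out of a Chinese question string and rewrites them in the x^2+6*x+9=0 style used in this example. The pattern and the function name `extract_formulas` are assumptions made for illustration; the patent does not give its regular expressions. The sketch assumes that Latin letters and digits in the Chinese text belong to formulas.

```python
import re

# Hypothetical pattern: a run of digits, lowercase letters and operator symbols.
_FORMULA_RUN = re.compile(r"[0-9a-z²^*/+\-=.]+")


def extract_formulas(text):
    """Return formula-like substrings in a normalized form."""
    formulas = []
    for run in _FORMULA_RUN.findall(text):
        expr = run.replace("²", "^2")                    # superscript two -> ^2
        expr = re.sub(r"(\d)([a-z])", r"\1*\2", expr)    # implicit product: 6x -> 6*x
        formulas.append(expr)
    return formulas


print(extract_formulas("方程x²+6x+9=0的根的相反数是"))
```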

In step S2, as shown in fig. 3, a phrase structure tree is constructed for the topic information, mainly including the following steps.

Constructing the phrase structure tree means performing lexical analysis and grammar analysis on the word sequence in the question information after word segmentation and then building the tree. Each internal node represents a relation among words, and the content of each leaf node is a segmented word from the question information, so that the question information is converted into a phrase structure tree representation.

In computer science, a phrase structure tree is a data structure used to express the syntactic structure of a sentence. The invention applies this idea to question information by representing it as a tree in which leaf nodes correspond to words in the input sentence and the contents of the intermediate nodes are labels of phrase constituents: for example, NP indicates a noun phrase and VP indicates a verb phrase. The phrase structure tree is constructed mainly through lexical analysis and grammar analysis.

Lexical analysis matches lexical rules against the input string: the text to be analyzed is scanned character by character from left to right, and the words in the segmentation result are classified by part of speech into 12 categories, namely nouns, verbs, adjectives, numerals, measure words, pronouns, adverbs, prepositions, conjunctions, auxiliary words, interjections and onomatopoeia.

Grammar analysis takes the segmented word stream as input and determines whether the word sequence produced by lexical analysis is a sentence that conforms to the grammar rules. Modern Chinese grammar has various sentence structures, such as the subject-predicate structure and the verb-object structure; grammar analysis mainly extracts the sentence structure information in the question information.

Through the phrase structure tree, the relations among the parts of a sentence can be laid out clearly. The relations represented by the nodes of the phrase structure tree are shown in Table 2:

Table 2. Node labels and their meanings in the phrase structure tree

For example, for the sentence "loved classmates sit on a flying high-speed rail", the constructed phrase structure tree is shown in fig. 4. Its bracketed representation is as follows:

[ S [ VP [ CP [ ADJP loved ] [ NP classmates ] ] [ VV sits on ] ] [ NP [ CP [ VP fly ] [ NN high-speed rail ] ] [ LC ] ] ]

As another example, for the question information "the opposite number of the root of the equation x² + 6x + 9 = 0 is", the constructed phrase structure tree is shown in fig. 5, and its stored representation is:

[ S [ NP [ NN equation ] [ NR x^2+6*x+9=0 ] [ DNP of ] ] [ NP [ NN root ] [ DNP of ] [ NN opposite number ] ] [ VV is ] ]

where the formula information is: x^2+6*x+9=0

The knowledge point information involved is: functions and equations.

As a further example, for the question information "the opposite number of the reciprocal of -3 is", the constructed phrase structure tree is shown in fig. 6, and its stored representation is:

[ S [ NP [ NR -3 ] [ DNP of ] ] [ NP [ NN reciprocal ] [ DNP of ] [ NN opposite number ] ] [ VV is ] ]

where the formula information is: -3

The knowledge point information involved is: functions and equations.
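To make the bracketed storage format concrete, the following Python sketch parses a string such as "[ S [ NP ... ] [ VV is ] ]" into a small tree object. The `Node` class and `parse_tree` function are hypothetical helpers introduced for illustration; the patent does not specify how the stored representation is read back.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    label: str                      # constituent or part-of-speech label, e.g. "NP"
    word: str = ""                  # leaf text; empty for purely internal nodes
    children: list = field(default_factory=list)


def parse_tree(text):
    """Parse a bracketed phrase structure string into a Node tree."""
    tokens = text.replace("[", " [ ").replace("]", " ] ").split()
    pos = 0

    def parse():
        nonlocal pos
        assert tokens[pos] == "[", "each node must start with '['"
        pos += 1
        node = Node(label=tokens[pos])
        pos += 1
        words = []
        while tokens[pos] != "]":
            if tokens[pos] == "[":
                node.children.append(parse())   # nested constituent
            else:
                words.append(tokens[pos])       # leaf word (may contain spaces)
                pos += 1
        pos += 1                                # consume the closing "]"
        node.word = " ".join(words)
        return node

    return parse()


tree = parse_tree("[ S [ NP [ NR -3 ] [ DNP of ] ] [ VV is ] ]")
print(tree.label, [child.label for child in tree.children])
```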

In step S3, pruning is performed on the phrase structure tree, the phrase structure tree is traversed, and the similarity of the topics is determined according to the tree structure information and the topic content information of the phrase structure tree.

As shown in fig. 7, judging question similarity based on the phrase structure tree mainly involves two operations: first pruning and traversing the phrase structure trees and comparing their tree structure information, and then comparing their content information, which includes the knowledge point information related to the questions, the formula information and the specific content of the questions. The specific steps are as follows:

(1) Pruning the phrase structure tree:

Pruning operations are performed on the phrase structure tree, including pruning parentheticals and pruning nodes with no substantive meaning, such as modal particles, onomatopoeic words and punctuation nodes. Parentheticals are independent elements of a sentence, and removing them simplifies the sentence. Words with no substantive meaning carry little or no semantic information, and removing them does not affect the meaning the sentence expresses.

The part labeled PRN in the phrase structure tree is a parenthetical; it is pruned by deleting the node and all of its child nodes and then merging the remaining parts. The part labeled Y is a modal particle, the part labeled O is an onomatopoeic word, and the part labeled PU is a punctuation node; these parts are pruned in the same way. The phrase structure tree before pruning is shown in fig. 8, and its stored representation is:

[ S [ NP [ NN [ small-sized ] [ VP [ VV ] with ] [ QP [ CD three ] [ M ] ] [ NN rabbit ] [ PU, ] ] [ VP [ D ] and ] [ VV ] to obtain ] [ QP [ CD two ] [ M ] with [ VP [ P ] sharing ] [ QP [ CD several ] [ M ] with [ PU? ]]]

The pruned phrase structure tree is shown in fig. 9, and the phrase structure tree store is expressed as:

[ S [ NP [ NN small amine ] [ VP [ VV has ] [ QP [ CD three ] [ M ] only ] ] [ NN rabbit ] ] [ VP [ VV gives [ QP [ CD two ] [ M ] ] [ VP [ P sharing ] [ QP [ CD several ] [ M ] ] ] ]
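The pruning rule (drop every node labeled PRN, Y, O or PU together with its subtree, and keep the rest) can be sketched in Python as follows. The nested-list tree representation, in which a node is `[label, child, ...]` and a leaf word is a plain string, is an assumption made for this illustration, and the example sentence is invented rather than taken from fig. 8.

```python
# Labels pruned according to the description: parentheticals, modal particles,
# onomatopoeic words and punctuation.
PRUNED_LABELS = {"PRN", "Y", "O", "PU"}


def prune(node):
    """Return a copy of the tree without pruned nodes and their subtrees."""
    label, *rest = node
    kept = [label]
    for child in rest:
        if isinstance(child, list):
            if child[0] in PRUNED_LABELS:
                continue                 # delete the node and all of its children
            kept.append(prune(child))
        else:
            kept.append(child)           # leaf word stays with its parent
    return kept


example = ["S", ["NP", ["NN", "rabbit"]], ["PU", ","], ["VP", ["VV", "runs"], ["Y", "ah"]]]
print(prune(example))
```

The sketch returns a new tree rather than modifying the input, so the tree before pruning remains available for inspection.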

(2) Traversing the phrase structure tree:

The phrase structure tree is traversed level by level (level-order traversal). The algorithm is as follows:

initialize a queue Q and add the root node S of the phrase structure tree to the queue;

while the queue Q is not empty:

    take the head node out of the queue Q;

    visit the node's value;

    for each child of the node that is not null and is not a leaf node, add the child to the queue.
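The traversal steps above can be sketched in Python with a double-ended queue. The nested-list representation (a node is `[label, child, ...]` and a leaf word is a plain string) and the function name are assumptions for illustration only.

```python
from collections import deque


def level_order_labels(tree):
    """Visit node labels level by level, skipping leaf words."""
    labels = []
    queue = deque([tree])                 # initialize the queue with the root
    while queue:
        node = queue.popleft()            # take out the head node
        labels.append(node[0])            # "visit" the node value (its label)
        for child in node[1:]:
            if isinstance(child, list):   # leaf words are not enqueued
                queue.append(child)
    return labels


tree = ["S", ["NP", ["NR", "-3"], ["DNP", "of"]], ["VV", "is"]]
print(level_order_labels(tree))
```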

(3) Comparing tree structure information of phrase structure tree:

The tree structure information of the phrase structure trees is compared. If the tree structure information differs, the questions are judged to be different; otherwise, the content information of the phrase structure trees is compared next.

The specific comparison process is as follows:

During the level-order traversal, for the two phrase structure trees T₁ and T₂ to be compared, two queues P and Q are initialized and the root nodes S₁ and S₂ of the two trees are added to P and Q respectively. The head nodes of the two queues, S₁ and S₂, are then taken out and compared. If the contents of S₁ and S₂ are the same and the contents of their child nodes C₁ and C₂ are also the same, the child nodes C₁ and C₂ are added to queues P and Q respectively. Otherwise, the structures of the two phrase structure trees are directly judged to be different.

After each comparison, check whether the queues P and Q are empty. If neither is empty, take the head nodes out of the queues and continue comparing. If one queue is empty and the other is not, the structures of the two phrase structure trees are judged to be different. If both queues are empty, the structure comparison ends.
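The two-queue structure comparison can be sketched as follows in Python. As before, the nested-list tree representation and the function names are illustrative assumptions, and the two example trees follow the structure described for the equation question and the -3 question in this document.

```python
from collections import deque


def child_nodes(node):
    """Children that are themselves nodes (leaf words are plain strings)."""
    return [child for child in node[1:] if isinstance(child, list)]


def same_structure(t1, t2):
    """Compare two trees level by level using two queues P and Q."""
    p, q = deque([t1]), deque([t2])
    while p and q:
        n1, n2 = p.popleft(), q.popleft()
        kids1, kids2 = child_nodes(n1), child_nodes(n2)
        labels1 = [kid[0] for kid in kids1]
        labels2 = [kid[0] for kid in kids2]
        if n1[0] != n2[0] or labels1 != labels2:
            return False                  # node or child labels differ
        p.extend(kids1)
        q.extend(kids2)
    return not p and not q                # both queues must be exhausted together


q1 = ["S",
      ["NP", ["NN", "equation"], ["NR", "x^2+6*x+9=0"], ["DNP", "of"]],
      ["NP", ["NN", "root"], ["DNP", "of"], ["NN", "opposite number"]],
      ["VV", "is"]]
q2 = ["S",
      ["NP", ["NR", "-3"], ["DNP", "of"]],
      ["NP", ["NN", "reciprocal"], ["DNP", "of"], ["NN", "opposite number"]],
      ["VV", "is"]]
print(same_structure(q1, q2))
```

Only labels are compared here; leaf words are deliberately ignored because content comparison is a separate, later step.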

(4) Comparing topic content information of the phrase structure tree:

the method for comparing topic content information of phrase structure tree is as follows:

First, whether the knowledge point information related to the two questions is the same is compared; if it is different, the questions are judged to be different. If the knowledge point information is the same, whether the formula expressions contained in the phrase structure trees are the same is compared next; if they are different, the questions are judged to be different. If the formula expression information is the same, the specific content of the questions is compared. In the comparison of question content, different weight values are set for the part-of-speech categories and the similarity of the two phrases is calculated; if the similarity is greater than a set threshold, the questions are judged to be the same, otherwise they are judged to be different. The similarity score is calculated as:

\[ \mathrm{score} = \frac{\sum_{i} w_i c_i}{\sum_{i} w_i} \]

where w_i is the weight corresponding to the part of speech of the i-th segmented word in the leaf nodes of the phrase structure tree, and c_i is the result of comparing the i-th words of the two phrase structure trees: c_i = 1 if they are the same, otherwise c_i = 0.

For example, in comparing the question "contribution of the four inventions to the world is significant" with the question "Yao Mou contribution to the sports world", the noun part is {four inventions, Yao Mou, world, sports world, contribution}, the verb part is {pair, give, make}, and the adjective part is {outstanding, great}, specifically expressed as follows:

In one embodiment, the weights of nouns, verbs and adjectives are set to 0.2, 0.3 and 0.1 respectively, and the threshold is set to 0.8.

Therefore score = (0.2×0+0.3×0+0.2×0+0.3×1+0.1×0.1+0.2×1)/(0.2×1+0.3×1+0.2×1+0.3×1+0.1×1+0.2×1) = 0.4167, which is smaller than the set threshold 0.8, so the contents of the two phrases are judged to be different, i.e., the questions are different.
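The content comparison cascade (knowledge points, then formulas, then the weighted word-level score) can be sketched in Python as below. The function names, the dictionary layout of a question and the example words are hypothetical; the sketch also assumes the two trees have already passed the structure comparison, so their leaves align position by position with matching parts of speech.

```python
def similarity_score(leaves_a, leaves_b, weights):
    """Weighted share of aligned leaf words that are identical.

    Each leaf is a (word, part_of_speech) pair; weights maps a
    part-of-speech label to its weight w_i.
    """
    numerator = 0.0
    denominator = 0.0
    for (word_a, pos), (word_b, _) in zip(leaves_a, leaves_b):
        w = weights[pos]
        c = 1 if word_a == word_b else 0     # c_i from the definition above
        numerator += w * c
        denominator += w
    return numerator / denominator if denominator else 0.0


def is_same_question(q_a, q_b, weights, threshold=0.8):
    """Apply the three content checks in the order described."""
    if q_a["knowledge_points"] != q_b["knowledge_points"]:
        return False
    if q_a["formulas"] != q_b["formulas"]:
        return False
    return similarity_score(q_a["leaves"], q_b["leaves"], weights) > threshold


weights = {"NN": 0.2, "VV": 0.3}             # illustrative weights
a = [("cat", "NN"), ("eats", "VV"), ("fish", "NN")]
b = [("dog", "NN"), ("eats", "VV"), ("fish", "NN")]
print(similarity_score(a, b, weights))       # (0.3 + 0.2) / (0.2 + 0.3 + 0.2)
```

The default threshold of 0.8 mirrors the value used in the embodiment; the weights and the threshold are tunable parameters.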

In another example, the two topics to be compared are specifically as follows:

Question 1: The opposite number of the root of the equation x² + 6x + 9 = 0 is

Question 2: The opposite number of the reciprocal of -3 is

The constructed phrase structure trees are expressed as:

Question 1: [ S [ NP [ NN equation ] [ NR x^2+6*x+9=0 ] [ DNP of ] ] [ NP [ NN root ] [ DNP of ] [ NN opposite number ] ] [ VV is ] ]

Question 2: [ S [ NP [ NR -3 ] [ DNP of ] ] [ NP [ NN reciprocal ] [ DNP of ] [ NN opposite number ] ] [ VV is ] ]

The phrase structure trees are pruned first; no prunable part is found. The tree structures are then compared using level-order traversal. Two queues P and Q are initialized, and the root nodes S₁ and S₂ of the two trees are added to P and Q respectively. Since neither queue is empty, the head nodes S₁ and S₂ are taken out; both have the content "S" and both have three child nodes "NP", "NP" and "VV", so their child node contents are the same. The child nodes "NP", "NP" and "VV" are added to queues P and Q respectively. The queues are still not empty, so the head nodes "NP" of the two queues are taken out: in Question 1 this node has three child nodes "NN", "NR" and "DNP", whereas in Question 2 it has only two child nodes "NR" and "DNP". The structures of the two phrase structure trees are therefore different, and the two questions are judged to be dissimilar.

In another example, the two topics to be compared are specifically as follows:

Question 3: The author of Journey to the West praised the spirit of rebellion

Question 4: What story did the author of Journey to the West tell

The tree structures are shown in fig. 10 and fig. 11. The constructed phrase structure trees are as follows:

Question 3: [ S [ NP [ NP [ NN Journey to the West ] [ DNP of ] ] [ NN author ] ] [ VP [ VV praised ] [ AS ] ] [ NP [ NN rebellion ] [ NN spirit ] ] ]

Question 4: [ S [ NP [ NP [ NN Journey to the West ] [ DNP of ] ] [ NN author ] ] [ VP [ VV told ] [ AS ] ] [ NP [ PN what ] [ NN story ] ] ]

The phrase structure trees are pruned first; no prunable part is found. The tree structures are then compared using level-order traversal. Two queues P and Q are initialized, and the root nodes S₁ and S₂ of the two trees are added to P and Q respectively. Since neither queue is empty, the head nodes S₁ and S₂ are taken out; both have the content "S" and both have three child nodes "NP", "VP" and "NP", so their child node contents are the same. The child nodes "NP", "VP" and "NP" are added to the queues. The queues are still not empty, so the head nodes "NP" of the two queues are taken out; the child nodes in Question 3 and Question 4 are found to be the same, and the child nodes "NN" and "DNP" are added to the queues. Next, the head nodes "VP" of queues P and Q are taken out and their child nodes are compared; both have the child nodes "VV" and "AS". Then the nodes "NP" are taken out of queues P and Q and their child nodes are compared: the child nodes in Question 3 are "NN", whereas those in Question 4 are "PN" and "NN", which are not identical. The structures of the two phrase structure trees are therefore judged to be different, and the two questions are judged to be dissimilar.

According to another aspect of the present invention, there is provided a similar topic identification system based on a phrase structure tree, comprising a topic text preprocessing module, a phrase structure tree building module and a topic judging module, as shown in fig. 12.

The topic text preprocessing module reads the topic information to be compared and the topic information of the topic library, performs the corresponding text preprocessing on the topic text, parses the knowledge point information, formula expression information and topic information in each topic, and passes them to the phrase structure tree building module. The specific method is as described above.

The phrase structure tree building module performs lexical analysis and grammar analysis on each topic according to the topic information obtained by the topic text preprocessing module, builds a phrase structure tree that incorporates the knowledge point information and formula expression information in the topic, and passes the tree to the topic judging module. The specific method is as described above.

The topic judging module first prunes the phrase structure trees of the topics to be compared, then traverses them hierarchically, judges how similar the topics are according to the tree structure information and topic content information of the phrase structure trees, and processes the topics accordingly.

In the topic judging module, the phrase structure tree is first pruned and then traversed. The tree structure information is compared first, followed by the content information, which covers the knowledge point information related to the topic, the formula information and the specific content of the topic. The specific steps are as follows:

(1) Pruning the phrase structure tree:

The part labeled PRN in the phrase structure tree is a parenthetical (inserted expression); it is pruned by deleting the node together with all of its child nodes and merging the remaining parts. Likewise, the part labeled Y is a modal particle, the part labeled O is an onomatopoeic word, and the part labeled PU is a punctuation (sentence-break) node; each of these parts is pruned, all of its child nodes are deleted, and the remaining parts are merged.
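The pruning step can be sketched as follows. This is a minimal illustration rather than the patented implementation: the `Node` class, the `PRUNE_LABELS` set and the `prune` function are hypothetical names, and "merging the remaining parts" is assumed to mean that the surviving siblings simply stay attached to their parent.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    label: str                                     # phrase or part-of-speech tag, e.g. "NP", "NN"
    children: list = field(default_factory=list)   # child Node objects
    word: str = ""                                 # surface word for leaf nodes


# Labels that the text marks for pruning (assumed to be matched exactly).
PRUNE_LABELS = {"PRN", "Y", "O", "PU"}


def prune(node, labels=PRUNE_LABELS):
    """Return a copy of the tree without any subtree rooted at a pruned label.

    Dropping a child removes all of its descendants as well; the siblings
    that remain stay under the same parent, which is how this sketch
    interprets "merging the remaining parts".
    """
    kept = [prune(child, labels) for child in node.children
            if child.label not in labels]
    return Node(node.label, kept, node.word)
```

The function builds a new tree and leaves its input untouched, so the original tree remains available for inspection. A root node that itself carries a pruned label is not handled here.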

(2) Traversing the phrase structure tree:

while queue Q is not empty:

taking out the head node element of the queue Q;

accessing the node value;
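A minimal sketch of this hierarchical (level-order) traversal is given below. The tuple representation `(label, children)` and the function name `level_order` are illustrative assumptions, and the step that appends each node's children to the queue is the usual way a level-order traversal proceeds rather than something spelled out in the steps listed here.

```python
from collections import deque


def level_order(root):
    """Yield node labels level by level.

    A tree is represented as a (label, children) tuple, where children
    is a list of trees of the same shape (an assumption of this sketch).
    """
    queue = deque([root])                    # initialize queue Q with the root node
    while queue:                             # while queue Q is not empty
        label, children = queue.popleft()    # take out the head node of Q
        yield label                          # access the node value
        queue.extend(children)               # assumed step: enqueue the child nodes


tree = ("S", [("NP", [("NN", [])]), ("VP", [("VV", [])])])
print(list(level_order(tree)))   # ['S', 'NP', 'VP', 'NN', 'VV']
```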

(3) Comparing tree structure information of phrase structure tree:

The specific comparison process is as follows:

During the hierarchical traversal, for the two phrase structure trees T₁ and T₂ to be compared, two queues P and Q are initialized, and the root nodes S₁ and S₂ of the two trees are added to P and Q respectively. The head nodes of the two queues, S₁ and S₂, are then taken out and compared. If the contents of S₁ and S₂ are the same and the contents of their subtree nodes C₁ and C₂ are also the same, the subtree nodes C₁ and C₂ are added to the queues P and Q respectively. Otherwise, the structures of the two phrase structure trees are directly judged to be different.
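The two-queue comparison can be sketched as follows. This is an illustration under stated assumptions, not the patented implementation: trees are represented as `(label, children)` tuples, `same_structure` is a hypothetical name, and only node labels are compared, with leaf words left to the later content comparison. The trees `title3` and `title4` transcribe the labels of the Title 3 and Title 4 example given earlier.

```python
from collections import deque


def same_structure(t1, t2):
    """Compare two trees level by level using two queues P and Q."""
    p, q = deque([t1]), deque([t2])          # enqueue the root nodes
    while p and q:
        (label1, kids1), (label2, kids2) = p.popleft(), q.popleft()
        # The head nodes and the labels of their child nodes must all match.
        if label1 != label2 or [k[0] for k in kids1] != [k[0] for k in kids2]:
            return False
        p.extend(kids1)
        q.extend(kids2)
    return not p and not q                   # both queues must be exhausted together


subject = ("NP", [("NP", [("NN", []), ("DNP", [])]), ("NN", [])])
predicate = ("VP", [("VV", []), ("AS", [])])
title3 = ("S", [subject, predicate, ("NP", [("NN", []), ("NN", [])])])
title4 = ("S", [subject, predicate, ("NP", [("PN", []), ("NN", [])])])

print(same_structure(title3, title4))   # False: the last NP differs (NN NN vs PN NN)
print(same_structure(title3, title3))   # True
```

Returning as soon as a mismatch is found mirrors the "directly judging" wording above, so the later content comparison is only reached for structurally identical trees.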

(4) Comparing topic content information of the phrase structure tree:
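Claim 1 characterizes this content comparison as a weighted match over the leaf words: each part-of-speech category receives a weight w_i, each aligned word pair yields c_i = 1 if the words are identical and c_i = 0 otherwise, and the resulting similarity is compared with a threshold. The formula itself is not reproduced in this text, so the sketch below assumes a normalized weighted sum, sum(w_i * c_i) / sum(w_i). The function name, the weight values and the threshold are hypothetical.

```python
def weighted_similarity(leaves1, leaves2, weights, default_weight=1.0):
    """Weighted share of identical leaf words.

    leaves1, leaves2: lists of (pos_tag, word) pairs aligned by position,
    as is the case once the two trees are known to share one structure.
    weights: mapping from part-of-speech tag to weight w_i.
    ASSUMPTION: similarity = sum(w_i * c_i) / sum(w_i).
    """
    if len(leaves1) != len(leaves2):
        raise ValueError("leaf sequences must align; compare tree structure first")
    total = matched = 0.0
    for (pos, word1), (_, word2) in zip(leaves1, leaves2):
        w = weights.get(pos, default_weight)
        total += w
        if word1 == word2:        # c_i = 1 when the i-th words are identical
            matched += w
    # Two empty sequences are treated as identical in this sketch.
    return matched / total if total else 1.0


weights = {"NN": 2.0, "VV": 3.0}          # hypothetical weight values
a = [("NN", "author"), ("VV", "praise"), ("NN", "spirit")]
b = [("NN", "author"), ("VV", "narrate"), ("NN", "spirit")]
threshold = 0.8                           # hypothetical threshold
similar = weighted_similarity(a, b, weights) > threshold
```

With these weights the two noun matches contribute 4 of a total weight of 7, so the pair falls below the chosen threshold and would be judged different.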

Through the method or the system, the questions in the question bank can be compared one by one, so that the questions with the same or high similarity are deleted, the redundancy of the question bank is reduced, and the quality of the question bank is improved.

Technical content not elaborated in the present invention belongs to technology well known to those skilled in the art.

While illustrative embodiments of the present invention have been described above to help those skilled in the art understand the invention, it should be understood that the invention is not limited to the scope of these embodiments; various changes are protected insofar as they fall within the spirit and scope of the present invention as defined by the appended claims.

Claims

1. A similar topic identification method based on a phrase structure tree, characterized by comprising the following steps:

S1, preprocessing a text according to an input question;

S2, constructing a phrase structure tree for the topic information;

S3, pruning the phrase structure tree, traversing the phrase structure tree, and judging the similarity of questions according to the tree structure and leaf node content of the phrase structure tree;

the step of determining the similarity of the topics includes:

S34, comparing whether the content information of the phrase structure trees is the same; if not, judging that the questions are different; otherwise, judging that the questions are the same; the step of comparing content information of the phrase structure tree includes:

setting different weight values for part-of-speech categories, calculating the similarity of two phrases, and judging that the questions are the same if the similarity is larger than a set threshold value, otherwise, judging that the questions are different;

the calculation formula of the similarity is as follows:

wherein w_i is the weight corresponding to the part of speech of the i-th segmented word among the leaf nodes of the phrase structure tree, and c_i is the result of comparing the i-th words of the two phrase structure trees: c_i = 1 if they are the same, otherwise c_i = 0.

S21, performing lexical analysis on the word sequence;

S22, performing grammar analysis on the word sequence;

S31, pruning parentheticals (inserted expressions);

S32, pruning words without practical meaning.

5. The method according to claim 1, wherein in the step S34, the step of comparing content information of the phrase structure tree includes:

comparing whether the formula expressions contained in the phrase structure trees are the same, and if they are not, judging that the topics are not the same.

6. A similar topic identification system based on a phrase structure tree, characterized by comprising a topic text preprocessing module, a phrase structure tree building module and a topic judging module, wherein:

in the topic determination module, the step of determining topic similarity includes:

the calculation formula of the similarity is as follows:

7. The system of claim 6, wherein in the topic text preprocessing module, the method of preprocessing the topic text comprises:

8. The system of claim 6, wherein the topic determination module determines similarity of topics based on tree structure information of a phrase structure tree and topic content information, the method comprising: