CN111898343B - Similar topic identification method and system based on phrase structure tree - Google Patents

Similar topic identification method and system based on phrase structure tree

Info

Publication number
CN111898343B
Authority
CN
China
Prior art keywords
phrase structure
structure tree
topic
information
tree
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010765054.2A
Other languages
Chinese (zh)
Other versions
CN111898343A (en
Inventor
陈鹏鹤
卢宇
余胜泉
刘杰飞
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Normal University
Original Assignee
Beijing Normal University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Normal University filed Critical Beijing Normal University
Priority to CN202010765054.2A priority Critical patent/CN111898343B/en
Publication of CN111898343A publication Critical patent/CN111898343A/en
Application granted granted Critical
Publication of CN111898343B publication Critical patent/CN111898343B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/12 Use of codes for handling textual entities
    • G06F40/14 Tree-structured documents
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/903 Querying
    • G06F16/90335 Query processing
    • G06F16/90344 Query processing by using string matching techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/22 Matching criteria, e.g. proximity measures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/253 Grammatical analysis; Style critique
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis
    • G PHYSICS
    • G09 EDUCATION; CRYPTOGRAPHY; DISPLAY; ADVERTISING; SEALS
    • G09B EDUCATIONAL OR DEMONSTRATION APPLIANCES; APPLIANCES FOR TEACHING, OR COMMUNICATING WITH, THE BLIND, DEAF OR MUTE; MODELS; PLANETARIA; GLOBES; MAPS; DIAGRAMS
    • G09B7/00 Electrically-operated teaching apparatus or devices working with questions and answers
    • G09B7/02 Electrically-operated teaching apparatus or devices working with questions and answers of the type wherein the student is expected to construct an answer to the question which is presented or wherein the machine gives an answer to the question presented by a student
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The invention provides a similar topic identification method and system based on a phrase structure tree. The method comprises the following steps: S1, performing text preprocessing on an input question; S2, constructing a phrase structure tree from the question information; S3, pruning the phrase structure tree, traversing it, and judging the similarity of the questions according to the tree structure information and the leaf node content information of the phrase structure tree. The method mainly addresses the comparison and identification of similar questions in primary and secondary school question banks: it constructs a phrase structure tree for each question to be compared and evaluates the similarity of the questions by comparing their phrase structure trees, thereby reducing redundancy in the question bank.

Description

Similar topic identification method and system based on phrase structure tree
Technical Field
The invention relates to the technical field of education, in particular to a similar topic identification method and system based on phrase structure tree.
Background
Question data are an important component of educational resources; the exercises students use in daily learning and the examination questions used for testing both belong to question data. With the development of computer and Internet technology, the electronic storage of question data in primary and secondary education has largely been achieved. Question data help students deepen their understanding of knowledge during learning, and help teachers track students' mastery of knowledge and learning progress in a timely manner, identify and fill gaps, and improve learning efficiency.
Building a multidisciplinary question bank for primary and secondary schools makes it easier to update and manage question data across subjects and reduces teachers' workload. As the question data in the bank are continuously updated and expanded, two or more questions may turn out to be identical or similar. On the one hand, identical or similar questions make the question bank redundant and bloated, consuming more storage and computing resources; on the other hand, they reduce the efficiency of retrieving and using the question bank data.
It is therefore necessary to screen the questions in the question bank and remove identical or similar ones. In the task of identifying similar questions, evaluating and calculating the similarity of two questions is one of the most important steps. Current methods for calculating question similarity treat the two questions to be compared as two continuous character strings. One approach evaluates similarity through a distance measure between the strings, for example by representing the characters as vectors and calculating the cosine similarity or Euclidean distance between the two vectors; another approach reduces the dimensionality of the text, for example by generating a SimHash value (a fingerprint) for each string and evaluating the similarity of the two strings by comparing their fingerprints.
It should be noted that the above methods treat a question as a single string, whereas in practice a complete question often contains different kinds of expressions, such as ordinary text and formulas. If the whole question is simply processed as a character string, its similarity cannot be accurately estimated. Moreover, some questions contain the same characters but have different sentence structures, so they convey different information and are in fact different questions, for example "the opposite number of the reciprocal of -3" and "the reciprocal of the opposite number of -3". A method that can determine more accurately whether two questions are the same is therefore needed. The phrase structure tree is a structure that represents the key positions and key information in a sentence well.
Disclosure of Invention
In view of the above problems, the invention provides a similar topic identification method and system based on a phrase structure tree. Text preprocessing is performed on the question data, and the knowledge point information and formula information involved in the question are parsed. A phrase structure tree is then constructed from the question information, the constructed tree is pruned and traversed level by level, and the structure information and leaf node content information of the trees are compared, thereby comparing the similarity of two questions.
According to one aspect of the present invention, a method for identifying similar topics based on a phrase structure tree is provided, comprising the steps of:
S1, performing text preprocessing on an input question;
S2, constructing a phrase structure tree from the question information;
S3, pruning the phrase structure tree, traversing the phrase structure tree, and judging the similarity of the questions according to the tree structure and the leaf node content of the phrase structure tree.
2. The method according to claim 1, wherein in step S1, performing text preprocessing on an input question includes:
S11, unified encoding, word segmentation, stop word removal, and removal of useless and illegal characters, so as to obtain a word sequence;
S12, analyzing and identifying the knowledge point information involved in the question according to the keywords in the question;
S13, parsing the formula expression information in the question by means of regular expressions.
3. The method according to claim 2, wherein in step S2, constructing a phrase structure tree from the question information includes:
S21, performing lexical analysis on the word sequence;
S22, performing syntactic analysis on the word sequence;
S23, constructing the phrase structure tree according to the results of the lexical analysis and the syntactic analysis.
4. The method according to claim 1, wherein in step S3, the pruning includes:
S31, pruning parentheticals;
S32, pruning words without practical meaning.
5. The method according to claim 4, wherein in step S3, judging the similarity of the questions includes:
S33, comparing the structures of the phrase structure trees of the questions; if the tree structure information differs, judging that the questions are different; otherwise, proceeding to step S34;
S34, comparing whether the content information of the phrase structure trees is the same; if not, judging that the questions are different; otherwise, judging that the questions are the same.
6. The method according to claim 5, wherein in step S34, comparing the content information of the phrase structure trees includes:
comparing whether the knowledge point information involved in the questions is the same, and if it is different, judging that the questions are different;
comparing whether the formula expressions contained in the phrase structure trees are the same, and if they are different, judging that the questions are different;
setting different weights for the parts of speech and calculating the similarity of the two phrases; if the similarity is greater than a set threshold, judging that the questions are the same; otherwise, judging that the questions are different.
7. The method of claim 6, wherein the similarity is calculated by the formula:
$$\mathrm{score} = \frac{\sum_{i} w_i c_i}{\sum_{i} w_i}$$
where $w_i$ is the weight corresponding to the part of speech of the i-th word in the leaf nodes of the phrase structure tree, and $c_i$ is the result of comparing the i-th words of the two phrase structure trees: $c_i = 1$ if they are the same, otherwise $c_i = 0$.
8. A similar topic identification system based on a phrase structure tree is characterized by comprising a topic text preprocessing module, a phrase structure tree building module and a topic judging module, wherein:
the topic text preprocessing module is used for reading topic information to be compared and topic information of a topic library, carrying out corresponding text preprocessing on the topic text, analyzing knowledge point information and formula expression information in the topic, and finally transmitting the topic information to the phrase structure tree building module;
the phrase structure tree building module is used for carrying out lexical analysis and grammar analysis on the questions according to the question information acquired by the question text preprocessing module, and building a phrase structure tree by combining knowledge point information and formula expression information in the questions and transmitting the phrase structure tree to the question judging module;
the topic judging module is used for pruning the phrase structure trees of the questions to be compared, traversing the phrase structure trees level by level, judging the similarity of the questions according to the tree structure information and the question content information of the phrase structure trees, and processing the questions accordingly.
9. The system of claim 8, wherein in the topic text preprocessing module, the method of preprocessing the question text comprises:
unified encoding, word segmentation, stop word removal, and removal of useless and illegal characters, so as to obtain a word sequence;
analyzing and identifying the knowledge point information involved in the question according to the keywords in the question;
parsing the formula expression information in the question by means of regular expressions.
10. The system of claim 8, wherein the topic judging module judges the similarity of questions based on the tree structure information and the question content information of the phrase structure trees, the method comprising:
comparing whether the knowledge point information involved in the questions is the same, and if it is different, judging that the questions are different;
comparing whether the formula expressions contained in the phrase structure trees are the same, and if they are different, judging that the questions are different;
setting different weights for the parts of speech and calculating the similarity between phrases; if the similarity is greater than a set threshold, judging that the questions are the same; otherwise, judging that the questions are different.
The beneficial effects of the invention are as follows:
(1) For question representation in similar question comparison, the invention uses the phrase structure tree to analyze the structure of questions, achieving a fine-grained structural representation of the question description.
(2) For similar question comparison, on the basis of the phrase structure tree representation, the invention prunes the phrase structure tree to extract its main part and then compares the pruned trees, so that questions are compared at the level of their structure.
(3) For similar question comparison, on the basis of the phrase structure tree comparison, the invention further compares the knowledge point information, formula information, and specific text content of the questions at a fine granularity, improving the accuracy of the similarity judgment.
Drawings
FIG. 1 is a flow diagram of a method for identifying similar topics based on a phrase structure tree according to one embodiment of the present invention;
FIG. 2 is a flow diagram of a method for topic text preprocessing in accordance with one embodiment of the present invention;
FIG. 3 is a flow diagram of a topic construction phrase structure tree in accordance with one embodiment of the present invention;
FIG. 4 is a schematic diagram of a phrase structure tree;
FIG. 5 is a schematic diagram of a phrase structure tree;
FIG. 6 is a schematic diagram of a phrase structure tree;
FIG. 7 is a flow chart of a topic similarity determination in accordance with one embodiment of the present invention;
FIG. 8 is a schematic diagram of a phrase structure tree;
FIG. 9 is a schematic diagram of a phrase structure tree;
FIG. 10 is a schematic diagram of a phrase structure tree;
FIG. 11 is a schematic diagram of a phrase structure tree;
FIG. 12 is a schematic diagram of a similar topic identification system based on a phrase structure tree in accordance with an embodiment of the present invention.
Detailed Description
To make the objects, technical solutions, and advantages of the embodiments of the present invention clearer, the technical solutions of the embodiments are described clearly and fully below with reference to the accompanying drawings. The embodiments described below are some, but not all, embodiments of the present invention. All other embodiments obtained by those skilled in the art based on the embodiments of the invention without inventive effort fall within the scope of protection of the invention.
A phrase structure tree presents the result of phrase structure analysis of a sentence as a tree; that is, each input sentence is analyzed by constructing its phrase tree. The phrase structure tree represents not only the grammatical relations of a sentence but also its hierarchy. From the phrase structure tree, the phrase structure of a sentence can be analyzed quickly; for example, a node labeled NP indicates that the corresponding part is a noun phrase. In a phrase structure tree, two phrases whose nearest parent node is the same node are called same-level phrases. In addition, the phrase structure tree can be used to analyze parallel structures, clause structures, and the like in sentences.
The invention will be described in detail with reference to the drawings and detailed description. According to one aspect of the present invention, a similar topic identification method based on phrase structure tree is provided, as shown in fig. 1, comprising the following steps:
S1, performing text preprocessing on an input question;
S2, constructing a phrase structure tree based on the question information;
S3, pruning the phrase structure tree, traversing the phrase structure tree, and judging the similarity of the questions according to the tree structure of the phrase structure tree and the similarity of the leaf node contents.
In step S1, the presentation form of questions often differs depending on the storage method and application environment. For example, depending on display requirements, questions may be encoded in GB2312, GBK, UTF-8, and so on. Unified text preprocessing of the questions to be compared is therefore needed to facilitate the subsequent similarity comparison and improve its accuracy.
As shown in fig. 2, the text preprocessing mainly includes the following operations:
(1) Analyze the knowledge point information involved in the question according to the keywords in the question.
(2) Identify the formula expression information in the question information by means of regular expressions.
(3) Unified encoding:
    • unify the question encoding format to UTF-8;
    • normalize characters: for example, full-width characters such as "４" and "ａ" may appear in a question and are normalized to "4" and "a";
    • convert all types of spaces into Chinese spaces and all types of punctuation into Chinese punctuation, e.g., the English "?" is converted to the Chinese "？";
    • convert all English letters in the question to lowercase.
(4) Word segmentation:
    • the Chinese text of the question is first segmented into words;
    • after segmentation, the question content is converted into a sequence of words separated by spaces.
(5) Stop word removal:
    • to improve the precision of the similarity comparison, words of little importance to the question are removed; a common stop word list is used here.
(6) Removal of useless and illegal symbols:
    • remove empty brackets, and brackets containing only one or more spaces, from the question, such as "()" and "( )";
    • remove redundant or unmatched symbols at the end of the question description, such as "=", "(", "[", "{";
    • remove meaningless content consisting only of option labels, such as "A, B, CD", "A, B, C. D", "A, B, C";
    • remove line feeds, tab characters, underscores, and illegal characters such as "≡", "\xa0", "\x2", "\x0b", "\x0c", "\x0d", "\x0f";
    • remove garbled characters outside the character set and characters that cannot be displayed normally, such as emoji.
The above operations need not be performed in a fixed order; a person skilled in the art can arrange the specific execution steps as required.
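The character-level part of this preprocessing can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the function names, the punctuation subset in `PUNCT_MAP`, and the stop word list are hypothetical, Unicode NFKC normalization is swapped in as one way to fold full-width characters, and word segmentation is assumed to be done by a separate segmenter that is not shown.

```python
import re
import unicodedata

# ASCII punctuation mapped to Chinese punctuation (illustrative subset, an assumption).
PUNCT_MAP = str.maketrans({"?": "？", ",": "，", "(": "（", ")": "）"})
# Hypothetical stop word list; the text only says a common list is used.
STOP_WORDS = {"的", "了", "请"}


def normalize_question(raw: bytes, encoding: str) -> str:
    """Decode a stored question and apply character-level cleanup."""
    text = raw.decode(encoding, errors="ignore")        # GB2312/GBK/UTF-8 -> str
    text = unicodedata.normalize("NFKC", text).lower()  # full-width -> half-width, lowercase
    text = re.sub(r"[\n\t\x0b\x0c\r\x0f_≡]", "", text)  # line feeds, tabs, underscores, etc.
    text = re.sub(r"[\U00010000-\U0010FFFF]", "", text) # emoji and other astral characters
    text = text.translate(PUNCT_MAP)                    # English -> Chinese punctuation
    text = re.sub(r"（\s*）", "", text)                  # empty brackets
    text = re.sub(r"[=（\[{\s]+$", "", text)             # dangling symbols at the end
    return text.strip()


def remove_stop_words(words):
    """Drop stop words from an already segmented word sequence."""
    return [w for w in words if w not in STOP_WORDS]
```

Because the operations are order-independent in principle, the order chosen here is only one convenient arrangement; punctuation conversion is done before bracket removal so that a single pattern covers both bracket styles.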
A knowledge point is a general term for a particular piece of knowledge, in particular knowledge covered in textbooks or examinations. For example, for the question "Given the lengths of the two legs of a right triangle, find the length of its hypotenuse", the knowledge point is the Pythagorean theorem. The knowledge point information involved in a question is obtained by matching the keywords in the question information against a predefined knowledge point library.
The knowledge point library contains each knowledge point appearing in primary and secondary school curricula, together with keywords describing it (the keywords describe specific knowledge under the knowledge point; for example, the knowledge point "trigonometric functions" includes keywords such as "arbitrary angle", "radian", "sine", "cosine", and "tangent"). The structure of the knowledge point library and example entries are shown in Table 1.
Table 1 Knowledge point library structure and examples
Figure GDA0004257782560000081
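Keyword matching against the knowledge point library can be sketched as below. The dictionary layout and its entries are assumptions made for illustration (the actual library structure is the one given in Table 1), and `match_knowledge_points` is a hypothetical helper name.

```python
# Hypothetical knowledge point library: knowledge point -> descriptive keywords.
KNOWLEDGE_BASE = {
    "trigonometric functions": {"arbitrary angle", "radian", "sine", "cosine", "tangent"},
    "functions and equations": {"equation", "root", "function"},
    "Pythagorean theorem": {"right triangle", "hypotenuse"},
}


def match_knowledge_points(words, knowledge_base=KNOWLEDGE_BASE):
    """Return the knowledge points whose keywords occur in the word sequence."""
    word_set = set(words)
    return sorted(
        point for point, keywords in knowledge_base.items() if word_set & keywords
    )
```

A question may match several knowledge points; returning a sorted list keeps the result deterministic so that two questions can later be compared by their knowledge point sets.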
In one embodiment, the input question is as follows:
Question: The opposite number of the root of the equation x^2+6x+9=0 is ( )?
Through text preprocessing, English brackets are uniformly converted into Chinese brackets, spaces are removed, and empty brackets are removed.
The question information is analyzed, and the formula expression obtained is: x^2+6*x+9=0
By matching the question information, the keyword "equation" in the question is found in the knowledge point library, giving the knowledge point "functions and equations".
The question information consists of the word sequence, knowledge point information, and formula information generated by the above processing.
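Formula extraction with a regular expression might look like the following sketch. The pattern is an assumption chosen to cover the simple expressions in the examples (lowercase letters, digits, and basic operators after preprocessing); a real question bank would need a richer pattern. The helper name `extract_formulas` is hypothetical.

```python
import re

# Assumed pattern: optional sign, then a run of letters, digits, and operator symbols.
FORMULA_RE = re.compile(r"[-+]?[0-9a-z][0-9a-z^+\-*/=().]*")


def extract_formulas(text):
    """Return formula expressions found in text, with explicit multiplication signs."""
    formulas = []
    for match in FORMULA_RE.finditer(text):
        expr = match.group()
        if any(ch.isdigit() for ch in expr):  # skip plain words without numbers
            # Insert "*" between a digit and a following letter or bracket: 6x -> 6*x.
            formulas.append(re.sub(r"(\d)([a-z(])", r"\1*\2", expr))
    return formulas
```

Normalizing "6x" to "6*x" means that two questions writing the same formula in slightly different ways yield the same expression string when compared later.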
In step S2, as shown in fig. 3, a phrase structure tree is constructed from the question information, mainly through the following steps.
Constructing the phrase structure tree means performing lexical analysis and syntactic analysis on the word sequence in the question information after word segmentation and then building the tree. Each internal tree node represents a relationship among words, and the leaf nodes contain the segmented words of the question, so that the question information is converted into a phrase structure tree representation.
In computer science, a phrase structure tree is a data structure used to express the syntactic structure of a sentence. This idea is applied here to question information: the question information is represented as a tree in which the leaf nodes correspond to the words of the input sentence and the other, intermediate nodes carry labels of phrase constituents. For example, NP indicates a noun phrase and VP indicates a verb phrase. The phrase structure tree is constructed mainly through lexical analysis and syntactic analysis.
Lexical analysis is the process of matching lexical rules against the input string: the text to be analyzed is scanned character by character from left to right, and the part of speech of each word in the segmentation result is analyzed and classified into one of 12 categories. The part-of-speech categories are nouns, verbs, adjectives, numerals, measure words, pronouns, adverbs, prepositions, conjunctions, particles, interjections, and onomatopoeia.
Syntactic analysis takes the segmented word stream as input and determines whether the word sequence produced by lexical analysis forms a sentence that conforms to the grammar rules. Modern Chinese grammar has various sentence structures, such as the subject-predicate structure and the verb-object structure; syntactic analysis mainly analyzes the sentence structure information in the question information.
Through the phrase structure tree, the relationships among the parts of a sentence can be made clear. The relationships represented by the nodes in the phrase structure tree are shown in Table 2:
Table 2 Node labels and their meanings in the phrase structure tree
Figure GDA0004257782560000091
Figure GDA0004257782560000101
For example, for the sentence "loved classmates sit on a flying high-speed rail", the constructed phrase structure tree is shown in fig. 4. It is stored in bracket notation as follows:
[ S [ VP [ CP [ ADJP loved ] [ NP classmates ] ] [ VV sits on ] ] [ NP [ CP [ VP fly ] [ NN high-speed rail ] ] [ LC ] ] ] ]
For another example, for the question information "The opposite number of the root of the equation x^2+6x+9=0 is", the constructed phrase structure tree is shown in fig. 5. The stored representation of the phrase structure tree is: [ S [ NP [ NN equation ] [ NR x^2+6*x+9=0 ] [ DNP of ] ] [ NP [ NN root ] [ DNP of ] [ NN opposite number ] ] [ VV is ] ]
The formula information is: x^2+6*x+9=0
The knowledge point information involved is: functions and equations.
In another example, for the question information "The opposite number of the reciprocal of -3 is", the constructed phrase structure tree is shown in fig. 6. The stored representation of the phrase structure tree is: [ S [ NP [ NR -3 ] [ DNP of ] ] [ NP [ NN reciprocal ] [ DNP of ] [ NN opposite number ] ] [ VV is ] ]
The formula information is: -3
The knowledge point information involved is: functions and equations.
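A stored bracket representation of this kind can be read back into a tree with a small recursive parser. The sketch below is an illustration only: the `Node` class, `parse_tree`, and `leaves` are hypothetical names, and the parser assumes each node is written as `[ LABEL ... ]` with a label directly after the opening bracket.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    label: str                                    # constituent label, e.g. "NP"
    children: list = field(default_factory=list)  # child Node objects
    text: str = ""                                # word(s) held by a leaf-level node


def parse_tree(s: str) -> Node:
    """Parse a bracket string such as "[ S [ NN root ] [ VV is ] ]"."""
    tokens = s.replace("[", " [ ").replace("]", " ] ").split()
    pos = 0

    def parse() -> Node:
        nonlocal pos
        if tokens[pos] != "[":
            raise ValueError("expected '[' at token %d" % pos)
        node = Node(label=tokens[pos + 1])
        pos += 2
        words = []
        while tokens[pos] != "]":
            if tokens[pos] == "[":
                node.children.append(parse())
            else:
                words.append(tokens[pos])
                pos += 1
        pos += 1  # consume the closing "]"
        node.text = " ".join(words)
        return node

    root = parse()
    if pos != len(tokens):
        raise ValueError("trailing tokens after the root node")
    return root


def leaves(node: Node):
    """Return (label, word) pairs of the leaf-level nodes from left to right."""
    if not node.children:
        return [(node.label, node.text)]
    return [leaf for child in node.children for leaf in leaves(child)]
```

For instance, parsing `"[ S [ NP [ NR -3 ] [ DNP of ] ] [ VV is ] ]"` gives a root labeled `S` with two children, and `leaves` returns the words in sentence order together with their labels.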
In step S3, the phrase structure tree is pruned and traversed, and the similarity of the questions is judged according to the tree structure information and the question content information of the phrase structure tree.
As shown in fig. 7, judging the similarity of questions based on the phrase structure tree mainly includes two stages: first, pruning and traversing the phrase structure tree and comparing the tree structure information; second, comparing the content information of the phrase structure tree, including the knowledge point information involved in the questions, the formula information, and the specific content of the questions. The specific steps are as follows:
(1) Pruning the phrase structure tree:
Pruning operations are performed on the phrase structure tree, including pruning parentheticals and pruning words without practical meaning, such as modal particles, onomatopoeia, and punctuation nodes. A parenthetical is an independent element of the sentence, and removing it simplifies the sentence. Words without practical meaning carry little or no semantic information, and removing them does not affect the meaning expressed by the sentence.
The part labeled PRN in the phrase structure tree is a parenthetical; it is pruned by deleting the node and all of its child nodes and then merging the remaining parts. The part labeled Y is a modal particle, the part labeled O is an onomatopoeic word, and the part labeled PU is a punctuation node; these parts are likewise pruned by deleting the nodes and all of their child nodes and merging the remaining parts. The phrase structure tree before pruning is shown in fig. 8, and its stored representation is:
[ S [ NP [ NN [ small-sized ] [ VP [ VV ] with ] [ QP [ CD three ] [ M ] ] [ NN rabbit ] [ PU, ] ] [ VP [ D ] and ] [ VV ] to obtain ] [ QP [ CD two ] [ M ] with [ VP [ P ] sharing ] [ QP [ CD several ] [ M ] with [ PU? ]]]
The pruned phrase structure tree is shown in fig. 9, and the phrase structure tree store is expressed as:
[ S [ NP [ NN small amine ] [ VP [ VV has ] [ QP [ CD three ] [ M ] only ] ] [ NN rabbit ] ] [ VP [ VV gives [ QP [ CD two ] [ M ] ] [ VP [ P sharing ] [ QP [ CD several ] [ M ] ] ] ]
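Pruning by label can be sketched as a recursive filter. In this illustration a tree node is assumed to be a `(label, payload)` tuple, where the payload is either a word (for a leaf-level node) or a list of child nodes; this representation and the function name `prune` are assumptions, while the pruned labels PRN, Y, O, and PU are those named in the text.

```python
# Labels whose subtrees are removed: parentheticals, modal particles,
# onomatopoeia, and punctuation.
PRUNE_LABELS = {"PRN", "Y", "O", "PU"}


def prune(node):
    """Return a pruned copy of node, or None if node itself is pruned."""
    label, payload = node
    if label in PRUNE_LABELS:
        return None                  # drop the node and all of its children
    if isinstance(payload, str):
        return (label, payload)      # leaf-level node: keep the word
    kept = [p for p in (prune(child) for child in payload) if p is not None]
    return (label, kept)             # remaining children stay under the same parent
```

Returning a new tree rather than modifying the input keeps the original tree available, which can be convenient when the unpruned form is still needed for display.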
(2) Traversing the phrase structure tree:
The phrase structure tree is traversed level by level (breadth-first). The algorithm is as follows:
    initialize a queue Q and add the root node S of the phrase structure tree to the queue;
    while queue Q is not empty:
        take out the head node of queue Q;
        visit the value of the node;
        for each child of the node, if the child is not null and is not a leaf node, add the child to the queue.
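The level-order traversal above can be written in Python with `collections.deque`. As an assumption for this sketch, a node is a `(label, payload)` tuple whose payload is either a word (the leaf) or a list of child nodes, so the leaf words themselves are never enqueued.

```python
from collections import deque


def level_order(root):
    """Return node labels in breadth-first (level) order."""
    order = []
    queue = deque([root])
    while queue:
        label, payload = queue.popleft()  # take out the head node
        order.append(label)               # visit the node value
        if isinstance(payload, list):     # a str payload is a leaf word: not enqueued
            queue.extend(payload)
    return order
```

For the tree `("S", [("NP", [("NN", "root")]), ("VV", "is")])` the visit order is `S`, `NP`, `VV`, `NN`: all nodes of one level are visited before any node of the next level.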
(3) Comparing tree structure information of the phrase structure trees:
The tree structure information of the two phrase structure trees is compared. If it differs, the questions are judged to be different; otherwise, the content information of the phrase structure trees is compared next.
The specific comparison process is as follows:
During the level-order traversal, for the two phrase structure trees T_1 and T_2 to be compared, two queues P and Q are initialized, and the root nodes S_1 and S_2 of the two trees are added to P and Q respectively. The head nodes of the two queues, namely S_1 and S_2, are then taken out and compared. If the contents of S_1 and S_2 are the same and the contents of their child nodes C_1 and C_2 are also the same, the child nodes C_1 and C_2 are added to queues P and Q respectively. Otherwise, the structures of the two phrase structure trees are directly judged to be different.
After each comparison, the two queues P and Q are checked. If neither queue is empty, the head nodes are again taken from the queues and the comparison continues. If one queue is empty and the other is not, the structures of the two phrase structure trees are judged to be different. If both queues are empty, the structure comparison ends.
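The two-queue structure comparison can be sketched as follows. The `(label, payload)` tuple representation of a node is an assumption of this sketch, and "content" of an internal node is taken to mean its constituent label; leaf words are deliberately ignored here because they are compared in the later content step.

```python
from collections import deque


def same_structure(t1, t2):
    """Compare two trees level by level using two queues P and Q."""
    p, q = deque([t1]), deque([t2])
    while p and q:
        (label1, payload1), (label2, payload2) = p.popleft(), q.popleft()
        if label1 != label2:
            return False
        kids1 = payload1 if isinstance(payload1, list) else []
        kids2 = payload2 if isinstance(payload2, list) else []
        # The child labels must agree in number and order.
        if [k[0] for k in kids1] != [k[0] for k in kids2]:
            return False
        p.extend(kids1)
        q.extend(kids2)
    # Structures match only if both queues are exhausted together.
    return not p and not q
```

Two trees with the same labels in the same positions but different words are reported as structurally equal, which is the intended outcome: the decision between "same" and "different" questions is then left to the content comparison.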
(4) Comparing topic content information of the phrase structure trees:
The topic content information of the phrase structure trees is compared as follows:
First, the knowledge point information involved in the two topics is compared; if it differs, the topics are judged to be different. If the knowledge point information is the same, the formula expressions contained in the phrase structure trees are compared; if they differ, the topics are judged to be different. If the formula expression information is also the same, the specific content of the topics is compared. In comparing the topic content, different weights are assigned to the part-of-speech categories and the similarity of the two phrases is calculated; if the similarity is greater than a set threshold, the topics are judged to be the same, otherwise they are judged to be different. The similarity score is calculated as:
$$\mathrm{score} = \frac{\sum_{i} w_i \, c_i}{\sum_{i} w_i}$$
where w_i is the weight corresponding to the part of speech of the i-th segmented word in the leaf nodes of the phrase structure tree, and c_i is the result of comparing the i-th words of the two phrase structure trees: c_i = 1 if the words are identical, otherwise c_i = 0.
When comparing the topic "The contribution of the four inventions to the world is significant" with the topic "Yao Mou's contribution to the sports world", the noun part is {four inventions, Yao Mou, world, sports world, contribution}, the verb part is {pair, give, make}, and the adjective part is {outstanding, great}, specifically expressed as follows:
Figure GDA0004257782560000131
In one embodiment, the weights of nouns, verbs and adjectives are set to 0.2, 0.3 and 0.1 respectively, and the threshold is set to 0.8.
Figure GDA0004257782560000132
Therefore score = (0.2×0 + 0.3×0 + 0.2×0 + 0.3×1 + 0.1×0.1 + 0.2×1) / (0.2×1 + 0.3×1 + 0.2×1 + 0.3×1 + 0.1×1 + 0.2×1) = 0.4167, which is smaller than the set threshold of 0.8, so the contents of the two phrases are judged to be different, i.e., the topics are different.
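The weighted score and the threshold decision can be sketched as follows. The weights (0.2, 0.3, 0.1) and the threshold 0.8 come from the embodiment above; everything else is an assumption made for illustration: the function names, the category keys "noun", "verb" and "adj", and the premise that the two word lists are already aligned position by position (which the preceding structure comparison is taken to guarantee). The sample words are invented and are not the example from the text.

```python
# Weights and threshold from the embodiment; key names are hypothetical.
POS_WEIGHTS = {"noun": 0.2, "verb": 0.3, "adj": 0.1}
THRESHOLD = 0.8


def similarity_score(words_a, words_b, weights=POS_WEIGHTS):
    """Compute sum(w_i * c_i) / sum(w_i) over position-aligned (word, pos) pairs."""
    if len(words_a) != len(words_b):
        raise ValueError("word lists must be aligned and of equal length")
    numerator = 0.0
    denominator = 0.0
    for (word_a, pos), (word_b, _) in zip(words_a, words_b):
        w = weights[pos]           # w_i: weight of the part of speech
        denominator += w
        if word_a == word_b:       # c_i = 1 when the i-th words are identical
            numerator += w
    return numerator / denominator if denominator else 0.0


def same_content(words_a, words_b, threshold=THRESHOLD):
    """Topics are judged the same only if the score exceeds the threshold."""
    return similarity_score(words_a, words_b) > threshold


# Invented sample data: only the verb differs.
a = [("author", "noun"), ("praise", "verb"), ("spirit", "noun")]
b = [("author", "noun"), ("narrate", "verb"), ("spirit", "noun")]
print(similarity_score(a, b), same_content(a, b))
```

Normalizing by the sum of all weights keeps the score in [0, 1], so a single threshold can be applied regardless of how many words the topics contain.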
In another example, the two topics to be compared are specifically as follows:
Topic 1: The opposite number of the root of the equation x^2+6x+9=0 is
Topic 2: The opposite number of the reciprocal of -3 is
The constructed phrase structure trees are expressed as:
Topic 1: [ S [ NP [ NN equation ] [ NR x^2+6x+9=0 ] [ DNP ] ] [ NP [ NN root ] [ NN opposite number of DNP ] ] [ VV is ] ]
Topic 2: [ S [ NP [ NR -3 ] [ DNP ] ] [ NP [ NN reciprocal ] [ NN opposite number of DNP ] ] [ VV is ] ]
Pruning is first performed on the phrase structure trees, and no prunable part is found. The structures of the phrase structure trees are then compared using level-order traversal. Two queues P and Q are initialized, and the root nodes S_1 and S_2 of the two trees are added to P and Q respectively. Queues P and Q are now both non-empty, so the head nodes S_1 and S_2 are taken out. Both nodes have the content "S", and each has three child nodes "NP", "NP" and "VV", so the child node contents are the same. The child nodes "NP", "NP" and "VV" are added to queues P and Q respectively. Queues P and Q are again non-empty, so the head nodes "NP" of the two queues are taken out. The subtree of topic 1 has three nodes "NN", "NR" and "DNP", whereas the subtree of topic 2 has only two nodes "NR" and "DNP". The structures of the two phrase structure trees are therefore judged to be different, and the two topics are judged to be dissimilar.
In another example, the two topics to be compared are specifically as follows:
Topic 3: The author of Journey to the West praised the spirit of resistance
Topic 4: What story does the author of Journey to the West tell
The tree structures are shown in Figs. 10 and 11. The constructed phrase structure trees are expressed as:
Topic 3: [ S [ NP [ NP [ NN Journey to the West ] [ DNP ] ] [ NN author ] ] [ VP [ VV praise ] [ AS ] ] [ NP [ NN resistance ] [ NN spirit ] ] ]
Topic 4: [ S [ NP [ NP [ NN Journey to the West ] [ DNP ] ] [ NN author ] ] [ VP [ VV narrate ] [ AS ] ] [ NP [ PN what ] [ NN story ] ] ]
Pruning is first performed on the phrase structure trees, and no prunable part is found. The structures of the phrase structure trees are then compared using level-order traversal. Two queues P and Q are initialized, and the root nodes S_1 and S_2 of the two trees are added to P and Q respectively. Queues P and Q are now both non-empty, so the head nodes S_1 and S_2 are taken out. Both nodes have the content "S", and each has three child nodes "NP", "VP" and "NP", so the child node contents are the same. The child nodes "NP", "VP" and "NP" are added to the queues respectively. Queues P and Q are again non-empty, so the head nodes "NP" of the two queues are taken out; the child nodes of topic 3 and topic 4 are found to be the same, and the child nodes "NN" and "DNP" are added to the queues. The head nodes "VP" of queues P and Q are then taken out and their child nodes compared; both have the child nodes "VV" and "AS". Next, the nodes "NP" are taken out of queues P and Q and their child nodes compared: the child nodes of topic 3 are "NN" and "NN", whereas the child nodes of topic 4 are "PN" and "NN". Since these are not identical, the structures of the two phrase structure trees are judged to be different, and the two topics are judged to be dissimilar.
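To experiment with examples like the ones above, the bracketed notation can be read into a tree with a small parser. The sketch below is only an illustration: `parse_bracketed` and `child_labels` are hypothetical helpers, the dictionary layout is an assumption, and the input strings are simplified renderings of the two trees (with single-word placeholders for the leaf words).

```python
def parse_bracketed(text):
    """Parse '[ LABEL word... [ CHILD ... ] ]' into nested dictionaries."""
    tokens = text.replace("[", " [ ").replace("]", " ] ").split()
    pos = 0

    def parse_node():
        nonlocal pos
        if pos >= len(tokens) or tokens[pos] != "[":
            raise ValueError("expected '['")
        pos += 1
        if pos >= len(tokens) or tokens[pos] in "[]":
            raise ValueError("expected a node label")
        label = tokens[pos]
        pos += 1
        words, children = [], []
        while pos < len(tokens) and tokens[pos] != "]":
            if tokens[pos] == "[":
                children.append(parse_node())   # nested constituent
            else:
                words.append(tokens[pos])       # word under this node
                pos += 1
        if pos >= len(tokens):
            raise ValueError("unbalanced brackets")
        pos += 1                                # consume the closing ']'
        return {"label": label, "words": words, "children": children}

    root = parse_node()
    if pos != len(tokens):
        raise ValueError("unexpected trailing tokens")
    return root


def child_labels(node):
    """Return the labels of a node's direct children, in order."""
    return [child["label"] for child in node["children"]]


t3 = parse_bracketed("[ S [ NP [ NP [ NN book ] [ DNP ] ] [ NN author ] ] "
                     "[ VP [ VV praise ] [ AS ] ] [ NP [ NN resistance ] [ NN spirit ] ] ]")
t4 = parse_bracketed("[ S [ NP [ NP [ NN book ] [ DNP ] ] [ NN author ] ] "
                     "[ VP [ VV narrate ] [ AS ] ] [ NP [ PN what ] [ NN story ] ] ]")
# The last NP is where the two trees diverge.
print(child_labels(t3["children"][2]), child_labels(t4["children"][2]))
```

Raising an error on unbalanced input is a deliberate choice here, since a missing closing bracket would otherwise silently change the tree shape that the structure comparison relies on.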
According to another aspect of the present invention, there is provided a similar topic identification system based on a phrase structure tree, comprising a topic text preprocessing module, a phrase structure tree building module and a topic judging module, as shown in Fig. 12.
The topic text preprocessing module is used for reading the topic information to be compared and the topic information of a topic library, performing the corresponding text preprocessing on the topic text, analyzing the knowledge point information, formula expression information and topic information in the topic, and transmitting them to the phrase structure tree building module. The specific method is as described above.
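The kind of preprocessing this module performs can be sketched as follows. This is a rough illustration under several assumptions, none of which come from the source: the stop-word list, the keyword-to-knowledge-point table and the formula regular expression are invented, and whitespace splitting stands in for real word segmentation (Chinese text would need a proper segmenter).

```python
import re
import unicodedata

# All three tables below are hypothetical placeholders.
STOP_WORDS = {"the", "of", "is", "a"}
KNOWLEDGE_KEYWORDS = {"equation": "equations", "reciprocal": "reciprocals"}
# Very rough pattern for "expression = expression"
FORMULA_RE = re.compile(r"[0-9A-Za-z^+\-*/().]+=[0-9A-Za-z^+\-*/().]+")


def preprocess(text):
    """Return the word sequence, formula expressions and knowledge points."""
    # Unified encoding: NFKC folds full-width characters into ASCII forms.
    text = unicodedata.normalize("NFKC", text)
    # Extract formula expressions with a regular expression.
    formulas = FORMULA_RE.findall(text)
    cleaned = FORMULA_RE.sub(" ", text)
    # Remove characters that are neither word characters nor whitespace.
    cleaned = re.sub(r"[^\w\s]", " ", cleaned)
    # Stand-in segmentation plus stop-word removal.
    tokens = [t for t in cleaned.lower().split() if t not in STOP_WORDS]
    # Identify knowledge points from keywords in the topic.
    knowledge = sorted({KNOWLEDGE_KEYWORDS[t] for t in tokens
                        if t in KNOWLEDGE_KEYWORDS})
    return {"tokens": tokens, "formulas": formulas,
            "knowledge_points": knowledge}


print(preprocess("The root of the equation x^2+6x+9=0 is"))
```

Extracting formulas before stripping punctuation keeps operators such as "+" and "=" intact, which matters because formula expressions are compared separately from the ordinary words.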
The phrase structure tree building module is used for performing lexical analysis and grammar analysis on the topic according to the topic information acquired by the topic text preprocessing module, building a phrase structure tree in combination with the knowledge point information and formula expression information in the topic, and transmitting the phrase structure tree to the topic judging module. The specific method is as described above.
The topic judging module first prunes the phrase structure tree according to the phrase structure tree information of the topics to be compared, then traverses the phrase structure tree level by level, judges the similarity of the topics according to the tree structure information of the phrase structure tree and the topic content information, and processes the topics accordingly.
In the topic judging module, the phrase structure tree is first pruned and then traversed; the tree structure information of the phrase structure trees is compared, and then their content information is compared, including the knowledge point information involved in the topics, the formula information, and the specific topic content. The specific steps are as follows:
(1) Pruning the phrase structure tree:
Pruning operations are performed on the phrase structure tree, including pruning of parenthetical expressions (inserted phrases) and pruning of words with no practical meaning, such as modal particles, onomatopoeia, and punctuation nodes. Parenthetical expressions are independent elements of a sentence, and removing them simplifies the sentence. Words with no practical meaning carry little or no semantic information, so removing them does not affect the meaning the sentence expresses.
The part labeled PRN in the phrase structure tree is a parenthetical expression; it is pruned by deleting all of its child nodes and merging the remaining parts together. The part labeled Y is a modal particle, the part labeled O is an onomatopoeic word, and the part labeled PU is a punctuation node; these parts are pruned in the same way, by deleting all of their child nodes and merging the remaining parts together.
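The pruning rule can be sketched as a recursive filter. The labels PRN, Y, O and PU are taken from the description above; the tree representation (a `(label, children)` tuple whose children are either sub-tuples or plain word strings) and the function name `prune` are assumptions made for this illustration.

```python
# Labels named in the description: parenthetical, modal particle,
# onomatopoeia, punctuation.
PRUNE_LABELS = {"PRN", "Y", "O", "PU"}


def prune(node):
    """Return a copy of the tree without pruned nodes and their subtrees."""
    label, children = node
    kept = []
    for child in children:
        if isinstance(child, str):
            kept.append(child)            # a word leaf is kept as-is
        elif child[0] not in PRUNE_LABELS:
            kept.append(prune(child))     # keep the node, prune inside it
        # otherwise the child and everything below it is dropped
    return (label, kept)


# Invented sample tree containing one node of each prunable kind.
sample = ("S", [
    ("NP", [("NN", ["rabbit"])]),
    ("PRN", [("PU", [","]), ("VV", ["say"])]),
    ("VP", [("VV", ["run"]), ("Y", ["ah"])]),
    ("PU", ["."]),
])
print(prune(sample))
```

Because the surviving siblings stay attached to the same parent in their original order, the "merging of the remaining parts" falls out of the filtering step without extra work.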
(2) Traversing the phrase structure tree:
The phrase structure tree is traversed level by level (breadth-first). The algorithm is as follows:
initialize a queue Q and add the root node S of the phrase structure tree to the queue;
while queue Q is not empty:
    take the head node out of queue Q;
    visit the value of the node;
    for each child of the node that is not null and is not a leaf node, add the child to the queue.
(3) Comparing tree structure information of the phrase structure trees:
The tree structure information of the two phrase structure trees is compared. If it differs, the topics are judged to be different; otherwise, the content information of the phrase structure trees is compared next.
The specific comparison process is as follows:
During the level-order traversal, for the two phrase structure trees T_1 and T_2 to be compared, two queues P and Q are initialized, and the root nodes S_1 and S_2 of the two trees are added to P and Q respectively. The head nodes of the two queues, namely S_1 and S_2, are then taken out and compared. If the content of S_1 and S_2 is the same and the contents C_1 and C_2 of their child nodes are all the same, the child nodes C_1 and C_2 are added to queues P and Q respectively. Otherwise, the structures of the two phrase structure trees are directly judged to be different.
After each comparison, the two queues P and Q are checked. If neither queue is empty, the head nodes are taken out of the queues again and the comparison continues. If one queue is empty and the other is not, the structures of the two phrase structure trees are judged to be different. If both queues are empty, the structure comparison of the phrase structure trees ends.
(4) Comparing topic content information of the phrase structure trees:
The topic content information of the phrase structure trees is compared as follows:
First, the knowledge point information involved in the two topics is compared; if it differs, the topics are judged to be different. If the knowledge point information is the same, the formula expressions contained in the phrase structure trees are compared; if they differ, the topics are judged to be different. If the formula expression information is also the same, the specific content of the topics is compared. In comparing the topic content, different weights are assigned to the part-of-speech categories and the similarity of the two phrases is calculated; if the similarity is greater than a set threshold, the topics are judged to be the same, otherwise they are judged to be different. The similarity score is calculated as:
$$\mathrm{score} = \frac{\sum_{i} w_i \, c_i}{\sum_{i} w_i}$$
where w_i is the weight corresponding to the part of speech of the i-th segmented word in the leaf nodes of the phrase structure tree, and c_i is the result of comparing the i-th words of the two phrase structure trees: c_i = 1 if the words are identical, otherwise c_i = 0.
With the above method or system, the topics in a topic library can be compared one by one, so that identical or highly similar topics are deleted, which reduces the redundancy of the topic library and improves its quality.
Technical content not elaborated on in the present invention is well known to those skilled in the art.
While the foregoing describes illustrative embodiments of the present invention to facilitate understanding by those skilled in the art, it should be understood that the present invention is not limited to the scope of these embodiments; various changes that fall within the spirit and scope of the present invention as defined by the appended claims are protected.

Claims (8)

1. A similar topic identification method based on a phrase structure tree, characterized by comprising the following steps:
S1, performing text preprocessing on an input topic;
S2, constructing a phrase structure tree for the topic information;
S3, pruning the phrase structure tree, traversing the phrase structure tree, and judging the similarity of topics according to the tree structure and leaf node content of the phrase structure tree;
wherein the step of judging the similarity of topics comprises:
S33, comparing the structures of the phrase structure trees of the topics; if the tree structure information of the phrase structure trees is different, judging that the topics are different, otherwise proceeding to step S34;
S34, comparing whether the content information of the phrase structure trees is the same; if not, judging that the topics are different, otherwise judging that the topics are the same; the step of comparing the content information of the phrase structure trees comprises:
setting different weight values for part-of-speech categories, calculating the similarity of the two phrases, and judging that the topics are the same if the similarity is greater than a set threshold, otherwise judging that the topics are different;
the similarity is calculated as follows:
$$\mathrm{score} = \frac{\sum_{i} w_i \, c_i}{\sum_{i} w_i}$$
where w_i is the weight corresponding to the part of speech of the i-th segmented word in the leaf nodes of the phrase structure tree, and c_i is the result of comparing the i-th words of the two phrase structure trees: c_i = 1 if the words are identical, otherwise c_i = 0.
2. The method according to claim 1, wherein in the step S1, performing text preprocessing on an input topic comprises:
S11, performing unified encoding, word segmentation, stop word removal, and removal of useless and illegal characters to obtain a word sequence;
S12, analyzing and identifying the knowledge point information involved in the topic according to the keywords in the topic;
S13, analyzing the formula expression information in the topic according to regular expressions.
3. The method according to claim 2, wherein in the step S2, the step of constructing a phrase structure tree for the topic information comprises:
S21, performing lexical analysis on the word sequence;
S22, performing grammar analysis on the word sequence;
S23, constructing a phrase structure tree according to the results of the lexical analysis and the grammar analysis.
4. The method according to claim 1, wherein in the step S3, the step of pruning comprises:
S31, pruning parenthetical expressions (inserted phrases);
S32, pruning words with no practical meaning.
5. The method according to claim 1, wherein in the step S34, the step of comparing the content information of the phrase structure trees comprises:
comparing whether the knowledge point information involved in the topics is the same, and if not, judging that the topics are different;
comparing whether the formula expressions contained in the phrase structure trees are the same, and if not, judging that the topics are different.
6. A similar topic identification system based on a phrase structure tree, characterized by comprising a topic text preprocessing module, a phrase structure tree building module and a topic judging module, wherein:
the topic text preprocessing module is used for reading the topic information to be compared and the topic information of a topic library, performing the corresponding text preprocessing on the topic text, analyzing the knowledge point information and formula expression information in the topic, and finally transmitting the topic information to the phrase structure tree building module;
the phrase structure tree building module is used for performing lexical analysis and grammar analysis on the topic according to the topic information acquired by the topic text preprocessing module, building a phrase structure tree in combination with the knowledge point information and formula expression information in the topic, and transmitting the phrase structure tree to the topic judging module;
the topic judging module is used for pruning the phrase structure tree according to the phrase structure tree information of the topics to be compared, traversing the phrase structure tree level by level, judging the similarity of the topics according to the tree structure information of the phrase structure tree and the topic content information, and processing the topics accordingly;
in the topic judging module, the step of judging the similarity of topics comprises:
S33, comparing the structures of the phrase structure trees of the topics; if the tree structure information of the phrase structure trees is different, judging that the topics are different, otherwise proceeding to step S34;
S34, comparing whether the content information of the phrase structure trees is the same; if not, judging that the topics are different, otherwise judging that the topics are the same; the step of comparing the content information of the phrase structure trees comprises:
setting different weight values for part-of-speech categories, calculating the similarity of the two phrases, and judging that the topics are the same if the similarity is greater than a set threshold, otherwise judging that the topics are different;
the similarity is calculated as follows:
$$\mathrm{score} = \frac{\sum_{i} w_i \, c_i}{\sum_{i} w_i}$$
where w_i is the weight corresponding to the part of speech of the i-th segmented word in the leaf nodes of the phrase structure tree, and c_i is the result of comparing the i-th words of the two phrase structure trees: c_i = 1 if the words are identical, otherwise c_i = 0.
7. The system of claim 6, wherein in the topic text preprocessing module, the method of preprocessing the topic text comprises:
performing unified encoding, word segmentation, stop word removal, and removal of useless and illegal characters to obtain a word sequence;
analyzing and identifying the knowledge point information involved in the topic according to the keywords in the topic;
analyzing the formula expression information in the topic according to regular expressions.
8. The system of claim 6, wherein the topic judging module judges the similarity of topics according to the tree structure information of the phrase structure tree and the topic content information, the method comprising:
comparing whether the knowledge point information involved in the topics is the same, and if not, judging that the topics are different;
comparing whether the formula expressions contained in the phrase structure trees are the same, and if not, judging that the topics are different.
CN202010765054.2A 2020-08-03 2020-08-03 Similar topic identification method and system based on phrase structure tree Active CN111898343B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010765054.2A CN111898343B (en) 2020-08-03 2020-08-03 Similar topic identification method and system based on phrase structure tree


Publications (2)

Publication Number Publication Date
CN111898343A CN111898343A (en) 2020-11-06
CN111898343B true CN111898343B (en) 2023-07-14

Family

ID=73184054

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010765054.2A Active CN111898343B (en) 2020-08-03 2020-08-03 Similar topic identification method and system based on phrase structure tree

Country Status (1)

Country Link
CN (1) CN111898343B (en)





Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant