CN105824798A

CN105824798A - Method for de-duplicating examination questions in a question bank based on examination question keyword similarity

Info

Publication number: CN105824798A
Application number: CN201610117476.2A
Authority: CN
Inventors: 江龙; 李泽河; 曹俊豪; 张德刚; 王达达
Original assignee: Education Training and Evaluation Center of Yunnan Power Grid Co Ltd
Current assignee: Education Training and Evaluation Center of Yunnan Power Grid Co Ltd
Priority date: 2016-03-03
Filing date: 2016-03-03
Publication date: 2016-08-03

Abstract

The invention relates to a method for de-duplicating examination questions in a question bank based on examination question keyword similarity. The method comprises the following steps: first, Chinese word segmentation is performed on the examination questions to obtain segmentation tokens; each segmentation token is checked against the keyword library and, if it is a keyword, is added to a relational database of examination questions and keywords; the similarity of any two examination questions to be detected in that database is then calculated using the inner product; next, it is judged whether the two questions are dissimilar, and similar questions are added to a duplicate-question relational database; a duplicate-question list is then retrieved from the duplicate-question relational database according to a similarity condition; finally, an administrator reviews the duplicate-question list and manually confirms whether the questions are duplicates. In the disclosed method, Chinese word segmentation is performed on the stems, candidate options and answers of the examination questions, and the resulting tokens are analyzed, so that the questions are analyzed in depth and de-duplication is accurate. The method can be widely used in the field of examination question de-duplication.

Description

Method for de-duplicating examination questions in a question bank based on examination question keyword similarity

Technical field

The present invention relates to a method for de-duplicating examination questions, and in particular to a method for de-duplicating examination questions in a question bank based on examination question keyword similarity.

Background art

As examinations of all kinds have been held over the years, the number of questions accumulated in question banks has kept growing, gradually forming massive question banks. Because the questions for a given course are written jointly, at different times, by experts of different specialties and levels, the bank ends up containing questions of different types (multiple-choice, fill-in-the-blank, true/false, short-answer and so on) and of different difficulty whose meaning is nevertheless similar or identical, i.e. duplicate questions. Although duplicate questions can take many forms, they fall into the following two classes:

(1) questions whose text is identical or very close and whose answers are identical;

(2) questions whose wording or question type differs but which examine the same knowledge.

Existing examination systems judge duplicate questions only by checking whether the stem text is identical. Analyzing the stem text alone has very limited power to detect duplicates, which hinders test-point analysis, test paper assembly and question bank construction. Moreover, judging duplicates only by whether the stem text is identical is not comprehensive: the duplicate-question recognition rate is low and the precision is insufficient.

Summary of the invention

In view of the above problems, the object of the invention is to provide a method for de-duplicating examination questions in a question bank based on examination question keyword similarity, so as to solve the problems of a low duplicate-question recognition rate and insufficient precision.

To achieve the above object, the invention adopts the following technical solution: a method for de-duplicating examination questions in a question bank based on examination question keyword similarity, comprising the following steps: 1) performing Chinese word segmentation on the examination questions in the question bank using the forward maximum matching algorithm, the segmentation covering the stem, the candidate options and the answer of each question, the resulting segments being referred to as segmentation tokens; judging whether each segmentation token is a keyword in the examination question keyword library and, if so, adding it to the relational database of examination questions and keywords, which records the frequency of occurrence of each keyword, the keyword weight and the order in which the keywords occur; wherein the examination question keywords are preset in the keyword library; 2) calculating, using the inner product, the similarity between any two examination questions to be detected in the relational database of examination questions and keywords; 3) comparing the similarity expressed by the inner product with a duplicate-question threshold: if it is less than or equal to the preset threshold, performing step 4); if it is greater than the preset threshold, performing step 5); 4) the two examination questions to be detected are dissimilar, and no processing is performed; 5) the two examination questions to be detected are similar, and the similar questions are added to the duplicate-question relational database; 6) retrieving, according to a similarity condition, the list of duplicate questions that satisfy the condition from the duplicate-question relational database; 7) an administrator reviews the duplicate-question list to confirm the duplicates, manually judging whether the questions are duplicates.

In step 3), the duplicate-question threshold applied to the inner-product similarity when judging whether examination questions are duplicates is 0.80.

By adopting the above technical solution, the invention has the following advantages: the invention first performs Chinese word segmentation on the examination questions in the question bank using the forward maximum matching algorithm, the resulting segments being referred to as segmentation tokens; it judges whether each segmentation token is a keyword in the examination question keyword library and, if so, adds it to the relational database of examination questions and keywords; it then calculates, using the inner product, the similarity between any two examination questions to be detected in that database; next, it compares the inner-product similarity with the duplicate-question threshold to judge whether the two questions are dissimilar, and adds similar questions to the duplicate-question relational database; it then retrieves, according to a similarity condition, the list of duplicate questions that satisfy the condition from the duplicate-question relational database; finally, an administrator reviews the duplicate-question list and manually judges whether the questions are duplicates. Through these steps, the invention segments not only the stem of each question but also its candidate options and answer, and analyzes the resulting tokens comprehensively, so that the questions are analyzed in depth. In addition, manual confirmation improves the accuracy of duplicate judgment and the precision of duplicate removal. The invention can therefore be widely applied in the field of examination question de-duplication.

Brief description of the drawings

In order to explain the embodiments of the invention or the technical solutions of the prior art more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. The drawings described below are merely some embodiments of the invention; a person of ordinary skill in the art can obtain other drawings from them without creative effort.

Fig. 1 is the flow chart of the present invention.

Detailed description of the invention

The technical solutions in the embodiments of the invention are described clearly and completely below with reference to the drawings. The described embodiments are only some, not all, of the embodiments of the invention. All other embodiments obtained by a person of ordinary skill in the art on the basis of these embodiments without creative effort fall within the scope of protection of the invention.

Embodiment

As shown in Fig. 1, the method of the invention for de-duplicating examination questions in a question bank based on examination question keyword similarity comprises the following steps:

1) Chinese word segmentation is performed on the examination questions in the question bank using the forward maximum matching algorithm; the segmentation covers the stem, the candidate options and the answer of each question, and the resulting segments are referred to as segmentation tokens. It is judged whether each segmentation token is a keyword in the examination question keyword library; if so, it is added to the relational database of examination questions and keywords, which records the frequency of occurrence of each keyword, the keyword weight and the order in which the keywords occur. The examination question keywords, for example T_1, T_2, ..., T_m, are preset in the keyword library.

It should be noted that the forward maximum matching algorithm is well known to those skilled in the art, so only a brief illustration is given here.

A. Chinese word segmentation is performed on the examination questions in the question bank using the forward maximum matching algorithm; the segmentation covers the stem, the candidate options and the answer of each question, and the resulting segments are referred to as segmentation tokens;

The embodiment of the invention uses the forward maximum matching algorithm to segment the examination questions. The algorithm scans the question to be segmented from left to right and matches runs of consecutive characters against a vocabulary; if the characters obtained match a word in the vocabulary, a word is segmented out; otherwise nothing is done. To obtain a maximum match, however, the text cannot be cut as soon as the first match is found, as the following example illustrates:

Question to be segmented: content[] = {"直", "线", "杆", "塔", "的", "垂", "直", "档", "距", "越", "大", "绝", "缘", "子", "串", "所", "受", "的", "荷", "重", "就", "越", "大"}, i.e. the sentence "直线杆塔的垂直档距越大绝缘子串所受的荷重就越大" ("the larger the vertical span of a straight-line tower, the greater the load borne by the insulator string"), where content[1] is "直", content[2] is "线", content[3] is "杆", content[4] is "塔", content[5] is "的", and so on up to content[23], which is "大".

Vocabulary: dict[] = {"直线", "杆塔", "直线杆塔", "绝缘子"}, where dict[1] is "直线" (straight line), dict[2] is "杆塔" (tower), dict[3] is "直线杆塔" (straight-line tower) and dict[4] is "绝缘子" (insulator).

The forward maximum matching procedure for the question to be segmented is as follows:

1. Starting from content[1], when content[2] is scanned, "直线" is found in the vocabulary dict[]. It cannot be cut out yet, because it is not known whether it is the longest possible word, i.e. the maximum match, so scanning must continue;

2. Scanning continues to content[3]. "直线杆" is not a word in dict[], but it still cannot be concluded that the "直线" found earlier is the longest word, because "直线杆" is a prefix of dict[3], so scanning must continue;

3. Scanning continues to content[4]. "直线杆塔" is a word in dict[], but it still cannot be cut out, because it is not known whether it is the longest possible word, i.e. the maximum match, so scanning must continue;

4. Scanning continues to content[5]. "直线杆塔的" is neither a word in the vocabulary nor a prefix of one. The longest word, "直线杆塔", can therefore be segmented out.

As the procedure above shows, a maximum match can be finalized only once the next scanned string is neither a word in the vocabulary nor a prefix of a word in the vocabulary.
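The scanning procedure above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the function name `forward_max_match`, the precomputed prefix set, and the fallback that emits an unmatched character as a single-character token are all assumptions, and the sentence and vocabulary are the Chinese strings assumed to underlie the example.

```python
def forward_max_match(text, vocab):
    """Segment text left to right, always taking the longest vocabulary word."""
    # Every prefix of every vocabulary word, so a scan knows when to stop.
    prefixes = {w[:i] for w in vocab for i in range(1, len(w) + 1)}
    tokens, start = [], 0
    while start < len(text):
        longest = 0  # length of the longest vocabulary word starting at `start`
        end = start + 1
        # Keep scanning while the candidate is still a prefix of some word.
        while end <= len(text) and text[start:end] in prefixes:
            if text[start:end] in vocab:
                longest = end - start
            end += 1
        if longest == 0:
            longest = 1  # assumption: unmatched characters become 1-char tokens
        tokens.append(text[start:start + longest])
        start += longest
    return tokens


vocab = {"直线", "杆塔", "直线杆塔", "绝缘子"}
sentence = "直线杆塔的垂直档距越大绝缘子串所受的荷重就越大"
tokens = forward_max_match(sentence, vocab)
```

The inner loop stops only when the scanned string is no longer a prefix of any vocabulary word, which is why "直线" is not cut out before "直线杆塔" has been tried.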

B. It is judged whether each segmentation token is a keyword in the examination question keyword library; if so, it is added to the relational database of examination questions and keywords, which records the frequency of occurrence of each keyword, the keyword weight and the order in which the keywords occur;
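One possible shape for the rows produced in this step is sketched below. The source only states that frequency, weight and order of occurrence are recorded; the function `keyword_records`, the field names and the weight values are hypothetical choices made for illustration.

```python
def keyword_records(question_id, tokens, keyword_weights):
    """Return one row per distinct keyword found in a segmented question.

    keyword_weights maps each preset keyword to its weight (hypothetical values).
    """
    rows = {}  # dicts keep insertion order, i.e. order of first occurrence
    for token in tokens:
        if token not in keyword_weights:
            continue  # not a preset keyword: ignored
        if token not in rows:
            rows[token] = {
                "question_id": question_id,
                "keyword": token,
                "frequency": 0,
                "weight": keyword_weights[token],
                "order": len(rows) + 1,  # rank of first appearance
            }
        rows[token]["frequency"] += 1
    return list(rows.values())


rows = keyword_records(
    question_id=1,
    tokens=["直线杆塔", "的", "绝缘子", "串", "直线杆塔"],
    keyword_weights={"直线杆塔": 1.0, "绝缘子": 1.0},
)
```

Each returned dictionary corresponds to one row that could be written to the relational database of examination questions and keywords.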

The keywords and examination questions are stored in a trie structure. With this form of storage, the time complexity of looking up a word is O(word.length), and it is easy to tell whether a string has matched a complete word or only the prefix of a word. The storage structure is as follows:

1. Each node holds one Chinese character of a word;

2. The pointers in a node point to the Chinese characters that follow this character in some word. These pointers are stored in a hash structure keyed by Chinese character;

3. A "#" in a node indicates that the Chinese character in the current node is the last character of the word formed along the path from the root node to this node.
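The storage structure described above can be sketched as follows. This is an illustrative sketch: the class names, the `lookup` method and its three return values are assumptions, a Python `dict` stands in for the hash structure, and a boolean flag stands in for the "#" mark.

```python
class TrieNode:
    def __init__(self):
        self.children = {}   # hash structure keyed by the next character
        self.is_end = False  # plays the role of the "#" mark


class Trie:
    def __init__(self, words=()):
        self.root = TrieNode()
        for word in words:
            self.insert(word)

    def insert(self, word):
        node = self.root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.is_end = True

    def lookup(self, s):
        """Return "word", "prefix" or "none"; cost is O(len(s))."""
        node = self.root
        for ch in s:
            node = node.children.get(ch)
            if node is None:
                return "none"
        return "word" if node.is_end else "prefix"


trie = Trie(["直线", "杆塔", "直线杆塔", "绝缘子"])
```

A single walk from the root answers both questions the segmenter needs: whether the scanned string is a complete word, and whether scanning further could still reach a longer word.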

2) The similarity between any two examination questions to be detected in the relational database of examination questions and keywords is calculated using the inner product;

In the traditional vector space model, the examination question to be detected and the question it is compared with are both represented as vectors whose elements are examination question keywords. Each keyword is assigned a weight according to its term frequency TF and inverse document frequency IDF (TF-IDF, term frequency-inverse document frequency), and the similarity between the questions to be detected is then calculated from the cosine of the angle between the vectors or from another measure. Here the similarity between the questions to be detected is obtained by computing the cosine of the angle between the vectors in Euclidean space.
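As an illustration of TF-IDF weighting, the sketch below uses one common variant, weight = (count / question length) × ln(N / df). The source does not say which variant is used, so the formula, the function name `tf_idf_weights` and the sample keywords are assumptions.

```python
import math
from collections import Counter


def tf_idf_weights(questions):
    """questions: list of keyword lists, one list per examination question."""
    n = len(questions)
    # Document frequency: in how many questions each keyword appears.
    df = Counter(k for q in questions for k in set(q))
    weights = []
    for q in questions:
        tf = Counter(q)
        weights.append(
            {k: (count / len(q)) * math.log(n / df[k]) for k, count in tf.items()}
        )
    return weights


questions = [["insulator", "tower", "tower"], ["insulator", "span"]]
weights = tf_idf_weights(questions)
```

Under this variant a keyword that appears in every question receives weight 0, since it does not help to tell questions apart.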

In the examination question vector space model, each examination question is composed of the mutually independent keywords T_1, T_2, ..., T_m. Let D = (D_1, D_2, ..., D_n) be the set of n examination questions indexed by the m keywords, where D_j = (d_{1j}, d_{2j}, ..., d_{mj})^T is an examination question vector and d_{ij} is the frequency weight of keyword i in question j, with 1 ≤ i ≤ m. The query vector Q is expressed as Q = (q_1, q_2, ..., q_m)^T, where q_i is the frequency weight of keyword i in the query. This defines an m-dimensional keyword content vector space, i.e. the examination question keyword vector space.
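A small sketch of how a question and a query might be mapped into this m-dimensional space is given below. The function name `question_vector`, the placeholder keyword names and the sample weights are hypothetical.

```python
def question_vector(keywords, frequency_weights):
    """Return the vector (d_1j, ..., d_mj) for one question.

    keywords          -- ordered list T_1..T_m shared by all questions
    frequency_weights -- dict mapping keyword -> frequency weight in this question
    """
    # Keywords that do not occur in the question get weight 0.
    return [float(frequency_weights.get(k, 0.0)) for k in keywords]


keywords = ["T1", "T2", "T3", "T4"]            # m = 4 preset keywords
D_1 = question_vector(keywords, {"T1": 2, "T3": 1})
Q = question_vector(keywords, {"T3": 1, "T4": 3})
```

Because every vector is built over the same ordered keyword list, component k of any two vectors always refers to the same keyword.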

The similarity of the examination questions to be detected is calculated with the inner product formula. Let D_i and D_j be any two different examination questions in the set D, with D_i = (d_{1i}, d_{2i}, ..., d_{mi})^T and D_j = (d_{1j}, d_{2j}, ..., d_{mj})^T. The inner-product similarity between D_i and D_j is then:

$$\mathrm{Sim}(D_i, D_j) = \sum_{k=1}^{m} d_{ki}\, d_{kj}$$

where d_{ki} is the frequency weight of keyword k in examination question D_i, d_{kj} is the frequency weight of keyword k in examination question D_j, and 1 ≤ k ≤ m.
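The inner product formula can be written directly in Python. The function name `inner_product_similarity` and the sample vectors are illustrative assumptions.

```python
def inner_product_similarity(d_i, d_j):
    """Sum over k of d_ki * d_kj for two question vectors of equal length m."""
    if len(d_i) != len(d_j):
        raise ValueError("vectors must be built over the same m keywords")
    return sum(a * b for a, b in zip(d_i, d_j))


D_i = [2.0, 0.0, 1.0]
D_j = [1.0, 3.0, 4.0]
sim = inner_product_similarity(D_i, D_j)  # 2*1 + 0*3 + 1*4
```

Only keywords that occur in both questions contribute to the sum, since any term with a zero weight on either side vanishes.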

3) The similarity expressed by the inner product is compared with the duplicate-question threshold: if it is less than or equal to the preset threshold, step 4) is performed; if it is greater than the preset threshold, step 5) is performed;

The cosine of the included angle mentioned above measures the size of the angle between two vectors and is also known as the similarity coefficient. Geometrically, in an N-dimensional space spanned by N elements, it is the cosine of the angle between two vectors. Before use, the elements of each vector generally need to be normalized (made dimensionless) so that every element is non-negative; the cosine then takes values in [0, 1]. The larger the value, the smaller the angle between the two vectors and the closer they are; when the value is 1, the two vectors are identical in direction.
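The relationship between the inner product and the cosine can be illustrated as follows: if each vector is first scaled to unit length, the inner product of the scaled vectors equals the cosine of the angle between the originals. The source does not spell out its normalization, so the unit-length scaling and the function names `unit` and `cosine` are assumptions.

```python
import math


def unit(v):
    """Scale v to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("a question with no keywords has no direction")
    return [x / norm for x in v]


def cosine(d_i, d_j):
    """Inner product of the unit-length versions of d_i and d_j."""
    return sum(a * b for a, b in zip(unit(d_i), unit(d_j)))


same_direction = cosine([1.0, 2.0, 2.0], [2.0, 4.0, 4.0])
no_shared_keywords = cosine([1.0, 0.0], [0.0, 1.0])
```

With non-negative weights the result lies in [0, 1], which is what makes a fixed threshold such as 0.80 meaningful across questions of different lengths.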

When Sim(D_i, D_j) ≤ 0.80, step 4) is performed;

When Sim(D_i, D_j) > 0.80, step 5) is performed;

4) The two examination questions to be detected are dissimilar, and no processing is performed;

5) The two examination questions to be detected are similar, and the similar questions are added to the duplicate-question relational database;
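Steps 3) to 5) can be sketched as a pairwise comparison against the 0.80 threshold. In this sketch the vectors are assumed to be already scaled to unit length, a plain list stands in for the duplicate-question relational database, and the function name `find_similar_pairs` is hypothetical.

```python
from itertools import combinations

THRESHOLD = 0.80  # duplicate-question threshold stated in the source


def find_similar_pairs(vectors, threshold=THRESHOLD):
    """vectors: dict mapping question id -> unit-length keyword vector."""
    duplicates = []
    for (id_a, v_a), (id_b, v_b) in combinations(sorted(vectors.items()), 2):
        sim = sum(x * y for x, y in zip(v_a, v_b))
        if sim > threshold:  # step 5): similar, record the pair
            duplicates.append((id_a, id_b, sim))
        # step 4): sim <= threshold, dissimilar, nothing to do
    return duplicates


pairs = find_similar_pairs({1: [1.0, 0.0], 2: [1.0, 0.0], 3: [0.0, 1.0]})
boundary = find_similar_pairs({1: [1.0, 0.0], 2: [0.8, 0.6]})
```

The comparison is strict, so a pair whose similarity is exactly 0.80 is treated as dissimilar, matching the "less than or equal to" branch of step 3).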

6) According to a similarity condition, the list of duplicate questions that satisfy the condition is retrieved from the duplicate-question relational database;

The similarity condition is a condition well known to those skilled in the art and is therefore not described in detail.
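Since the source leaves the similarity condition unspecified, the sketch below assumes the simplest one, a minimum similarity value, and uses an in-memory SQLite database. The table name `duplicate_pairs`, its columns, the sample rows and the function `duplicate_list` are all hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE duplicate_pairs ("
    "question_a INTEGER, question_b INTEGER, similarity REAL)"
)
conn.executemany(
    "INSERT INTO duplicate_pairs VALUES (?, ?, ?)",
    [(1, 2, 0.95), (1, 3, 0.83), (4, 5, 0.90)],  # made-up sample rows
)


def duplicate_list(conn, min_similarity):
    """Return duplicate pairs whose similarity meets the assumed condition."""
    cur = conn.execute(
        "SELECT question_a, question_b, similarity FROM duplicate_pairs "
        "WHERE similarity >= ? ORDER BY similarity DESC",
        (min_similarity,),
    )
    return cur.fetchall()


rows = duplicate_list(conn, 0.90)
```

Sorting by descending similarity puts the most likely duplicates at the top of the list that the administrator reviews in the next step.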

7) An administrator reviews the duplicate-question list to confirm the duplicates, manually judging whether the questions are duplicates. Manual confirmation improves the accuracy of duplicate judgment and the precision of duplicate removal.

The above embodiments are merely illustrative of the invention; the structure, connection and manufacturing process of each part may vary, and any equivalent transformation or improvement made on the basis of the technical solution of the invention shall not be excluded from the scope of protection of the invention.

Claims

1. A method for de-duplicating examination questions in a question bank based on examination question keyword similarity, comprising the following steps:

1) performing Chinese word segmentation on the examination questions in the question bank using the forward maximum matching algorithm, the segmentation covering the stem, the candidate options and the answer of each question, the resulting segments being referred to as segmentation tokens; judging whether each segmentation token is a keyword in the examination question keyword library and, if so, adding it to the relational database of examination questions and keywords, which records the frequency of occurrence of each keyword, the keyword weight and the order in which the keywords occur;

wherein the examination question keywords are preset in the keyword library;

7) an administrator reviews the duplicate-question list to confirm the duplicates, manually judging whether the questions are duplicates.

2. The method for de-duplicating examination questions in a question bank based on examination question keyword similarity according to claim 1, characterized in that: in step 3), the duplicate-question threshold applied to the inner-product similarity when judging whether examination questions are duplicates is 0.80.