CN109086313A

CN109086313A - A method for deduplicating examination questions during question setting based on inverse text similarity

Info

Publication number: CN109086313A
Application number: CN201810681320.6A
Authority: CN
Inventors: 马赫
Original assignee: Individual
Current assignee: Individual
Priority date: 2018-06-27
Filing date: 2018-06-27
Publication date: 2018-12-25

Abstract

The invention discloses a method for deduplicating examination questions during question setting based on inverse text similarity, comprising the following steps. First, when a question setter uploads questions to the original question bank system, every uploaded question must undergo duplicate checking. For duplicate checking, the original question bank is classified by element, and each question is then compared for similarity, one by one, only against the questions in the corresponding element category of the original bank. When multiple question setters upload their questions to the original question bank at the same time, the questions are first uploaded to a temporary question bank system, where each must first be compared with the questions already in the temporary bank; the questions in the temporary bank are then checked for duplicates against the questions in the formal question bank system. The invention is simple and reasonable in design and easy to operate; it reduces manual labor, improves the efficiency of question deduplication, effectively prevents similar questions from being uploaded, and improves the quality of question review. It is safe, stable, and widely applicable, and is suitable for popularization.

Description

A method for deduplicating examination questions during question setting based on inverse text similarity

Technical field

The invention belongs to the technical field of examination question setting, and more specifically relates to a method for deduplicating examination questions during question setting based on inverse text similarity.

Background technique

During question setting, the similarity between questions directly affects the quality of an examination. Reducing that similarity through purely manual, mechanical checking and judgment is inefficient, time-consuming, and laborious, and its effectiveness depends on the knowledge and ability of the question reviewers. To improve deduplication quality, computers are used to deduplicate questions. However, existing question bank systems hold an enormous number of original questions; if every uploaded question must be compared against them one by one, the computational workload is very large, which places high demands on the deduplication algorithm and on the performance of the machine that runs it. Traditional solutions are mechanical, manual review methods whose results depend on the reviewers' knowledge and ability. Similarity analysis based on string pattern matching can handle duplicates caused by textual similarity, but it is powerless against questions that are semantically similar, which hinders wide adoption.

Summary of the invention

The purpose of the present invention is to overcome the shortcomings of the prior art by proposing a method for deduplicating examination questions during question setting based on inverse text similarity.

To achieve the above object, the invention provides the following technical scheme:

A method for deduplicating examination questions during question setting based on inverse text similarity, comprising the following steps:

S1. First, when a question setter uploads questions to the original question bank system, every uploaded question must undergo duplicate checking, to prevent identical or similar questions from being uploaded;

S2. For duplicate checking, the original question bank is classified by element, that is, by the domain to which each question belongs. During comparison it is then only necessary to determine the element corresponding to the question being checked and to compare its similarity, one by one, with each question in the corresponding element category of the original bank;

S3. To judge the similarity of two questions, a high-quality, usable corpus and word segmentation library are first built from the self-built question bank and publicly available question banks, and a word segmenter is used with the dictionary to perform lexical analysis on the questions;

S4. The term frequency (TF) of each word in a question is then calculated, the inverse document frequency (IDF) is calculated from the corpus, and the TF-IDF of each word is calculated from these;

S5. A feature vector is then constructed for each question from the segmented words and their TF-IDF values, quantifying the lexical meaning and semantics of the question;

S6. Finally, the similarity of the feature vectors of each pair of questions being compared is calculated using the law of cosines, and this serves as the basis for similarity-based deduplication;

S7. When multiple question setters upload their questions to the original question bank at the same time, all questions submitted by the setters within a given period are uploaded to a temporary question bank system. Each newly uploaded question must first be compared with the questions already in the temporary bank; if questions uploaded to the temporary bank by two setters at the same time are similar or identical, the question uploaded first prevails;

S8. The questions in the temporary bank are then checked for duplicates against the questions in the formal question bank system, and are then merged into the formal question bank system.
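The two-stage flow of steps S7 and S8 can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the names `find_duplicate`, `process_uploads`, and `word_overlap` are hypothetical, questions are plain strings, and `word_overlap` is only a stand-in similarity function used to keep the example short (the method itself uses TF-IDF vectors and cosine similarity, as in steps S3 to S6).

```python
def word_overlap(a, b):
    """Stand-in similarity (NOT the TF-IDF cosine of steps S3-S6):
    shared words divided by all distinct words in the two questions."""
    words_a, words_b = set(a.split()), set(b.split())
    union = words_a | words_b
    return len(words_a & words_b) / len(union) if union else 0.0


def find_duplicate(question, pool, similarity, threshold=0.9):
    """Return the first question in `pool` whose similarity to `question`
    exceeds `threshold`, or None if there is no such question."""
    for existing in pool:
        if similarity(question, existing) > threshold:
            return existing
    return None


def process_uploads(uploads, formal_pool, similarity, threshold=0.9):
    """Deduplicate `uploads` (given in upload order) in two stages.

    Stage 1 (S7): compare each upload with the temporary pool; when two
    uploads collide, the one uploaded first is kept.
    Stage 2 (S8): compare the temporary pool with the formal pool, then
    merge the surviving questions into the formal pool.
    """
    temporary_pool, rejected = [], []
    for question in uploads:
        if find_duplicate(question, temporary_pool, similarity, threshold) is None:
            temporary_pool.append(question)
        else:
            rejected.append(question)

    accepted = []
    for question in temporary_pool:
        if find_duplicate(question, formal_pool, similarity, threshold) is None:
            accepted.append(question)
        else:
            rejected.append(question)

    formal_pool.extend(accepted)
    return accepted, rejected
```

Iterating over the uploads in arrival order is what makes the first uploader prevail: a later, similar question finds the earlier one already in the temporary pool and is rejected.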

Preferably, in step S1, when a question setter uploads multiple questions, the setter's own questions must first be checked against one another for duplicates, so that the setter's uploads do not contain similar or same-type questions.

Preferably, the calculation in step S4 proceeds as follows: 1) calculate the term frequency: term frequency = number of occurrences of a word in the question / total number of words in the question; 2) calculate the inverse document frequency (IDF): inverse document frequency = log(total number of questions in the corpus / (number of questions containing the word + 1)); 3) calculate the term frequency-inverse document frequency (TF-IDF): TF-IDF = term frequency × inverse document frequency.
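The three formulas of step S4 can be sketched in Python as follows. This is a minimal illustration rather than the patented implementation: the function names are hypothetical, each question is assumed to have already been segmented into a list of words, the natural logarithm is assumed because the text does not state the base, and words absent from the corpus are given an IDF of 0.0 purely for convenience.

```python
import math
from collections import Counter


def term_frequency(tokens):
    """TF = occurrences of a word in the question / total words in the question."""
    counts = Counter(tokens)
    total = len(tokens)
    return {word: n / total for word, n in counts.items()}


def inverse_document_frequency(corpus):
    """IDF = log(total questions in corpus / (questions containing the word + 1)).

    `corpus` is a list of questions, each a list of words.
    The natural logarithm is an assumption; the source does not give the base.
    """
    n_questions = len(corpus)
    doc_freq = Counter()
    for tokens in corpus:
        doc_freq.update(set(tokens))  # count each word once per question
    return {word: math.log(n_questions / (df + 1)) for word, df in doc_freq.items()}


def tf_idf(tokens, idf):
    """TF-IDF = TF * IDF for every word of one question.

    Words missing from `idf` get weight 0.0 (an assumption of this sketch).
    """
    tf = term_frequency(tokens)
    return {word: tf[word] * idf.get(word, 0.0) for word in tf}
```

With the "+ 1" in the denominator, a word that occurs in every question of the corpus receives a slightly negative IDF under this formula, since the ratio inside the logarithm is then below 1.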

Preferably, in step S5 the text is represented as a numerical feature vector, which is achieved mainly in two steps: 1. create a unique label for each word across the entire question set (which contains many questions); 2. construct a feature vector for each question, consisting mainly of the number of times each word occurs in that question.
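The two steps of S5 can be sketched as follows. This is an illustrative sketch only: the names `build_vocabulary` and `count_vector` are hypothetical, questions are assumed to be pre-segmented word lists, the "unique label" is assumed to be an integer index, and words not seen in the question set are simply ignored.

```python
def build_vocabulary(questions):
    """Step 1: give every distinct word in the question set a unique integer label."""
    vocabulary = {}
    for tokens in questions:
        for word in tokens:
            if word not in vocabulary:
                vocabulary[word] = len(vocabulary)
    return vocabulary


def count_vector(tokens, vocabulary):
    """Step 2: build one question's feature vector of word occurrence counts.

    Position i holds the count of the word labeled i; words that are not in
    the vocabulary are skipped (an assumption of this sketch).
    """
    vector = [0] * len(vocabulary)
    for word in tokens:
        index = vocabulary.get(word)
        if index is not None:
            vector[index] += 1
    return vector
```

Because every question is mapped onto the same vocabulary, all feature vectors have the same length and can be compared component by component.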

Preferably, step S6 is divided into four steps: (1) count the frequency of every word occurring in the two texts being compared, to obtain the vector corresponding to each text; (2) calculate the cosine of the angle between the two vectors using the law of cosines; (3) the cosine of the angle between the two vectors ranges from 0 to 1, where 1 means identical and 0 means completely different, and whether the two texts are similar is judged against a user-defined threshold; (4) the threshold is set to 90%: if the value is greater than 90%, the two questions being compared are deemed identical; if it is lower than 90%, the similarity value is returned for reference by the uploader and the question reviewers, providing a basis for manual judgment.
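The comparison in step S6 can be sketched as follows. This is a hedged illustration: the names `cosine_similarity`, `compare`, and `THRESHOLD` are hypothetical, the two vectors are assumed to have equal length, and returning 0.0 for an all-zero vector is a choice made for this sketch because the source does not address that case.

```python
import math

THRESHOLD = 0.9  # the 90% threshold named in the text


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # assumption: an empty vector is similar to nothing
    return dot / (norm_u * norm_v)


def compare(u, v, threshold=THRESHOLD):
    """Return (is_duplicate, similarity).

    Above the threshold the pair is treated as identical; otherwise the
    similarity value is still returned so that the uploader and the
    reviewers can judge manually.
    """
    similarity = cosine_similarity(u, v)
    return similarity > threshold, similarity
```

The 0 to 1 range stated in the text holds when all vector components are non-negative, as is the case for word counts.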

Preferably, in step S8, to reduce the order of magnitude of the computation, the elements of the question bank can be further divided into sub-elements, which further reduces the number of comparisons and speeds up deduplication.
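One way to picture the element and sub-element classification is as a bank that stores questions in buckets, so that a new question is compared only with the questions in its own bucket. The sketch below is an assumption-laden illustration: the class `QuestionBank`, its methods, and the example element names are all hypothetical and do not come from the source.

```python
from collections import defaultdict


class QuestionBank:
    """Question bank indexed by (element, sub_element)."""

    def __init__(self):
        self._buckets = defaultdict(list)

    def add(self, element, sub_element, question):
        """Store a question under its element and sub-element."""
        self._buckets[(element, sub_element)].append(question)

    def candidates(self, element, sub_element):
        """Return only the questions a new upload must be compared with."""
        return list(self._buckets.get((element, sub_element), []))

    def total(self):
        """Number of questions in the whole bank."""
        return sum(len(bucket) for bucket in self._buckets.values())
```

The number of pairwise comparisons for an upload then equals the size of one bucket rather than the size of the whole bank, which is the reduction the sub-element classification is meant to achieve.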

Technical effects and advantages of the invention: compared with traditional methods, the method provided by the invention for deduplicating examination questions during question setting based on inverse text similarity first builds a high-quality, usable corpus and word segmentation library, and uses a word segmenter with the dictionary to perform lexical analysis on the questions; calculates the term frequency (TF) of each word in a question, calculates the inverse document frequency (IDF) from the corpus, and from these calculates the TF-IDF of each word; constructs a feature vector for each question from the segmented words and their TF-IDF values, quantifying the lexical meaning and semantics of the question; and calculates the similarity of the feature vectors of each pair of questions using the law of cosines, which serves as the basis for similarity-based deduplication. Because the similarity analysis based on "inverse text similarity" relies on a corpus that can be continually trained and updated, it provides a more accurate and more scientific similarity measure for duplicates caused by both textual and semantic similarity. In addition, questions uploaded by multiple setters can first pass through a temporary question bank for preliminary deduplication before the deduplicated questions are uploaded to the formal bank, reducing the repetition rate or similarity of uploaded questions. Classifying the question bank by key elements and sub-elements improves deduplication efficiency. The invention is simple and reasonable in design and easy to operate; it reduces manual labor, improves deduplication efficiency, effectively prevents similar questions from being uploaded, and improves the quality of question review. It is safe, stable, and widely applicable, and is suitable for popularization.

Specific embodiment

To make the objectives, technical solutions, and advantages of the present invention clearer, the invention is described in further detail below. It should be understood that the specific embodiments described here serve only to explain the invention and are not intended to limit it. All other embodiments obtained by those of ordinary skill in the art on the basis of the specific embodiments of the present invention, without creative effort, fall within the protection scope of the present invention.

S7. When multiple question setters upload their questions to the original question bank at the same time, all questions submitted by the setters within a given period are uploaded to a temporary question bank system. Each newly uploaded question must first be compared with the questions already in the temporary bank; if questions uploaded to the temporary bank by two setters at the same time are similar or identical, the question uploaded first prevails;

Further, in step S1, when a question setter uploads multiple questions, the setter's own questions must first be checked against one another for duplicates, so that the setter's uploads do not contain similar or same-type questions.

Further, the calculation in step S4 proceeds as follows: 1) calculate the term frequency: term frequency = number of occurrences of a word in the question / total number of words in the question; 2) calculate the inverse document frequency (IDF): inverse document frequency = log(total number of questions in the corpus / (number of questions containing the word + 1)); 3) calculate the term frequency-inverse document frequency (TF-IDF): TF-IDF = term frequency × inverse document frequency.

Further, in step S5 the text is represented as a numerical feature vector, which is achieved mainly in two steps: 1. create a unique label for each word across the entire question set (which contains many questions); 2. construct a feature vector for each question, consisting mainly of the number of times each word occurs in that question.

Further, step S6 is divided into four steps: (1) count the frequency of every word occurring in the two texts being compared, to obtain the vector corresponding to each text; (2) calculate the cosine of the angle between the two vectors using the law of cosines; (3) the cosine of the angle between the two vectors ranges from 0 to 1, where 1 means identical and 0 means completely different, and whether the two texts are similar is judged against a user-defined threshold; (4) the threshold is set to 90%: if the value is greater than 90%, the two questions being compared are deemed identical; if it is lower than 90%, the similarity value is returned for reference by the uploader and the question reviewers, providing a basis for manual judgment.

Further, in step S8, to reduce the order of magnitude of the computation, the elements of the question bank can be further divided into sub-elements, which further reduces the number of comparisons and speeds up deduplication.

In summary: compared with traditional methods, the method provided by the invention for deduplicating examination questions during question setting based on inverse text similarity first builds a high-quality, usable corpus and word segmentation library, and uses a word segmenter with the dictionary to perform lexical analysis on the questions; calculates the term frequency (TF) of each word in a question, calculates the inverse document frequency (IDF) from the corpus, and from these calculates the TF-IDF of each word; constructs a feature vector for each question from the segmented words and their TF-IDF values, quantifying the lexical meaning and semantics of the question; and calculates the similarity of the feature vectors of each pair of questions using the law of cosines, which serves as the basis for similarity-based deduplication. Because the similarity analysis based on "inverse text similarity" relies on a corpus that can be continually trained and updated, it provides a more accurate and more scientific similarity measure for duplicates caused by both textual and semantic similarity. In addition, questions uploaded by multiple setters can first pass through a temporary question bank for preliminary deduplication before the deduplicated questions are uploaded to the formal bank, reducing the repetition rate or similarity of uploaded questions. Classifying the question bank by key elements and sub-elements improves deduplication efficiency. The invention is simple and reasonable in design and easy to operate; it reduces manual labor, improves deduplication efficiency, effectively prevents similar questions from being uploaded, and improves the quality of question review. It is safe, stable, and widely applicable, and is suitable for popularization.

Finally, it should be noted that the foregoing is only a preferred embodiment of the present invention and is not intended to limit it. Although the invention has been described in detail with reference to the foregoing embodiments, those skilled in the art can still modify the technical solutions described in those embodiments or substitute equivalents for some of their technical features. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within its protection scope.

Claims

1. A method for deduplicating examination questions during question setting based on inverse text similarity, characterized by comprising the following steps:

S1. First, when a question setter uploads questions to the original question bank system, every uploaded question must undergo duplicate checking, to prevent identical or similar questions from being uploaded;

S2. For duplicate checking, the original question bank is classified by element, that is, by the domain to which each question belongs. During comparison it is then only necessary to determine the element corresponding to the question being checked and to compare its similarity, one by one, with each question in the corresponding element category of the original bank;

S3. To judge the similarity of two questions, a high-quality, usable corpus and word segmentation library are first built from the self-built question bank and publicly available question banks, and a word segmenter is used with the dictionary to perform lexical analysis on the questions;

S4. The term frequency (TF) of each word in a question is then calculated, the inverse document frequency (IDF) is calculated from the corpus, and the TF-IDF of each word is calculated from these;

S5. A feature vector is then constructed for each question from the segmented words and their TF-IDF values, quantifying the lexical meaning and semantics of the question;

S6. Finally, the similarity of the feature vectors of each pair of questions being compared is calculated using the law of cosines, and this serves as the basis for similarity-based deduplication;

S7. When multiple question setters upload their questions to the original question bank at the same time, all questions submitted by the setters within a given period are uploaded to a temporary question bank system. Each newly uploaded question must first be compared with the questions already in the temporary bank; if questions uploaded to the temporary bank by two setters at the same time are similar or identical, the question uploaded first prevails;

S8. The questions in the temporary bank are then checked for duplicates against the questions in the formal question bank system, and are then merged into the formal question bank system.

2. The method for deduplicating examination questions during question setting based on inverse text similarity according to claim 1, characterized in that: in step S1, when a question setter uploads multiple questions, the setter's own questions must first be checked against one another for duplicates, so that the setter's uploads do not contain similar or same-type questions.

3. The method for deduplicating examination questions during question setting based on inverse text similarity according to claim 1, characterized in that the calculation in step S4 proceeds as follows: 1) calculate the term frequency: term frequency = number of occurrences of a word in the question / total number of words in the question; 2) calculate the inverse document frequency (IDF): inverse document frequency = log(total number of questions in the corpus / (number of questions containing the word + 1)); 3) calculate the term frequency-inverse document frequency (TF-IDF): TF-IDF = term frequency × inverse document frequency.

4. The method for deduplicating examination questions during question setting based on inverse text similarity according to claim 1, characterized in that: in step S5 the text is represented as a numerical feature vector, which is achieved mainly in two steps: 1. create a unique label for each word across the entire question set (which contains many questions); 2. construct a feature vector for each question, consisting mainly of the number of times each word occurs in that question.

5. The method for deduplicating examination questions during question setting based on inverse text similarity according to claim 1, characterized in that step S6 is divided into four steps: (1) count the frequency of every word occurring in the two texts being compared, to obtain the vector corresponding to each text; (2) calculate the cosine of the angle between the two vectors using the law of cosines; (3) the cosine of the angle between the two vectors ranges from 0 to 1, where 1 means identical and 0 means completely different, and whether the two texts are similar is judged against a user-defined threshold; (4) the threshold is set to 90%: if the value is greater than 90%, the two questions being compared are deemed identical; if it is lower than 90%, the similarity value is returned for reference by the uploader and the question reviewers, providing a basis for manual judgment.

6. The method for deduplicating examination questions during question setting based on inverse text similarity according to claim 1, characterized in that: in step S8, to reduce the order of magnitude of the computation, the elements of the question bank can be further divided into sub-elements, which further reduces the number of comparisons and speeds up deduplication.