CN110390019A

CN110390019A - Clustering method, deduplication method, and system for test questions

Info

Publication number: CN110390019A
Application number: CN201910680927.7A
Authority: CN
Inventors: 谢楚鹏; 李可佳; 郭晨阳
Original assignee: Jiangsu Qusu Education Technology Co Ltd
Current assignee: Jiangsu Qusu Education Technology Co Ltd
Priority date: 2019-07-26
Filing date: 2019-07-26
Publication date: 2019-10-29

Abstract

The invention discloses a clustering method, a deduplication method, and a system for test questions. The clustering method comprises: selecting a cluster-center question from all test questions participating in clustering; determining the important key characters of the cluster-center question, denoted the first character string, and the important key characters of a question to be clustered, denoted the second character string, where important key characters are characters whose insertion, replacement, or modification can change the meaning or type of a question; calculating the weighted edit distance between the first character string and the second character string, the weighted edit distance being the minimum weighted number of operations needed to convert one string into the other; calculating the similarity between the question to be clustered and the cluster-center question from the weighted edit distance; and assigning each question to be clustered whose similarity is greater than a preset threshold to the same question class as the cluster-center question. The invention enables efficient clustering of large numbers of test questions.

Description

Clustering method, deduplication method, and system for test questions

Technical field

The present invention relates to the field of educational technology, and more particularly to a clustering method, a deduplication method, and a system for test questions.

Background

Different suppliers of test questions in the education sector, such as examination centers, teaching-aid publishers, training organizations, and the teachers who set questions at individual schools, provide large numbers of test questions. As digital information is applied in education, these suppliers also deliver questions to users through online platforms or terminal software. These large collections inevitably contain many questions of the same type or with high similarity to one another.

Providing a clustering method, a deduplication method, and a system for test questions that cluster large numbers of questions efficiently is therefore a technical problem that urgently needs to be solved in this field.

Summary of the invention

In view of this, the present invention provides a clustering method, a deduplication method, and a system for test questions that solve the above technical problem.

In a first aspect, the invention provides a clustering method for test questions, comprising:

selecting a cluster-center question from all test questions participating in clustering;

determining the important key characters of the cluster-center question, denoted the first character string, and the important key characters of a question to be clustered, denoted the second character string, where important key characters are characters whose insertion, replacement, or modification can change the meaning or type of a question;

calculating the weighted edit distance between the first character string and the second character string, the weighted edit distance being the minimum weighted number of operations needed to convert one string into the other;

calculating the similarity between the question to be clustered and the cluster-center question from the weighted edit distance, where the similarity r is calculated as:

r = (sum - dist) / sum, where sum is the total length of the first character string and the second character string, and dist is the weighted edit distance;

assigning each question to be clustered whose similarity is greater than a preset threshold to the same question class as the cluster-center question.

Optionally, before the step of selecting a cluster-center question from all test questions participating in clustering, the method further comprises:

unifying the question format, which includes:

performing classification, recognition, and content parsing on HTML question files that contain different character formats or formula images, and converting them into LaTeX question text;

converting the LaTeX question text into a normally readable text format.

Optionally, the step of selecting a cluster-center question from all test questions participating in clustering specifically comprises:

ranking all test questions participating in clustering according to their creation time and quality evaluation;

selecting the first-ranked question as the cluster-center question.

Optionally, the step of determining the important key characters of the cluster-center question, denoted the first character string, and the important key characters of the question to be clustered, denoted the second character string, comprises:

building an important keyword library using a term frequency-inverse document frequency (TF-IDF) model;

determining the first character string and the second character string according to the important keyword library.

Optionally, the operations of the weighted edit distance are insertion, deletion, and replacement. When the weighted number of operations is calculated, a deletion counts as one operation, an insertion counts as one operation, and a replacement counts as two operations.

In a second aspect, the invention also provides a deduplication method for test questions, comprising: clustering the questions in a question group to be deduplicated using any of the clustering methods provided by the invention;

deleting the questions that belong to the same question class as the cluster-center question.

In a third aspect, the invention provides a clustering system for test questions, comprising: a cluster-center question selection module, an important key character determination module, a weighted edit distance calculation module, a similarity calculation module, and a question classification module; wherein

the cluster-center question selection module is connected to the important key character determination module, and is used to select a cluster-center question from all test questions participating in clustering and to send the selected cluster-center question to the important key character determination module;

the important key character determination module is connected to the weighted edit distance calculation module, and is used to determine the important key characters of the cluster-center question, denoted the first character string, to determine the important key characters of a question to be clustered, denoted the second character string, and to send the first character string and the second character string to the weighted edit distance calculation module, where important key characters are characters whose insertion, replacement, or modification can change the meaning or type of a question;

the weighted edit distance calculation module is connected to the similarity calculation module, and is used to calculate the weighted edit distance between the first character string and the second character string and to send it to the similarity calculation module, the weighted edit distance being the minimum weighted number of operations needed to convert one string into the other;

the similarity calculation module is connected to the question classification module, and is used to calculate the similarity between the question to be clustered and the cluster-center question from the weighted edit distance and to send the similarity to the question classification module, where the similarity r is calculated as r = (sum - dist) / sum, sum is the total length of the first character string and the second character string, and dist is the weighted edit distance;

the question classification module is used to assign each question to be clustered whose similarity is greater than a preset threshold to the same question class as the cluster-center question.

Optionally, the system further comprises a format unification module, which is connected to both the cluster-center question selection module and the important key character determination module. It is used to perform classification, recognition, and content parsing on HTML question files that contain different character formats or formula images and convert them into LaTeX question text, and also to convert the LaTeX question text into a normally readable text format.

Optionally, the important key character determination module further comprises a character library construction module and a character string determination module, wherein

the character library construction module is used to build an important keyword library using the term frequency-inverse document frequency (TF-IDF) technique;

the character string determination module is used to determine the first character string and the second character string according to the important keyword library.

In a fourth aspect, the invention provides a deduplication system for test questions, comprising any of the clustering systems provided by the invention and further comprising a question deduplication module. The question deduplication module is connected to the question classification module, and is used to receive the classification results sent by the question classification module and to delete the questions that belong to the same question class as the cluster-center question.

Compared with the prior art, the clustering method, deduplication method, and system for test questions provided by the invention achieve at least the following beneficial effects:

(1) The clustering method provided by the invention first selects a cluster-center question from all test questions participating in clustering. It then uses the important keywords in the cluster-center question and in the question to be clustered as weights, calculates the weighted edit distance between the important keywords of the two questions, and from it calculates their similarity in order to cluster them. Because similarity is evaluated from an edit distance computed between shorter character strings, the calculation is simpler and large numbers of questions can be clustered efficiently.

(2) With the clustering method provided by the invention, setting the preset threshold used when judging similarity makes it possible to cluster out the questions that are highly similar to the cluster-center question. Such questions can almost always be judged to be duplicates of the cluster-center question. Removing them and retaining only the cluster-center question allows large numbers of questions to be deduplicated efficiently and accurately.

Of course, a product implementing the invention need not achieve all of the above technical effects at the same time.

Other features and advantages of the invention will become apparent from the following detailed description of exemplary embodiments with reference to the drawings.

Brief description of the drawings

The accompanying drawings, which are incorporated in and constitute part of the specification, illustrate embodiments of the invention and, together with the description, serve to explain its principles.

Fig. 1 is a flow chart of the clustering method for test questions provided by an embodiment of the invention;

Fig. 2 is a schematic diagram of question clusters generated by the clustering method provided by the invention;

Fig. 3 is a flow chart of the deduplication method for test questions provided by an embodiment of the invention;

Fig. 4 is a first block diagram of the clustering system for test questions provided by an embodiment of the invention;

Fig. 5 is a second block diagram of the clustering system for test questions provided by an embodiment of the invention;

Fig. 6 is a block diagram of the deduplication system for test questions provided by an embodiment of the invention.

Detailed description of embodiments

Various exemplary embodiments of the invention will now be described in detail with reference to the drawings. It should be noted that, unless specifically stated otherwise, the relative arrangement of components and steps, the numerical expressions, and the numerical values set forth in these embodiments do not limit the scope of the invention.

The following description of at least one exemplary embodiment is merely illustrative and in no way limits the invention, its application, or its uses.

Techniques, methods, and apparatus known to those of ordinary skill in the relevant art may not be discussed in detail, but where appropriate they should be regarded as part of the specification.

In all examples shown and discussed here, any specific value should be interpreted as merely illustrative and not as a limitation. Other examples of the exemplary embodiments may therefore have different values.

It should also be noted that similar reference numerals and letters denote similar items in the following drawings; once an item has been defined in one drawing, it need not be discussed further in subsequent drawings.

In the related art, a small number of test questions can be compared one by one using edit distance to judge how similar the questions are, and highly similar questions can then be removed. Calculating that edit distance usually requires counting the operations needed to convert the full text of one question into the full text of the other, which is relatively expensive. The invention therefore provides a clustering method that uses the important keywords in a question as weights and evaluates similarity from a weighted edit distance. Because the edit distance is computed between shorter character strings, the calculation is simpler, large numbers of questions can be clustered efficiently, and the same method can further be used to deduplicate questions.

The invention provides a clustering method for test questions. Fig. 1 is a flow chart of the clustering method provided by an embodiment of the invention. As shown in Fig. 1, the clustering method comprises:

Step S101: select a cluster-center question from all test questions participating in clustering.

Optionally, all test questions participating in clustering can be given a combined ranking according to their creation time and quality evaluation, and the first-ranked question is selected as the cluster-center question. The quality evaluation can be measured by the difficulty value and the discrimination of a question. The difficulty value is the average score rate of test takers on the question; it is usually expected to lie in the interval from 0.2 to 0.8, endpoints included, and a question inside this interval receives a higher quality grade. Discrimination refers to how well a question distinguishes between test takers' levels of knowledge mastery; the higher the discrimination, the higher the quality grade. The combined ranking uses both creation time and quality evaluation, with quality evaluation taking precedence over creation time. For example, when two questions have the same creation time, the one with the higher quality grade is ranked ahead.
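The ranking described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the `Question` fields, the `quality_grade` scoring rule, and the choice to rank earlier creation dates first among equally graded questions are all hypothetical, since the text only says that quality takes precedence over creation time.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Question:
    qid: str
    created: date
    difficulty: float      # average score rate of test takers, between 0 and 1
    discrimination: float  # higher means the question separates test takers better


def quality_grade(q: Question) -> float:
    # Hypothetical scoring rule: a difficulty inside [0.2, 0.8] earns one point,
    # and the discrimination value is added on top.
    in_range = 1.0 if 0.2 <= q.difficulty <= 0.8 else 0.0
    return in_range + q.discrimination


def pick_cluster_center(questions: list[Question]) -> Question:
    # Quality grade first (descending); creation date breaks ties
    # (earlier first, which is an assumption of this sketch).
    ranked = sorted(questions, key=lambda q: (-quality_grade(q), q.created))
    return ranked[0]


questions = [
    Question("a", date(2019, 1, 1), 0.5, 0.3),
    Question("b", date(2019, 1, 1), 0.5, 0.6),
    Question("c", date(2018, 1, 1), 0.9, 0.6),
]
center = pick_cluster_center(questions)
```

Here questions "a" and "b" share a creation date, so the one with the higher quality grade is ranked ahead, as in the example given in the text.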

Step S102: determine the important key characters of the cluster-center question, denoted the first character string, and determine the important key characters of the question to be clustered, denoted the second character string. Important key characters are characters whose insertion, replacement, or modification can change the meaning or type of a question.

Take the following mathematics question as an example: given the vectors a→ = (cos(3x/4), sin(3x/4)) and b→ = (cos(x/4 + π/3), -sin(x/4 + π/3)), let f(x) = (a→ + b→)^2. (1) Find the analytic expression of f(x) and its monotonically increasing intervals; (2) if x ∈ [-π/6, 5π/6], find the maximum and minimum values of f(x); (3) if f(x) = 5/2, find the value of sin(x - π/6).

In this question, the word "vector" and the vector symbol "→" are significant markers of the knowledge point being tested. Inserting, replacing, or deleting important keywords of this kind significantly changes the meaning and category of the question. The invention therefore gives such important keywords a larger weight in the weighted edit distance.

Optionally, the invention builds the important keyword library using a term frequency-inverse document frequency (TF-IDF) model. A large number of questions from the same subject (for example, 1,000,000 questions) serves as the data basis. The TF-IDF model picks out the important keywords in these questions, forming an important keyword library that covers essentially all knowledge points of the subject. The first character string in the cluster-center question and the second character string in the question to be clustered are then determined according to this library, and the two strings serve as the weights of the weighted edit distance. Selecting important keywords with a model trained on a large number of questions ensures that the keywords are chosen accurately, which in turn ensures the accuracy of the subsequent similarity calculation.
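One way to picture the library construction is the following sketch, which uses only the Python standard library. It is an assumption-laden illustration: the text does not say how questions are tokenized, how many keywords are kept per question, or how the key string is assembled, so the pre-tokenized input, the `top_k` parameter, and the helper names `build_keyword_library` and `key_string` are hypothetical.

```python
import math
from collections import Counter


def build_keyword_library(questions: list[list[str]], top_k: int = 2) -> set[str]:
    """Collect the top_k highest TF-IDF tokens of every question into one library."""
    n = len(questions)
    doc_freq = Counter()
    for tokens in questions:
        doc_freq.update(set(tokens))  # count each token once per question

    library: set[str] = set()
    for tokens in questions:
        if not tokens:
            continue
        term_freq = Counter(tokens)
        scores = {
            t: (count / len(tokens)) * math.log(n / doc_freq[t])
            for t, count in term_freq.items()
        }
        # Highest score first; ties are broken alphabetically for determinism.
        best = sorted(scores, key=lambda t: (-scores[t], t))[:top_k]
        library.update(t for t in best if scores[t] > 0)
    return library


def key_string(tokens: list[str], library: set[str]) -> list[str]:
    # Keep only the tokens found in the library, preserving their order.
    return [t for t in tokens if t in library]


corpus = [
    ["find", "the", "vector", "sum"],
    ["find", "the", "function", "maximum"],
    ["find", "the", "triangle", "area"],
]
library = build_keyword_library(corpus)
first_string = key_string(corpus[0], library)
```

Tokens that occur in every question, such as "find" and "the" in this toy corpus, receive a TF-IDF score of zero and are excluded, while subject-specific tokens such as "vector" are retained.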

Step S103: calculate the weighted edit distance between the first character string and the second character string. The weighted edit distance is the minimum weighted number of operations needed to convert one string into the other.

Optionally, in the clustering method provided by the invention, the operations of the weighted edit distance are insertion, deletion, and replacement. When the weighted number of operations is calculated, a deletion counts as one operation, an insertion counts as one operation, and a replacement counts as two operations. The weights of the weighted edit distance are applied to the important keywords in a question. Because important keywords affect the meaning or type of a question, counting a replacement as two operations increases the influence that replacing an important keyword has on the operation count and improves the accuracy of the subsequent similarity calculation.
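The operation costs above fit the standard dynamic-programming formulation of edit distance. The following is a minimal sketch assuming that formulation; the function name and parameter names are illustrative and do not come from the patent.

```python
def weighted_edit_distance(a, b, insert_cost=1, delete_cost=1, replace_cost=2):
    """Minimum weighted number of operations that turns sequence a into sequence b."""
    # prev[j] holds the distance between the first i-1 items of a and the first j items of b.
    prev = [j * insert_cost for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        cur = [i * delete_cost] + [0] * len(b)
        for j in range(1, len(b) + 1):
            same = a[i - 1] == b[j - 1]
            cur[j] = min(
                prev[j] + delete_cost,                       # delete a[i-1]
                cur[j - 1] + insert_cost,                    # insert b[j-1]
                prev[j - 1] + (0 if same else replace_cost)  # keep or replace
            )
        prev = cur
    return prev[len(b)]


d_replace = weighted_edit_distance("abc", "abd")  # one replacement
d_delete = weighted_edit_distance("abc", "ab")    # one deletion
```

Because the function only indexes and compares items, it accepts plain strings as well as lists of keyword tokens.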

Step S104: calculate the similarity between the question to be clustered and the cluster-center question from the weighted edit distance, where the similarity r is calculated as:

r = (sum - dist) / sum, where sum is the total length of the first character string and the second character string, and dist is the weighted edit distance.

Step S105: assign each question to be clustered whose similarity is greater than the preset threshold to the same question class as the cluster-center question.

Further, the clustering method also comprises: when the similarities of all questions to be clustered are less than or equal to the preset threshold, treating the cluster-center question as a question class of its own.

Optionally, the preset threshold is 0.8. When the calculated similarity is greater than 0.8, the question to be clustered and the cluster-center question are assigned to the same question class; when the calculated similarity is less than or equal to 0.8, they are assigned to different question classes.
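The similarity formula and the threshold test can be written down directly. In this sketch the function names are illustrative, and returning 1.0 when both strings are empty is an assumption made to avoid division by zero; the text does not address that case.

```python
def similarity(len_first: int, len_second: int, dist: float) -> float:
    """r = (sum - dist) / sum, where sum is the combined length of both strings."""
    total = len_first + len_second
    if total == 0:
        return 1.0  # assumption: two empty strings are treated as identical
    return (total - dist) / total


def same_class(r: float, threshold: float = 0.8) -> bool:
    # Strictly greater than the threshold, as the text specifies.
    return r > threshold


# Two 10-character key strings that differ by one replacement (weighted distance 2).
r = similarity(10, 10, 2)
```

With the default threshold of 0.8, this pair is placed in the same question class, whereas a pair whose similarity is exactly 0.8 is not.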

Steps S102 to S105 describe how, once a cluster-center question has been selected, the similarity of one question to be clustered is calculated and the question is classified. After steps S102 to S105 have been performed for all questions to be clustered, a question cluster centered on the cluster-center question is produced, and the questions in this cluster have a high similarity to the cluster-center question.

Further, after a question cluster centered on the cluster-center question has been generated, the remaining questions that were not assigned to that cluster can undergo the next round of clustering. Optionally, another cluster-center question is selected from the remaining questions and steps S102 to S105 are executed again, producing another question cluster. This is repeated until all questions to be clustered have been clustered.

Each cluster-center question is surrounded by questions whose similarity to it is higher than the preset threshold, forming a similar-question cluster. The number of questions in a cluster varies with the actual question data and the preset threshold. Fig. 2 is a schematic diagram of question clusters generated by the clustering method provided by the invention. As shown in Fig. 2, similar-question cluster (a) and similar-question cluster (b) use different preset thresholds. The cluster-center question of cluster (a) is A, and the cluster-center question of cluster (b) is H. The numbers in the figure are the similarities between each question and its cluster-center question.
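The repeated selection of a center followed by steps S102 to S105 amounts to a greedy loop, sketched below. The sketch assumes that the key strings have already been extracted and ranked so that the preferred cluster-center candidate comes first; the function names and the returned `(center_index, member_indices)` structure are illustrative choices.

```python
def weighted_edit_distance(a, b):
    # Insertion and deletion cost 1, replacement costs 2.
    prev = list(range(len(b) + 1))
    for i in range(1, len(a) + 1):
        cur = [i] + [0] * len(b)
        for j in range(1, len(b) + 1):
            keep_or_replace = prev[j - 1] + (0 if a[i - 1] == b[j - 1] else 2)
            cur[j] = min(prev[j] + 1, cur[j - 1] + 1, keep_or_replace)
        prev = cur
    return prev[len(b)]


def similarity(a, b):
    total = len(a) + len(b)
    return 1.0 if total == 0 else (total - weighted_edit_distance(a, b)) / total


def cluster(key_strings, threshold=0.8):
    """Return a list of (center_index, member_indices) pairs."""
    remaining = list(range(len(key_strings)))
    clusters = []
    while remaining:
        center, rest = remaining[0], remaining[1:]
        # Compare every remaining question with the current center only.
        members = [i for i in rest
                   if similarity(key_strings[center], key_strings[i]) > threshold]
        clusters.append((center, members))
        assigned = set(members)
        remaining = [i for i in rest if i not in assigned]
    return clusters


keys = ["abcdefghij", "abcdefghi", "xyz", "xyzw", "abcdefghijk"]
result = cluster(keys)
```

A center with no sufficiently similar questions ends up as a cluster on its own, which matches the rule that such a cluster-center question forms its own question class. For deduplication, only the center of each pair would be retained.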

It should be noted, taking similar-question cluster (a) as an example, that question C and question P have the same similarity to question A (both 0.9), yet the similarity between question C and question P is not necessarily high. Similarity is not transitive in the clustering method provided by the invention. The invention only calculates the weighted edit distance between each question to be clustered and the cluster-center question, and so only guarantees their similarity to the cluster-center question; it neither calculates nor guarantees the weighted edit distance or similarity between the questions to be clustered themselves.

Further, before the step of selecting a cluster-center question from all test questions participating in clustering, the method also comprises unifying the question format, which includes: performing classification, recognition, and content parsing on HTML question files that contain different character formats or formula images and converting them into LaTeX question text; and converting the LaTeX question text into a normally readable text format.

The questions participating in clustering contain not only text, figures, and punctuation. Mathematics questions in particular also contain many complicated formulas, which may come from different formula editors. In addition, individual question suppliers sometimes submit text in formats of their own definition, which contain many errors and much noise. All of this makes it considerably harder to compare questions. To make full use of a large number of questions, the invention first unifies the format of the questions participating in clustering by converting them into a normally readable text format, so that questions in all the different formats can be used.
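A small worked example makes the lack of transitivity concrete. The three key strings below are invented for illustration and are not taken from Fig. 2; the distance uses the costs stated earlier (insertion 1, deletion 1, replacement 2).

```python
def weighted_edit_distance(a, b):
    # Insertion and deletion cost 1, replacement costs 2.
    prev = list(range(len(b) + 1))
    for i in range(1, len(a) + 1):
        cur = [i] + [0] * len(b)
        for j in range(1, len(b) + 1):
            keep_or_replace = prev[j - 1] + (0 if a[i - 1] == b[j - 1] else 2)
            cur[j] = min(prev[j] + 1, cur[j - 1] + 1, keep_or_replace)
        prev = cur
    return prev[len(b)]


def similarity(a, b):
    total = len(a) + len(b)
    return (total - weighted_edit_distance(a, b)) / total


center_a = "abcdefghij"    # hypothetical key string of the cluster center
question_c = "Xbcdefghij"  # first character replaced
question_p = "abcdefghiY"  # last character replaced

r_ac = similarity(center_a, question_c)    # (20 - 2) / 20
r_ap = similarity(center_a, question_p)    # (20 - 2) / 20
r_cp = similarity(question_c, question_p)  # (20 - 4) / 20
```

Both questions have similarity 0.9 to the center and would join its cluster under a threshold of 0.8, but their similarity to each other is only 0.8, which does not exceed that threshold.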

Based on the same inventive concept, the invention also provides a deduplication method for test questions. Fig. 3 is a flow chart of the deduplication method provided by an embodiment of the invention. As shown in Fig. 3, the method comprises:

Step S301: cluster the questions in the question group to be deduplicated using any of the clustering methods provided by the invention;

Step S302: delete the questions that belong to the same question class as the cluster-center question.

With the clustering method provided by the invention, setting the preset threshold used when judging similarity makes it possible to cluster out the questions that are highly similar to the cluster-center question. Such questions can almost always be judged to be duplicates of the cluster-center question. Removing them and retaining only the cluster-center question allows large numbers of questions to be deduplicated efficiently and accurately.

Based on the same inventive concept, the invention also provides a clustering system for test questions. Fig. 4 is a first block diagram of the clustering system provided by an embodiment of the invention. As shown in Fig. 4, the system comprises: a cluster-center question selection module 11, an important key character determination module 12, a weighted edit distance calculation module 13, a similarity calculation module 14, and a question classification module 15; wherein

the cluster-center question selection module 11 is connected to the important key character determination module 12, and is used to select a cluster-center question from all test questions participating in clustering and to send the selected cluster-center question to the important key character determination module 12;

the important key character determination module 12 is connected to the weighted edit distance calculation module 13, and is used to determine the important key characters of the cluster-center question, denoted the first character string, to determine the important key characters of the question to be clustered, denoted the second character string, and to send the first character string and the second character string to the weighted edit distance calculation module 13, where important key characters are characters whose insertion, replacement, or modification can change the meaning or type of a question;

the weighted edit distance calculation module 13 is connected to the similarity calculation module 14, and is used to calculate the weighted edit distance between the first character string and the second character string and to send it to the similarity calculation module 14, the weighted edit distance being the minimum weighted number of operations needed to convert one string into the other;

the similarity calculation module 14 is connected to the question classification module 15, and is used to calculate the similarity between the question to be clustered and the cluster-center question from the weighted edit distance and to send the similarity to the question classification module 15, where the similarity r is calculated as r = (sum - dist) / sum, sum is the total length of the first character string and the second character string, and dist is the weighted edit distance;

the question classification module 15 is used to assign each question to be clustered whose similarity is greater than the preset threshold to the same question class as the cluster-center question.

In one embodiment, Fig. 5 is the clustering system block diagram two of examination question provided in an embodiment of the present invention, as shown in figure 5, The clustering system of examination question further includes uniform format module 16, uniform format module 16, with cluster centre examination question choose module 11 and Important key character determining module 12 is respectively connected with, for the htm examination question comprising kinds of characters format or formula picture File carries out Classification and Identification and Context resolution, is converted into latex examination question text；It is also used to latex examination question text conversion at can The text formatting of normal reading.It participates in the examination question of cluster including not only text, figure, punctuation mark, particularly with mathematics section It further include many complicated formula for purpose examination question, in mathematics examination question, these formula are probably derived from different equation editings Device, alternatively, each examination question supplier sometimes can respectively submit the text formattings of oneself some definition, the inside have many mistakes and Noise causes the comparison between examination question larger difficulty occur.In order to adequately be utilized to a large amount of examination question, this hair The unification for carrying out format in bright to the examination question for participating in cluster first, be converted into can normal reading text formatting, can be realized pair The utilization of all different-format examination questions.

The important key character determining module 12 further includes a character library construction module 121 and a character string determining module 122, wherein the character library construction module 121 is used to construct an important key character library using the term frequency-inverse document frequency (TF-IDF) technique, and the character string determining module 122 is used to determine the first character string and the second character string according to the important key character library.
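One possible way to realise the two sub-modules is sketched below. It is an assumption-laden illustration, not the construction used by the invention: scoring each character by its summed TF-IDF over all questions and keeping the `top_k` highest-scoring characters is a choice made here, and both function names are hypothetical.

```python
import math
from collections import Counter


def build_key_character_library(questions, top_k):
    """Rank characters by summed TF-IDF over all questions; keep the top_k."""
    n_docs = len(questions)
    # Document frequency: in how many questions each character occurs.
    doc_freq = Counter()
    for question in questions:
        doc_freq.update(set(question))
    scores = Counter()
    for question in questions:
        for ch, count in Counter(question).items():
            tf = count / len(question)
            idf = math.log(n_docs / doc_freq[ch])
            scores[ch] += tf * idf
    # Ties are broken by character order so the result is deterministic.
    ranked = sorted(scores, key=lambda ch: (-scores[ch], ch))
    return set(ranked[:top_k])


def key_character_string(question, library):
    """Keep only the library characters of a question, in original order."""
    return "".join(ch for ch in question if ch in library)
```

With the toy corpus `["1+2=", "1-2=", "1+3="]`, the characters `1` and `=` occur in every question and receive a score of zero, so a library of size 4 consists of `+`, `-`, `2`, and `3`, and the first question is reduced to the string `+2`.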

Based on the same inventive concept, the present invention provides a deduplication system for examination questions. Fig. 6 is a block diagram of the deduplication system for examination questions provided by an embodiment of the present invention. As shown in Fig. 6, the system includes any one of the clustering systems for examination questions provided by the present invention and further includes an examination question deduplication module 17. The examination question deduplication module 17 is connected to the examination question classifying module 15 and is used to receive the examination question classification results sent by the examination question classifying module 15 and to delete the examination questions that belong to the same examination question class as the cluster centre examination question.
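The behaviour of the deduplication module can be illustrated with the short sketch below. The function and parameter names are hypothetical, the similarity measure is passed in as a callable rather than fixed, and retaining the cluster centre examination question itself is an assumption drawn from the wording above.

```python
def deduplicate(questions, centre_index, similarity, threshold):
    """Keep the cluster centre; drop every other question whose similarity
    to the centre is greater than the threshold (same question class)."""
    centre = questions[centre_index]
    kept = []
    for index, question in enumerate(questions):
        if index != centre_index and similarity(question, centre) > threshold:
            continue  # classified into the centre's class, so deleted
        kept.append(question)
    return kept
```

In practice the `similarity` argument would be the measure r = (sum - dist) / sum computed on the important key character strings; any function returning a value comparable with the threshold can be substituted for experimentation.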

It can be seen from the foregoing embodiments that the clustering method, deduplication method, and system for examination questions provided by the present invention achieve at least the following beneficial effects:

Although some specific embodiments of the present invention have been described in detail by way of example, those skilled in the art should understand that the above examples are for illustration only and are not intended to limit the scope of the present invention. Those skilled in the art should also understand that the above embodiments may be modified without departing from the scope and spirit of the present invention. The scope of the present invention is defined by the appended claims.

Claims

1. A clustering method for examination questions, characterized by comprising:

determining the important key characters of the cluster centre examination question and recording them as a first character string, and determining the important key characters of an examination question to be clustered and recording them as a second character string, wherein the important key characters are characters whose addition, replacement, or modification can change the meaning or type of the examination question;

calculating the weighted edit distance between the first character string and the second character string, the weighted edit distance being the minimum weighted number of operations required to convert the first character string and the second character string into each other;

calculating the similarity between the examination question to be clustered and the cluster centre examination question according to the weighted edit distance, wherein the similarity r is calculated as:

r = (sum - dist) / sum, where sum is the total length of the first character string and the second character string, and dist is the weighted edit distance;

2. The clustering method according to claim 1, characterized in that,

before the step of choosing the cluster centre examination question from all examination questions participating in clustering, the method further comprises:

unifying the examination question format, which comprises:

classifying, recognizing, and parsing the content of htm examination question files containing different character formats or formula pictures, and converting them into LaTeX examination question text;

3. The clustering method according to claim 1, characterized in that,

the step of choosing the cluster centre examination question from all examination questions participating in clustering specifically comprises:

4. The clustering method according to claim 1, characterized in that,

the step of determining the important key characters of the cluster centre examination question and recording them as the first character string, and determining the important key characters of the examination question to be clustered and recording them as the second character string, comprises:

5. The clustering method according to claim 1, characterized in that,

the operations of the weighted edit distance include insertion, deletion, and replacement; wherein, when the weighted number of operations is calculated, a deletion is counted as one operation, an insertion is counted as one operation, and a replacement is counted as two operations.

6. A deduplication method for examination questions, characterized by comprising:

clustering the examination questions in an examination question group to be deduplicated using the clustering method for examination questions according to any one of claims 1 to 5;

7. A clustering system for examination questions, characterized by comprising: a cluster centre examination question selection module, an important key character determining module, a weighted edit distance computing module, a similarity calculation module, and an examination question classifying module; wherein,

the cluster centre examination question selection module is connected to the important key character determining module, and is used to choose a cluster centre examination question from all examination questions participating in clustering and to send the chosen cluster centre examination question to the important key character determining module;

the important key character determining module is connected to the weighted edit distance computing module, and is used to determine the important key characters of the cluster centre examination question and record them as a first character string, to determine the important key characters of an examination question to be clustered and record them as a second character string, and to send the first character string and the second character string to the weighted edit distance computing module, wherein the important key characters are characters whose addition, replacement, or modification can change the meaning or type of the examination question;

the weighted edit distance computing module is connected to the similarity calculation module, and is used to calculate the weighted edit distance between the first character string and the second character string and to send the weighted edit distance to the similarity calculation module, the weighted edit distance being the minimum weighted number of operations required to convert the first character string and the second character string into each other;

the similarity calculation module is connected to the examination question classifying module, and is used to calculate the similarity between the examination question to be clustered and the cluster centre examination question according to the weighted edit distance and to send the similarity to the examination question classifying module, wherein the similarity r is calculated as: r = (sum - dist) / sum, where sum is the total length of the first character string and the second character string, and dist is the weighted edit distance;

the examination question classifying module is used to classify an examination question to be clustered whose similarity is greater than a preset threshold into the same examination question class as the cluster centre examination question.

8. The clustering system according to claim 7, characterized in that,

the system further comprises a format unification module, wherein the format unification module is connected to both the cluster centre examination question selection module and the important key character determining module, and is used to classify, recognize, and parse the content of htm examination question files containing different character formats or formula pictures and to convert them into LaTeX examination question text; it is also used to convert the LaTeX examination question text into a normally readable text format.

9. The clustering system according to claim 7, characterized in that,

the important key character determining module further comprises a character library construction module and a character string determining module, wherein

the character string determining module is used to determine the first character string and the second character string according to the important key character library.

10. A deduplication system for examination questions, characterized by comprising the clustering system for examination questions according to any one of claims 7 to 9, and further comprising an examination question deduplication module, wherein the examination question deduplication module is connected to the examination question classifying module, and is used to receive the examination question classification results sent by the examination question classifying module and to delete the examination questions that belong to the same examination question class as the cluster centre examination question.