CN110390019A - Test question clustering method, deduplication method, and system - Google Patents
Test question clustering method, deduplication method, and system
- Publication number
- CN110390019A (application number CN201910680927.7A)
- Authority
- CN
- China
- Prior art keywords
- examination question
- character
- character string
- examination
- module
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q50/00—Information and communication technology [ICT] specially adapted for implementation of business processes of specific business sectors, e.g. utilities or tourism
- G06Q50/10—Services
- G06Q50/20—Education
- G06Q50/205—Education administration or guidance
Abstract
The invention discloses a test question clustering method, a deduplication method, and corresponding systems. The clustering method comprises: selecting a cluster-center question from all questions participating in clustering; determining the important key characters of the cluster-center question, recorded as a first character string, and the important key characters of a question to be clustered, recorded as a second character string, where important key characters are characters whose addition, replacement, or modification can change the meaning or type of a question; computing the weighted edit distance between the first and second character strings, the weighted edit distance being the minimum weighted number of operations needed to convert one string into the other; computing the similarity between the question to be clustered and the cluster-center question from the weighted edit distance; and assigning each question to be clustered whose similarity exceeds a preset threshold to the same question class as the cluster-center question. The invention enables efficient clustering of large-scale question sets.
Description
Technical field
The present invention relates to the field of educational technology, and more particularly to a test question clustering method, a deduplication method, and corresponding systems.
Background
Various question suppliers in the education sector, such as examination centers, publishers of supplementary teaching materials, training organizations, and the question-setting teachers of individual schools, provide large numbers of test questions. As digital information technology spreads through education, these suppliers also deliver questions to users through online platforms or terminal software. Such large collections inevitably contain many questions of the same type or with high mutual similarity.
Providing a test question clustering method, deduplication method, and system that cluster large-scale question sets efficiently is therefore a technical problem urgently awaiting a solution in this field.
Summary of the invention
In view of this, the present invention provides a test question clustering method, a deduplication method, and corresponding systems that solve the above technical problem.
In a first aspect, the present invention provides a test question clustering method, comprising:
selecting a cluster-center question from all questions participating in clustering;
determining the important key characters of the cluster-center question, recorded as a first character string, and the important key characters of a question to be clustered, recorded as a second character string, where important key characters are characters whose addition, replacement, or modification can change the meaning or type of a question;
computing the weighted edit distance between the first character string and the second character string, the weighted edit distance being the minimum weighted number of operations needed to convert one string into the other;
computing the similarity between the question to be clustered and the cluster-center question from the weighted edit distance, where the similarity r is computed as:
r = (sum - dist) / sum, where sum is the total length of the first character string and the second character string, and dist is the weighted edit distance;
assigning each question to be clustered whose similarity exceeds a preset threshold to the same question class as the cluster-center question.
Optionally, before the step of selecting a cluster-center question from all questions participating in clustering, the method further comprises:
unifying the question format, which includes:
classifying and parsing the content of HTM question files that contain different character formats or formula images, and converting them into LaTeX question text;
converting the LaTeX question text into a normally readable text format.
Optionally, the step of selecting a cluster-center question from all questions participating in clustering specifically comprises:
ranking all questions participating in clustering according to their creation time and quality evaluation;
selecting the first-ranked question as the cluster-center question.
Optionally, the step of determining the important key characters of the cluster-center question as the first character string and the important key characters of the question to be clustered as the second character string comprises:
building an important-keyword character library using a term frequency-inverse document frequency (TF-IDF) model;
determining the first character string and the second character string according to the important-keyword character library.
Optionally, the operations of the weighted edit distance are insertion, deletion, and replacement. When the weighted operation count is computed, a deletion counts as one operation, an insertion counts as one operation, and a replacement counts as two operations.
In a second aspect, the present invention also provides a test question deduplication method, comprising: clustering the questions in a question group to be deduplicated using any of the clustering methods provided by the invention;
deleting the questions that belong to the same question class as the cluster-center question.
In a third aspect, the present invention provides a test question clustering system, comprising: a cluster-center question selection module, an important key character determination module, a weighted edit distance calculation module, a similarity calculation module, and a question classification module, wherein:
the cluster-center question selection module is connected to the important key character determination module, and is configured to select a cluster-center question from all questions participating in clustering and send the selected cluster-center question to the important key character determination module;
the important key character determination module is connected to the weighted edit distance calculation module, and is configured to determine the important key characters of the cluster-center question, recorded as a first character string, determine the important key characters of a question to be clustered, recorded as a second character string, and send both strings to the weighted edit distance calculation module, where important key characters are characters whose addition, replacement, or modification can change the meaning or type of a question;
the weighted edit distance calculation module is connected to the similarity calculation module, and is configured to compute the weighted edit distance between the first character string and the second character string and send it to the similarity calculation module, the weighted edit distance being the minimum weighted number of operations needed to convert one string into the other;
the similarity calculation module is connected to the question classification module, and is configured to compute the similarity between the question to be clustered and the cluster-center question from the weighted edit distance and send the similarity to the question classification module, where the similarity r is computed as r = (sum - dist) / sum, sum is the total length of the first character string and the second character string, and dist is the weighted edit distance;
the question classification module is configured to assign each question to be clustered whose similarity exceeds a preset threshold to the same question class as the cluster-center question.
Optionally, the system further comprises a format unification module connected to both the cluster-center question selection module and the important key character determination module. It is configured to classify and parse the content of HTM question files that contain different character formats or formula images and convert them into LaTeX question text, and also to convert the LaTeX question text into a normally readable text format.
Optionally, the important key character determination module further comprises a character library construction module and a character string determination module, wherein:
the character library construction module is configured to build an important-keyword character library using the term frequency-inverse document frequency (TF-IDF) technique;
the character string determination module is configured to determine the first character string and the second character string according to the important-keyword character library.
In a fourth aspect, the present invention provides a test question processing system comprising any of the test question clustering systems provided by the invention and further comprising a question deduplication module. The question deduplication module is connected to the question classification module and is configured to receive the classification results sent by the question classification module and delete the questions that belong to the same question class as the cluster-center question.
Compared with the prior art, the test question clustering method, deduplication method, and systems provided by the invention achieve at least the following beneficial effects:
(1) The clustering method first selects a cluster-center question from all questions participating in clustering, then uses the important keywords in the cluster-center question and in the question to be clustered as weights, computes the weighted edit distance between the important keywords of the two questions, and from it computes their similarity in order to cluster them. Evaluating similarity with an edit distance between shorter character strings simplifies the computation and allows large-scale question sets to be clustered efficiently.
(2) Building on the clustering method, setting the preset threshold used in the similarity judgment makes it possible to cluster out the questions that are highly similar to the cluster-center question. Such questions can almost always be regarded as duplicates of the cluster-center question; removing them leaves only the cluster-center question. The method thus achieves efficient and accurate deduplication of large-scale question sets.
Of course, a product implementing the invention need not achieve all of the above technical effects at the same time.
Other features and advantages of the invention will become apparent from the following detailed description of exemplary embodiments with reference to the drawings.
Brief description of the drawings
The drawings, which are incorporated in and constitute part of the specification, illustrate embodiments of the invention and, together with the description, serve to explain its principles.
Fig. 1 is a flow chart of the test question clustering method provided by an embodiment of the invention;
Fig. 2 is a schematic diagram of question clusters generated with the clustering method provided by the invention;
Fig. 3 is a flow chart of the test question deduplication method provided by an embodiment of the invention;
Fig. 4 is block diagram one of the test question clustering system provided by an embodiment of the invention;
Fig. 5 is block diagram two of the test question clustering system provided by an embodiment of the invention;
Fig. 6 is a block diagram of the test question processing system provided by an embodiment of the invention.
Detailed description of the embodiments
Various exemplary embodiments of the invention are now described in detail with reference to the drawings. Unless specifically stated otherwise, the relative arrangement of components and steps, the numerical expressions, and the numerical values set forth in these embodiments do not limit the scope of the invention.
The following description of at least one exemplary embodiment is merely illustrative and in no way limits the invention, its application, or its use.
Techniques, methods, and apparatus known to a person of ordinary skill in the relevant art may not be discussed in detail, but where appropriate they should be regarded as part of the specification.
In all examples shown and discussed here, any specific value should be interpreted as merely illustrative, not as a limitation. Other examples of the exemplary embodiments may therefore use different values.
Similar reference numerals and letters denote similar items in the following drawings, so once an item has been defined in one drawing it need not be discussed further in subsequent drawings.
In the related art, a small number of questions can be compared one by one using the edit distance to judge the similarity of each pair, and highly similar questions can then be removed. Computing that edit distance, however, usually requires counting the operations needed to convert the full text of one question into the full text of the other, which is a relatively complex computation. The present invention therefore provides a clustering method that uses the important keywords in the questions as weights and computes a weighted edit distance to evaluate similarity. Evaluating similarity with an edit distance between shorter character strings simplifies the computation, allows large-scale question sets to be clustered efficiently, and further allows the method to be used to deduplicate questions.
The present invention provides a test question clustering method. Fig. 1 is a flow chart of the clustering method provided by an embodiment of the invention. As shown in Fig. 1, the clustering method comprises:
Step S101: select a cluster-center question from all questions participating in clustering.
Optionally, all questions participating in clustering can be ranked comprehensively according to their creation time and quality evaluation, and the question ranked first is selected as the cluster-center question. The quality evaluation can be measured by the difficulty value and the discrimination of a question. The difficulty value is the average score rate of examinees on the question and usually needs to lie in the interval 0.2 to 0.8, endpoints included; a difficulty value within this interval gives a higher quality grade. The discrimination of a question is its ability to distinguish how well examinees have mastered the knowledge; the higher the discrimination, the higher the quality grade. The comprehensive ranking uses both creation time and quality evaluation, with quality evaluation taking precedence over creation time; for example, when creation times are identical, the question with the higher quality grade is ranked further forward.
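As an illustration of step S101, the ranking can be sketched as a sort key. This is a minimal sketch under stated assumptions: the `Question` fields, the way difficulty and discrimination are combined into one quality ordering, and the choice to rank earlier creation times first are all hypothetical, since the description does not fix them.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Question:
    # Hypothetical record layout; the patent does not define one.
    qid: str
    created: date
    difficulty: float      # average score rate of examinees, between 0 and 1
    discrimination: float  # higher means the question separates examinees better


def ranking_key(q: Question):
    # Quality first: difficulty inside [0.2, 0.8] beats difficulty outside it,
    # then higher discrimination wins. Creation time only breaks remaining ties
    # (earlier first is an assumption).
    in_range = 0.2 <= q.difficulty <= 0.8
    return (not in_range, -q.discrimination, q.created)


def choose_center(questions):
    # The first-ranked question becomes the cluster-center question.
    return min(questions, key=ranking_key)


pool = [
    Question("a", date(2019, 1, 1), 0.9, 0.9),  # difficulty outside the interval
    Question("b", date(2019, 3, 1), 0.5, 0.4),
    Question("c", date(2019, 2, 1), 0.5, 0.4),  # same quality as "b", created earlier
]
center = choose_center(pool)
```

Encoding the precedence as a tuple keeps the rule "quality before creation time" in one place, so a different quality formula only changes `ranking_key`.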
Step S102: determine the important key characters of the cluster-center question, recorded as a first character string, and the important key characters of a question to be clustered, recorded as a second character string. Important key characters are characters whose addition, replacement, or modification can change the meaning or type of a question.
Take the following mathematics question as an example: given vectors →a = (cos(3x/4), sin(3x/4)) and →b = (cos(x/4 + π/3), −sin(x/4 + π/3)), let f(x) = (→a + →b)^2. (1) Find the analytic expression of f(x) and its monotonically increasing intervals; (2) if x ∈ [−π/6, 5π/6], find the maximum and minimum of f(x); (3) if f(x) = 5/2, find the value of sin(x − π/6).
In this question, the word "vector" and the vector symbol "→" are significant markers of the knowledge point being tested. Adding, replacing, or deleting such important keywords significantly changes the meaning and category of the question. The invention therefore gives such important keywords a larger weight in the weighted edit distance.
Optionally, the invention builds the important-keyword character library with a term frequency-inverse document frequency (TF-IDF) model. A large number of questions from the same subject (for example, one million questions) serve as the data basis; the TF-IDF model picks the important keywords out of these questions, forming an important-keyword character library that essentially covers all knowledge points of the subject. The first character string in the cluster-center question and the second character string in the question to be clustered are then determined according to this library. The first and second character strings serve as the weights of the weighted edit distance. Selecting important keywords with a model over a large number of questions ensures the accuracy of keyword selection in the clustering method, and in turn the accuracy of the subsequent similarity calculation.
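The library construction described above can be sketched in plain Python. This is a sketch under assumptions, not the patent's implementation: questions are assumed to be pre-tokenized, each term is scored by its maximum TF-IDF value over the corpus, and the `top_k` cutoff is a hypothetical parameter. The description specifies none of these details.

```python
import math
from collections import Counter


def build_key_library(tokenized_questions, top_k):
    """Return the top_k terms by TF-IDF score (hypothetical selection rule)."""
    n_docs = len(tokenized_questions)
    doc_freq = Counter()
    for tokens in tokenized_questions:
        doc_freq.update(set(tokens))  # count each term once per question

    best = {}
    for tokens in tokenized_questions:
        for term, count in Counter(tokens).items():
            tf = count / len(tokens)
            idf = math.log(n_docs / doc_freq[term])
            best[term] = max(best.get(term, 0.0), tf * idf)

    # Sort by descending score, then alphabetically so the result is deterministic.
    ranked = sorted(best, key=lambda t: (-best[t], t))
    return set(ranked[:top_k])


def key_string(tokens, library):
    """Keep only the important key tokens of one question, in their original order."""
    return [t for t in tokens if t in library]


corpus = [
    ["find", "the", "vector", "sum"],
    ["find", "the", "area"],
    ["find", "the", "vector", "angle"],
]
library = build_key_library(corpus, top_k=4)
first_string = key_string(corpus[0], library)
```

Terms that occur in every question, such as "find" and "the" in this toy corpus, get an inverse document frequency of zero and fall out of the library, which is the property that makes the resulting key strings much shorter than the full question text.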
Step S103: compute the weighted edit distance between the first character string and the second character string. The weighted edit distance is the minimum weighted number of operations needed to convert one string into the other.
Optionally, in the clustering method provided by the invention, the operations of the weighted edit distance are insertion, deletion, and replacement. When the weighted operation count is computed, a deletion counts as one operation, an insertion counts as one operation, and a replacement counts as two operations. The weights of the weighted edit distance target the important keywords in the questions. Because important keywords affect the meaning or type of a question, a replacement is counted as two operations, which increases the influence of replacing an important keyword on the operation count and improves the accuracy of the subsequent similarity calculation.
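The operation costs in step S103 map directly onto the standard dynamic-programming recurrence for edit distance. The following is a minimal sketch; the function and parameter names are illustrative, and only the costs (insertion 1, deletion 1, replacement 2) come from the description.

```python
def weighted_edit_distance(first, second, ins_cost=1, del_cost=1, sub_cost=2):
    """Minimum total cost of turning `first` into `second`."""
    # prev[j] holds the cost of converting the processed prefix of `first`
    # into the first j items of `second`.
    prev = [j * ins_cost for j in range(len(second) + 1)]
    for i, x in enumerate(first, start=1):
        cur = [i * del_cost]
        for j, y in enumerate(second, start=1):
            cur.append(min(
                prev[j] + del_cost,                       # delete x
                cur[j - 1] + ins_cost,                    # insert y
                prev[j - 1] + (0 if x == y else sub_cost) # keep or replace
            ))
        prev = cur
    return prev[-1]
```

With these particular costs a replacement is never cheaper than a deletion followed by an insertion, so the distance equals the combined length of the two strings minus twice the length of their longest common subsequence. For example, "kitten" and "sitting" have an ordinary edit distance of 3 (two replacements and one insertion) but a weighted distance of 5. The function works on any sequences, so it accepts either character strings or lists of key tokens.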
Step S104: compute the similarity between the question to be clustered and the cluster-center question from the weighted edit distance, where the similarity r is computed as:
r = (sum - dist) / sum, where sum is the total length of the first character string and the second character string, and dist is the weighted edit distance.
Step S105: assign each question to be clustered whose similarity exceeds the preset threshold to the same question class as the cluster-center question.
Further, the clustering method also comprises: when the similarities of all questions to be clustered are less than or equal to the preset threshold, treating the cluster-center question as a question class of its own.
Optionally, the preset threshold is 0.8. When the computed similarity is greater than 0.8, the question to be clustered and the cluster-center question are assigned to the same question class; when the computed similarity is less than or equal to 0.8, they are placed in different question classes.
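Steps S104 and S105 reduce to one formula and one comparison. The sketch below takes the weighted edit distance as an input so that it stays self-contained; the function names are illustrative, and returning 1.0 for two empty strings is an assumption, since the description does not cover that case.

```python
def similarity(first, second, dist):
    """r = (sum - dist) / sum, with sum the combined length of both strings."""
    total = len(first) + len(second)
    if total == 0:
        return 1.0  # assumption: two empty key strings are treated as identical
    return (total - dist) / total


def same_class(r, threshold=0.8):
    """A question joins the center's class only if r is strictly above the threshold."""
    return r > threshold


# "abcd" -> "abcx" needs one replacement, which counts as two operations.
r = similarity("abcd", "abcx", dist=2)
```

Here r is (8 - 2) / 8 = 0.75, so with the optional threshold of 0.8 the two questions would fall into different classes. The comparison is strict: a similarity of exactly 0.8 does not qualify.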
Steps S102 to S105 describe how, once the cluster-center question has been selected, the similarity of one question to be clustered is computed and the question is classified. After steps S102 to S105 have been performed for all questions to be clustered, a question cluster centered on the cluster-center question is obtained, and the questions in this cluster have a relatively high similarity to the cluster-center question.
Further, after a question cluster centered on the cluster-center question has been generated, the remaining questions that were not assigned to that cluster can be clustered in the next round. Optionally, another cluster-center question is selected from the remaining questions and steps S102 to S105 are performed again, producing another question cluster. This is repeated until all questions to be clustered have been clustered.
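The round-by-round procedure above can be sketched as a loop that peels off one cluster at a time. This is a sketch under assumptions: the input list is assumed to be already ranked so that the first remaining question is the next cluster center, and `toy_similarity` is a deliberately simple stand-in used only to make the example runnable. It is not the weighted-edit-distance similarity of the patent.

```python
def toy_similarity(a, b):
    """Stand-in similarity (NOT the patent's formula): overlap of character sets."""
    sa, sb = set(a), set(b)
    total = len(sa) + len(sb)
    return 2 * len(sa & sb) / total if total else 1.0


def cluster_questions(ranked_questions, similarity, threshold=0.8):
    """Return a list of (center, members) pairs, one per question class."""
    remaining = list(ranked_questions)
    clusters = []
    while remaining:
        # Assumption: the list is pre-ranked, so the first item is the next center.
        center, candidates = remaining[0], remaining[1:]
        members, leftover = [], []
        for question in candidates:
            if similarity(center, question) > threshold:
                members.append(question)   # joins the center's class
            else:
                leftover.append(question)  # waits for a later round
        clusters.append((center, members))
        remaining = leftover
    return clusters


demo = ["abcdefgh", "abcdefgx", "uvwxyz", "q", "uvwxy"]
result = cluster_questions(demo, toy_similarity)
```

Each question is compared only with the current center, never with other candidates, so a round over n remaining questions costs n - 1 similarity computations. A center with no sufficiently similar question, such as "q" here, ends up as a class of its own.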
Each cluster-center question is surrounded by questions whose similarity to it exceeds the preset threshold, forming a similar-question cluster. The size of a similar-question cluster, that is, the number of questions in it, varies with the actual question data and the preset threshold. Fig. 2 is a schematic diagram of question clusters generated with the clustering method provided by the invention. As shown in Fig. 2, similar-question cluster (a) and similar-question cluster (b) use different preset thresholds. The cluster-center question of cluster (a) is A, and the cluster-center question of cluster (b) is H. The numbers in the figure are the similarities between each question and its cluster-center question.
Taking similar-question cluster (a) as an example, questions C and P have the same similarity to question A (0.9), but the similarity between C and P is not necessarily high. Similarity is not transitive in the clustering method provided by the invention: the method only computes the weighted edit distance between each question to be clustered and the cluster-center question, and guarantees only the similarity to the cluster-center question. It neither computes nor guarantees the weighted edit distance or similarity among the questions to be clustered themselves.
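A small worked example makes the lack of transitivity concrete. The three key strings below are hypothetical and are not the questions A, C, and P of Fig. 2; the costs (insertion 1, deletion 1, replacement 2) and the formula r = (sum - dist) / sum are those given in the description.

```latex
\begin{align*}
A &= \texttt{abcdefgh}, \quad C = \texttt{abcdef}, \quad P = \texttt{cdefgh} \\
r(A, C) &= \frac{(8 + 6) - 2}{8 + 6} = \frac{12}{14} \approx 0.857
  && \text{(delete g, h)} \\
r(A, P) &= \frac{(8 + 6) - 2}{8 + 6} = \frac{12}{14} \approx 0.857
  && \text{(delete a, b)} \\
r(C, P) &= \frac{(6 + 6) - 4}{6 + 6} = \frac{8}{12} \approx 0.667
  && \text{(delete a, b; insert g, h)}
\end{align*}
```

With a threshold of 0.8 and A as the center, both C and P join the class of A, even though C and P compared directly would fall below the threshold.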
Further, before the step of selecting a cluster-center question from all questions participating in clustering, the method also comprises unifying the question format, which includes: classifying and parsing the content of HTM question files that contain different character formats or formula images and converting them into LaTeX question text; and converting the LaTeX question text into a normally readable text format.
The questions participating in clustering contain not only text, figures, and punctuation. Mathematics questions in particular also contain many complex formulas, which may come from different formula editors. In addition, individual suppliers sometimes submit questions in text formats of their own definition that contain many errors and much noise, which makes comparing questions considerably harder. To make full use of the large number of questions, the invention first unifies the format of the questions participating in clustering and converts them into a normally readable text format, so that questions in all the different formats can be used.
Based on the same inventive concept, the present invention also provides a test question deduplication method. Fig. 3 is a flow chart of the deduplication method provided by an embodiment of the invention. As shown in Fig. 3, the method comprises:
Step S301: cluster the questions in the question group to be deduplicated using any of the clustering methods provided by the invention;
Step S302: delete the questions that belong to the same question class as the cluster-center question.
Building on the clustering method provided by the invention, setting the preset threshold used in the similarity judgment makes it possible to cluster out the questions that are highly similar to the cluster-center question. Such questions can almost always be regarded as duplicates of the cluster-center question; removing them leaves only the cluster-center question. The method thus achieves efficient and accurate deduplication of large-scale question sets.
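Once clustering is done, step S302 amounts to keeping one question per class. The sketch below assumes the clustering result is available as a list of (center, members) pairs; that data shape and the function name are hypothetical, and the labels reuse the letters of Fig. 2 purely for illustration.

```python
def deduplicate(clusters):
    """Keep each cluster-center question and drop the other members of its class."""
    kept = [center for center, _members in clusters]
    removed = [q for _center, members in clusters for q in members]
    return kept, removed


# Hypothetical clustering result: two classes, one of them a lone center.
clusters = [("A", ["C", "P"]), ("H", [])]
kept, removed = deduplicate(clusters)
```

Returning the removed questions alongside the kept ones is a design choice for this sketch: it lets a caller review what would be deleted before actually deleting it, which matters because the threshold decides how aggressive the deduplication is.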
Based on the same inventive concept, the present invention also provides a test question clustering system. Fig. 4 is block diagram one of the clustering system provided by an embodiment of the invention. As shown in Fig. 4, the system comprises: a cluster-center question selection module 11, an important key character determination module 12, a weighted edit distance calculation module 13, a similarity calculation module 14, and a question classification module 15, wherein:
the cluster-center question selection module 11 is connected to the important key character determination module 12, and is configured to select a cluster-center question from all questions participating in clustering and send the selected cluster-center question to the important key character determination module 12.
Optionally, all examination questions participating in clustering can be ranked comprehensively according to their creation time and quality evaluation, and the question ranked first is then selected as the cluster-center examination question. The quality evaluation can be measured by the difficulty value and the discrimination of the question. The difficulty value of a question is the average score rate of test takers on that question; it should normally lie in the interval 0.2 to 0.8, endpoints included, and a question whose difficulty value falls in this interval receives a higher quality grade. The discrimination of a question is the degree to which it distinguishes between test takers' levels of knowledge mastery; the higher the discrimination, the higher the quality grade. The comprehensive ranking is based on both creation time and quality evaluation, with quality evaluation taking precedence over creation time. For example, when two questions have the same creation time, the one with the higher quality grade is ranked further forward.
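The comprehensive ranking described above can be sketched as a two-level sort key. The following is a minimal illustration, not the invention's definitive procedure: the `Question` class, its integer `quality` grade, and the choice to prefer the earlier creation time among equally graded questions are all assumptions, since the text only states that quality evaluation takes precedence over creation time.

```python
from dataclasses import dataclass
from datetime import date
from typing import List


@dataclass
class Question:
    qid: str
    created: date
    quality: int  # hypothetical quality grade; higher is better


def choose_cluster_center(questions: List[Question]) -> Question:
    """Return the question ranked first in the comprehensive ranking."""
    # Quality grade is compared first (descending); creation time only
    # breaks ties. Preferring the earlier date is an assumption.
    return min(questions, key=lambda q: (-q.quality, q.created))


pool = [
    Question("a", date(2019, 1, 1), 2),
    Question("b", date(2019, 1, 1), 3),
    Question("c", date(2018, 5, 1), 3),
]
center = choose_cluster_center(pool)
```

With questions "a" and "b" alone, which share a creation time, the higher-graded "b" is ranked first, matching the example in the text.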
The important key character determination module 12 is connected with the weighted edit distance calculation module 13. It is used to determine the important key characters of the cluster-center examination question, denoted as the first character string, to determine the important key characters of the examination question to be clustered, denoted as the second character string, and to send the first character string and the second character string to the weighted edit distance calculation module 13. An important key character is a character whose addition, replacement or modification can change the meaning or type of an examination question;
The weighted edit distance calculation module 13 is connected with the similarity calculation module 14. It is used to calculate the weighted edit distance between the first character string and the second character string and to send the weighted edit distance to the similarity calculation module 14. The weighted edit distance is the minimum weighted number of operations needed to convert the first character string and the second character string into each other.
Optionally, in the clustering method provided by the present invention, the operations of the weighted edit distance comprise insertion, deletion and replacement. When the weighted number of operations is calculated, a deletion counts as one operation, an insertion counts as one operation, and a replacement counts as two operations. The weights of the weighted edit distance used in the present invention are directed at the important keywords in the examination questions. Because an important keyword affects the meaning or type of a question, a replacement is counted as two operations, which increases the influence of replacing an important keyword on the operation count and improves the accuracy of the subsequent similarity calculation.
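The operation weights above (insertion 1, deletion 1, replacement 2) fit the standard dynamic-programming scheme for edit distance. The sketch below is one possible implementation under that assumption; the function name `weighted_edit_distance` and its parameter names are illustrative and do not come from the source.

```python
def weighted_edit_distance(first: str, second: str,
                           insert_cost: int = 1,
                           delete_cost: int = 1,
                           replace_cost: int = 2) -> int:
    """Minimum total operation cost of turning `first` into `second`."""
    # prev[j]: cost of converting the processed prefix of `first`
    # into the first j characters of `second`.
    prev = [j * insert_cost for j in range(len(second) + 1)]
    for i, a in enumerate(first, start=1):
        curr = [i * delete_cost]  # delete all i characters seen so far
        for j, b in enumerate(second, start=1):
            substitute = prev[j - 1] + (0 if a == b else replace_cost)
            curr.append(min(prev[j] + delete_cost,      # delete a
                            curr[j - 1] + insert_cost,  # insert b
                            substitute))                # keep or replace
        prev = curr
    return prev[-1]


dist = weighted_edit_distance("kitten", "sitting")
```

Because the insertion and deletion costs are equal, the result is the same in both directions, which is consistent with the text's description of the two strings being converted into each other.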
The similarity calculation module 14 is connected with the examination question classification module 15. It is used to calculate, according to the weighted edit distance, the similarity between the examination question to be clustered and the cluster-center examination question, and to send the similarity to the examination question classification module 15. The similarity r is calculated as r = (sum - dist) / sum, where sum is the total length of the first character string and the second character string, and dist is the weighted edit distance;
The examination question classification module 15 is used to classify an examination question to be clustered whose similarity is greater than the preset threshold into the same examination question class as the cluster-center examination question.
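The similarity formula and the threshold test can be illustrated with a short sketch. The function names `similarity` and `in_same_class` are hypothetical, the weighted edit distance is passed in as an already computed value, and the handling of two empty strings is an assumption that the source does not address.

```python
def similarity(first: str, second: str, dist: int) -> float:
    """r = (sum - dist) / sum, with sum = len(first) + len(second)."""
    total = len(first) + len(second)
    if total == 0:
        # Assumption (not in the source): two empty key strings are identical.
        return 1.0
    return (total - dist) / total


def in_same_class(first: str, second: str, dist: int,
                  threshold: float) -> bool:
    """True when the question to be clustered joins the centre's class."""
    return similarity(first, second, dist) > threshold


# Two 4-character key strings that differ by one replacement (weighted cost 2).
r = similarity("abcd", "abcf", 2)
```

With insertion and deletion counted once and replacement counted twice, dist can never exceed sum (deleting every character of one string and inserting every character of the other costs exactly sum), so r always lies between 0 and 1.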
In one embodiment, Fig. 5 is the second block diagram of the examination question clustering system provided by an embodiment of the present invention. As shown in Fig. 5, the clustering system further comprises a format unification module 16. The format unification module 16 is connected with both the cluster-center examination question selection module 11 and the important key character determination module 12. It is used to perform classification, recognition and content parsing on htm examination question files containing different character formats or formula pictures, converting them into LaTeX examination question text; it is also used to convert the LaTeX examination question text into a normally readable text format. The examination questions participating in clustering contain not only text, figures and punctuation marks; questions in mathematics in particular also contain many complex formulas. These formulas may come from different equation editors, or each question supplier may submit text in a format of its own definition that contains many errors and much noise, which makes comparison between questions considerably harder. To make full use of a large number of examination questions, the present invention first unifies the format of the questions participating in clustering by converting them into a normally readable text format, so that examination questions in all the different formats can be used.
The important key character determination module 12 further comprises a character library construction module 121 and a character string determination module 122. The character library construction module 121 is used to construct an important keyword character library using the term frequency-inverse document frequency (TF-IDF) technique; the character string determination module 122 is used to determine the first character string and the second character string according to the important keyword character library.
Optionally, the important keyword character library is constructed in the present invention using a term frequency-inverse document frequency (TF-IDF) model. A large number of examination questions of the same subject (for example, 1,000,000 questions) serves as the data basis, and the TF-IDF model is used to pick out the important keywords from these questions, forming an important keyword character library that covers substantially all knowledge points of the subject. The first character string in the cluster-center examination question and the second character string in the examination question to be clustered are then determined according to the important keyword character library. The first character string and the second character string serve as the weighting basis of the weighted edit distance. Picking out the important keywords with a model built on a large number of questions ensures that the keywords are selected accurately, which in turn ensures the accuracy of the subsequent similarity calculation.
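The keyword library construction can be sketched with a plain TF-IDF computation. Everything specific here is an assumption made for illustration: the function names `build_keyword_library` and `key_string`, whitespace tokenization (Chinese text would need a word or character segmenter), the `top_k` cut-off, and the unsmoothed IDF formula. The source only states that a TF-IDF model is used.

```python
import math
from collections import Counter
from typing import Iterable, Set


def build_keyword_library(questions: Iterable[str], top_k: int = 2) -> Set[str]:
    """Collect the top_k highest-scoring TF-IDF tokens of every question."""
    docs = [q.split() for q in questions]
    n_docs = len(docs)
    # Number of questions in which each token appears at least once.
    doc_freq = Counter(tok for doc in docs for tok in set(doc))
    library: Set[str] = set()
    for doc in docs:
        counts = Counter(doc)
        scores = {tok: (c / len(doc)) * math.log(n_docs / doc_freq[tok])
                  for tok, c in counts.items()}
        # Sort by score (descending), then alphabetically for determinism.
        ranked = sorted(scores, key=lambda t: (-scores[t], t))
        library.update(ranked[:top_k])
    return library


def key_string(question: str, library: Set[str]) -> str:
    """Concatenate, in order, the tokens of a question found in the library."""
    return "".join(tok for tok in question.split() if tok in library)


corpus = [
    "find the area of the triangle",
    "find the perimeter of the square",
    "find the area of the circle",
]
library = build_keyword_library(corpus)
first_string = key_string(corpus[0], library)
```

Tokens that occur in every question, such as "find" or "the", receive a score of zero and stay out of the library, so the resulting key strings are much shorter than the full question text.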
Based on the same inventive concept, the present invention provides a deduplication system for examination questions. Fig. 6 is a block diagram of the examination question deduplication system provided by an embodiment of the present invention. As shown in Fig. 6, the system comprises any one of the examination question clustering systems provided by the present invention and further comprises an examination question deduplication module 17. The examination question deduplication module 17 is connected with the examination question classification module 15 and is used to receive the classification results sent by the examination question classification module and to delete the examination questions that belong to the same examination question class as the cluster-center examination question.
As the foregoing embodiments show, the examination question clustering method, deduplication method and systems provided by the present invention achieve at least the following beneficial effects:
(1) The clustering method provided by the present invention first chooses a cluster-center examination question from all examination questions participating in clustering. It then uses the important keywords in the cluster-center examination question and in the examination question to be clustered as the weighting basis, calculates the weighted edit distance between the important keywords of the two questions, and from it calculates the similarity between the cluster-center examination question and the examination question to be clustered in order to perform the clustering. Evaluating similarity by computing the edit distance between these shorter character strings simplifies the calculation and allows large numbers of examination questions to be clustered efficiently.
(2) With the clustering method provided by the present invention, the preset threshold used when judging similarity can be set so that only examination questions highly similar to the cluster-center examination question are clustered together. Such highly similar questions can almost always be regarded as duplicates of the cluster-center examination question. Once they are removed, only the cluster-center examination question is retained, so the method provided by the present invention can deduplicate large numbers of examination questions efficiently and accurately.
Although some specific embodiments of the present invention have been described in detail by way of example, those skilled in the art should understand that the above examples are for illustration only and are not intended to limit the scope of the present invention. Those skilled in the art should also understand that the above embodiments can be modified without departing from the scope and spirit of the present invention. The scope of the present invention is defined by the appended claims.
Claims (10)
1. A clustering method for examination questions, characterized by comprising:
choosing a cluster-center examination question from all examination questions participating in clustering;
determining the important key characters of the cluster-center examination question, denoted as a first character string, and determining the important key characters of an examination question to be clustered, denoted as a second character string, wherein an important key character is a character whose addition, replacement or modification can change the meaning or type of an examination question;
calculating the weighted edit distance between the first character string and the second character string, the weighted edit distance being the minimum weighted number of operations needed to convert the first character string and the second character string into each other;
calculating the similarity between the examination question to be clustered and the cluster-center examination question according to the weighted edit distance, wherein the similarity r is calculated as:
r = (sum - dist) / sum, wherein sum is the total length of the first character string and the second character string, and dist is the weighted edit distance;
classifying an examination question to be clustered whose similarity is greater than a preset threshold into the same examination question class as the cluster-center examination question.
2. The clustering method according to claim 1, characterized in that,
before the step of choosing a cluster-center examination question from all examination questions participating in clustering, the method further comprises:
unifying the examination question format, which comprises:
performing classification, recognition and content parsing on htm examination question files containing different character formats or formula pictures, and converting them into LaTeX examination question text;
converting the LaTeX examination question text into a normally readable text format.
3. The clustering method according to claim 1, characterized in that
the step of choosing a cluster-center examination question from all examination questions participating in clustering specifically comprises:
ranking all examination questions participating in clustering according to the creation time and the quality evaluation of the examination questions;
selecting the examination question ranked first as the cluster-center examination question.
4. The clustering method according to claim 1, characterized in that
the step of determining the important key characters of the cluster-center examination question, denoted as the first character string, and determining the important key characters of the examination question to be clustered, denoted as the second character string, comprises:
constructing an important keyword character library using a term frequency-inverse document frequency model;
determining the first character string and the second character string according to the important keyword character library.
5. The clustering method according to claim 1, characterized in that
the operations of the weighted edit distance comprise insertion, deletion and replacement; wherein, when the weighted number of operations is calculated, a deletion counts as one operation, an insertion counts as one operation, and a replacement counts as two operations.
6. A deduplication method for examination questions, characterized by comprising:
clustering the examination questions in the question set to be deduplicated using the examination question clustering method according to any one of claims 1 to 5;
deleting the examination questions that belong to the same examination question class as the cluster-center examination question.
7. A clustering system for examination questions, characterized by comprising: a cluster-center examination question selection module, an important key character determination module, a weighted edit distance calculation module, a similarity calculation module and an examination question classification module; wherein,
the cluster-center examination question selection module is connected with the important key character determination module and is used to choose a cluster-center examination question from all examination questions participating in clustering and to send the chosen cluster-center examination question to the important key character determination module;
the important key character determination module is connected with the weighted edit distance calculation module and is used to determine the important key characters of the cluster-center examination question, denoted as a first character string, to determine the important key characters of an examination question to be clustered, denoted as a second character string, and to send the first character string and the second character string to the weighted edit distance calculation module, wherein an important key character is a character whose addition, replacement or modification can change the meaning or type of an examination question;
the weighted edit distance calculation module is connected with the similarity calculation module and is used to calculate the weighted edit distance between the first character string and the second character string and to send the weighted edit distance to the similarity calculation module, the weighted edit distance being the minimum weighted number of operations needed to convert the first character string and the second character string into each other;
the similarity calculation module is connected with the examination question classification module and is used to calculate the similarity between the examination question to be clustered and the cluster-center examination question according to the weighted edit distance and to send the similarity to the examination question classification module, wherein the similarity r is calculated as: r = (sum - dist) / sum, wherein sum is the total length of the first character string and the second character string, and dist is the weighted edit distance;
the examination question classification module is used to classify an examination question to be clustered whose similarity is greater than a preset threshold into the same examination question class as the cluster-center examination question.
8. The clustering system according to claim 7, characterized in that
it further comprises a format unification module, the format unification module being connected with both the cluster-center examination question selection module and the important key character determination module and being used to perform classification, recognition and content parsing on htm examination question files containing different character formats or formula pictures and to convert them into LaTeX examination question text; it is also used to convert the LaTeX examination question text into a normally readable text format.
9. The clustering system according to claim 7, characterized in that
the important key character determination module further comprises a character library construction module and a character string determination module, wherein
the character library construction module is used to construct an important keyword character library using the term frequency-inverse document frequency technique;
the character string determination module is used to determine the first character string and the second character string according to the important keyword character library.
10. A deduplication system for examination questions, characterized by comprising the examination question clustering system according to any one of claims 7 to 9 and further comprising an examination question deduplication module, the examination question deduplication module being connected with the examination question classification module and being used to receive the examination question classification results sent by the examination question classification module and to delete the examination questions that belong to the same examination question class as the cluster-center examination question.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910680927.7A CN110390019A (en) | 2019-07-26 | 2019-07-26 | A kind of clustering method of examination question, De-weight method and system |
Publications (1)
Publication Number | Publication Date |
---|---|
CN110390019A true CN110390019A (en) | 2019-10-29 |
Family
ID=68287599
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910680927.7A Pending CN110390019A (en) | 2019-07-26 | 2019-07-26 | A kind of clustering method of examination question, De-weight method and system |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110390019A (en) |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112669181A (en) * | 2020-12-29 | 2021-04-16 | 吉林工商学院 | Assessment method for education practice training |
CN112989058A (en) * | 2021-05-10 | 2021-06-18 | 腾讯科技(深圳)有限公司 | Information classification method, test question classification method, device, server and storage medium |
WO2022170985A1 (en) * | 2021-02-09 | 2022-08-18 | 广州视源电子科技股份有限公司 | Exercise selection method and apparatus, and computer device and storage medium |
CN118132733A (en) * | 2024-05-07 | 2024-06-04 | 江西风向标智能科技有限公司 | Test question retrieval method, system, storage medium and electronic equipment |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US7062508B2 (en) * | 2000-09-05 | 2006-06-13 | Leonid Andreev | Method and computer-based system for non-probabilistic hypothesis generation and verification |
CN102629272A (en) * | 2012-03-14 | 2012-08-08 | 北京邮电大学 | Clustering based optimization method for examination system database |
CN105373594A (en) * | 2015-10-23 | 2016-03-02 | 广东小天才科技有限公司 | Method and apparatus for screening repeated test questions from question bank |
CN105824798A (en) * | 2016-03-03 | 2016-08-03 | 云南电网有限责任公司教育培训评价中心 | Examination question de-duplicating method of examination question base based on examination question key word likeness |
CN108898170A (en) * | 2018-06-19 | 2018-11-27 | 江苏中盈高科智能信息股份有限公司 | A kind of intelligent Auto-generating Test Paper method based on fuzzy cluster analysis |
CN109271401A (en) * | 2018-09-26 | 2019-01-25 | 杭州大拿科技股份有限公司 | Method, apparatus, electronic equipment and storage medium are corrected in a kind of search of topic |
Non-Patent Citations (1)
Title |
---|
疯狂的小猪: "Python Levenshtein 计算文本之间的距离", 《HTTPS://BLOG.CSDN.NET/U014657795/ARTICLE/DETAILS/90476489》 * |
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112669181A (en) * | 2020-12-29 | 2021-04-16 | 吉林工商学院 | Assessment method for education practice training |
CN112669181B (en) * | 2020-12-29 | 2023-06-30 | 吉林工商学院 | Assessment method for education practice training |
WO2022170985A1 (en) * | 2021-02-09 | 2022-08-18 | 广州视源电子科技股份有限公司 | Exercise selection method and apparatus, and computer device and storage medium |
CN112989058A (en) * | 2021-05-10 | 2021-06-18 | 腾讯科技(深圳)有限公司 | Information classification method, test question classification method, device, server and storage medium |
CN118132733A (en) * | 2024-05-07 | 2024-06-04 | 江西风向标智能科技有限公司 | Test question retrieval method, system, storage medium and electronic equipment |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | Application publication date: 20191029 |