CN110390019A - Examination question clustering method, deduplication method and system - Google Patents

Examination question clustering method, deduplication method and system

Info

Publication number
CN110390019A
Authority
CN
China
Prior art keywords
examination question
character
character string
examination
module
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910680927.7A
Other languages
Chinese (zh)
Inventor
谢楚鹏
李可佳
郭晨阳
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Jiangsu Qusu Education Technology Co Ltd
Original Assignee
Jiangsu Qusu Education Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Jiangsu Qusu Education Technology Co Ltd
Priority to CN201910680927.7A
Publication of CN110390019A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q50/00 Information and communication technology [ICT] specially adapted for implementation of business processes of specific business sectors, e.g. utilities or tourism
    • G06Q50/10 Services
    • G06Q50/20 Education
    • G06Q50/205 Education administration or guidance

Landscapes

  • Engineering & Computer Science (AREA)
  • Business, Economics & Management (AREA)
  • Theoretical Computer Science (AREA)
  • Tourism & Hospitality (AREA)
  • Physics & Mathematics (AREA)
  • Educational Technology (AREA)
  • Educational Administration (AREA)
  • General Physics & Mathematics (AREA)
  • Strategic Management (AREA)
  • General Health & Medical Sciences (AREA)
  • Marketing (AREA)
  • Primary Health Care (AREA)
  • Human Resources & Organizations (AREA)
  • General Business, Economics & Management (AREA)
  • Economics (AREA)
  • Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses an examination question clustering method, a deduplication method and a system. The clustering method comprises: selecting a cluster center question from all questions participating in clustering; determining the important key characters of the cluster center question, recorded as a first character string, and the important key characters of a question to be clustered, recorded as a second character string, where important key characters are characters whose addition, replacement or modification can change the meaning or type of a question; calculating the weighted edit distance between the first and second character strings, the weighted edit distance being the minimum weighted number of operations needed to convert one string into the other; calculating the similarity between the question to be clustered and the cluster center question from the weighted edit distance; and classifying a question to be clustered whose similarity is greater than a preset threshold into the same question class as the cluster center question. The invention enables efficient clustering of large-scale question sets.

Description

Examination question clustering method, deduplication method and system
Technical field
The present invention relates to the field of educational technology, and more particularly to an examination question clustering method, a deduplication method and a system.
Background
In the education sector, different question suppliers, such as examination centers, teaching-aid publishers, training institutions and the question-setting teachers of individual schools, provide large numbers of examination questions. With the application of digital information in education, these suppliers also provide questions to users through online platforms or terminal software. Such large collections inevitably contain many questions of the same type or with high similarity.
Therefore, providing an examination question clustering method, deduplication method and system that cluster large-scale question sets efficiently is a technical problem to be solved urgently in this field.
Summary of the invention
In view of this, the present invention provides an examination question clustering method, a deduplication method and a system that solve the above technical problem.
In a first aspect, the present invention provides an examination question clustering method, comprising:
selecting a cluster center question from all questions participating in clustering;
determining the important key characters of the cluster center question, recorded as a first character string, and determining the important key characters of a question to be clustered, recorded as a second character string, wherein the important key characters are characters whose addition, replacement or modification can change the meaning or type of a question;
calculating the weighted edit distance between the first character string and the second character string, the weighted edit distance being the minimum weighted number of operations required to convert between the first character string and the second character string;
calculating the similarity between the question to be clustered and the cluster center question according to the weighted edit distance, wherein the similarity r is calculated as:
r = (sum - dist) / sum, where sum is the total length of the first character string and the second character string, and dist is the weighted edit distance;
classifying a question to be clustered whose similarity is greater than a preset threshold into the same question class as the cluster center question.
Optionally, before the step of selecting a cluster center question from all questions participating in clustering, the method further comprises:
unifying the question format, which includes:
classifying and parsing htm question files that contain different character formats or formula pictures, and converting them into LaTeX question text;
converting the LaTeX question text into a normally readable text format.
Optionally, the step of selecting a cluster center question from all questions participating in clustering specifically comprises:
ranking all questions participating in clustering according to their creation time and quality evaluation;
selecting the question ranked first as the cluster center question.
Optionally, the step of determining the important key characters of the cluster center question, recorded as the first character string, and the important key characters of the question to be clustered, recorded as the second character string, comprises:
constructing an important keyword character library using a term frequency-inverse document frequency (TF-IDF) model;
determining the first character string and the second character string according to the important keyword character library.
Optionally, the operations of the weighted edit distance include insertion, deletion and replacement; when the weighted operation count is calculated, a deletion counts as one operation, an insertion counts as one operation, and a replacement counts as two operations.
In a second aspect, the present invention also provides an examination question deduplication method, comprising: clustering the questions in a question group to be deduplicated using any examination question clustering method provided by the present invention;
deleting the questions that belong to the same question class as the cluster center question.
In a third aspect, the present invention provides an examination question clustering system, comprising: a cluster center question selection module, an important key character determination module, a weighted edit distance calculation module, a similarity calculation module and a question classification module; wherein,
the cluster center question selection module is connected to the important key character determination module and is configured to select a cluster center question from all questions participating in clustering and to send the selected cluster center question to the important key character determination module;
the important key character determination module is connected to the weighted edit distance calculation module and is configured to determine the important key characters of the cluster center question, recorded as a first character string, to determine the important key characters of a question to be clustered, recorded as a second character string, and to send the first character string and the second character string to the weighted edit distance calculation module, the important key characters being characters whose addition, replacement or modification can change the meaning or type of a question;
the weighted edit distance calculation module is connected to the similarity calculation module and is configured to calculate the weighted edit distance between the first character string and the second character string and to send it to the similarity calculation module, the weighted edit distance being the minimum weighted number of operations required to convert between the first character string and the second character string;
the similarity calculation module is connected to the question classification module and is configured to calculate the similarity between the question to be clustered and the cluster center question according to the weighted edit distance and to send the similarity to the question classification module, wherein the similarity r is calculated as r = (sum - dist) / sum, where sum is the total length of the first character string and the second character string, and dist is the weighted edit distance;
the question classification module is configured to classify a question to be clustered whose similarity is greater than a preset threshold into the same question class as the cluster center question.
Optionally, the system further comprises a format unification module, which is connected to both the cluster center question selection module and the important key character determination module and is configured to classify and parse htm question files that contain different character formats or formula pictures and convert them into LaTeX question text, and further to convert the LaTeX question text into a normally readable text format.
Optionally, the important key character determination module further comprises a character library construction module and a character string determination module, wherein,
the character library construction module is configured to construct an important keyword character library using the term frequency-inverse document frequency (TF-IDF) technique;
the character string determination module is configured to determine the first character string and the second character string according to the important keyword character library.
In a fourth aspect, the present invention provides an examination question deduplication system, comprising any examination question clustering system provided by the present invention and further comprising a question deduplication module, which is connected to the question classification module and is configured to receive the classification results sent by the question classification module and to delete the questions that belong to the same question class as the cluster center question.
Compared with the prior art, the examination question clustering method, deduplication method and system provided by the present invention achieve at least the following beneficial effects:
(1) The clustering method first selects a cluster center question from all questions participating in clustering, then uses the important keywords in the cluster center question and in the question to be clustered as weights, calculates the weighted edit distance between the important keywords of the two questions, and from it calculates their similarity to perform clustering. Evaluating similarity with an edit distance computed over shorter character strings simplifies the calculation and enables efficient clustering of large-scale question sets.
(2) Based on the clustering method, setting the preset threshold used when judging similarity makes it possible to cluster the questions that are highly similar to the cluster center question. Such highly similar questions can almost be regarded as duplicates of the cluster center question; after they are removed, only the cluster center question is retained. The method can therefore deduplicate large-scale question sets efficiently and accurately.
Of course, a product implementing the present invention does not necessarily need to achieve all of the above technical effects at the same time.
Other features and advantages of the present invention will become apparent from the following detailed description of exemplary embodiments with reference to the accompanying drawings.
Description of the drawings
The accompanying drawings, which are incorporated in and constitute a part of the specification, illustrate embodiments of the present invention and, together with the description, serve to explain its principles.
Fig. 1 is a flow chart of the examination question clustering method provided by an embodiment of the present invention;
Fig. 2 is a schematic diagram of examination question clusters generated using the clustering method provided by the present invention;
Fig. 3 is a flow chart of the examination question deduplication method provided by an embodiment of the present invention;
Fig. 4 is block diagram one of the examination question clustering system provided by an embodiment of the present invention;
Fig. 5 is block diagram two of the examination question clustering system provided by an embodiment of the present invention;
Fig. 6 is a block diagram of the examination question deduplication system provided by an embodiment of the present invention.
Detailed description
Various exemplary embodiments of the present invention will now be described in detail with reference to the accompanying drawings. It should be noted that, unless specifically stated otherwise, the relative arrangement of components and steps, the numerical expressions and the numerical values set forth in these embodiments do not limit the scope of the present invention.
The following description of at least one exemplary embodiment is merely illustrative and in no way limits the present invention, its application or its uses.
Techniques, methods and apparatus known to those of ordinary skill in the relevant art may not be discussed in detail, but where appropriate they should be regarded as part of the specification.
In all examples shown and discussed here, any specific value should be interpreted as merely illustrative and not as a limitation. Other examples of the exemplary embodiments may therefore have different values.
It should be noted that similar reference numerals and letters denote similar items in the following drawings; once an item is defined in one drawing, it need not be discussed further in subsequent drawings.
In the related art, for a small number of questions, the questions can be compared one by one based on edit distance to judge how similar they are, and highly similar questions can be removed. Calculating this edit distance usually requires counting the operations needed to convert between the full texts of two questions, which makes the calculation relatively complex. On this basis, the present invention provides a clustering method that uses the important keywords in the questions as weights and calculates a weighted edit distance to evaluate similarity. Evaluating similarity with an edit distance computed over shorter character strings simplifies the calculation, enables efficient clustering of large-scale question sets, and further allows questions to be deduplicated.
The present invention provides an examination question clustering method. Fig. 1 is a flow chart of the clustering method provided by an embodiment of the present invention. As shown in Fig. 1, the clustering method comprises:
Step S101: select a cluster center question from all questions participating in clustering.
Optionally, all questions participating in clustering can be ranked comprehensively according to their creation time and quality evaluation, and the question ranked first is selected as the cluster center question. Quality evaluation can be measured by the difficulty value and the discrimination of a question. The difficulty value is the average score rate of test-takers on the question and usually needs to lie in the interval 0.2 to 0.8, endpoints included; a question whose difficulty value lies in this interval receives a higher quality evaluation grade. Discrimination refers to how well a question distinguishes test-takers' mastery of the knowledge; the higher the discrimination, the higher the quality evaluation grade. The comprehensive ranking is based on creation time and quality evaluation, with quality evaluation taking priority over creation time; for example, when two questions have the same creation time, the one with the higher quality evaluation grade is ranked earlier.
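The comprehensive ranking described above can be sketched as follows. This is only an illustration under stated assumptions: the `Question` fields, the way the two quality criteria are combined (difficulty in range first, then discrimination), and the choice to rank earlier creation times first are all hypothetical, since the text specifies only that quality evaluation takes priority over creation time.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class Question:
    # Hypothetical record layout; field names are not from the source text.
    qid: str
    created_at: datetime
    difficulty: float       # average score rate of test-takers, 0..1
    discrimination: float   # higher means the question separates test-takers better


def rank_questions(questions):
    """Rank questions: quality evaluation first, creation time second."""
    def sort_key(q):
        in_range = 0.2 <= q.difficulty <= 0.8   # endpoints included
        # Assumption: in-range difficulty outranks out-of-range, then higher
        # discrimination wins, then the earlier creation time wins.
        return (not in_range, -q.discrimination, q.created_at)
    return sorted(questions, key=sort_key)


questions = [
    Question("q1", datetime(2019, 1, 1), difficulty=0.9, discrimination=0.9),
    Question("q2", datetime(2019, 3, 1), difficulty=0.5, discrimination=0.4),
    Question("q3", datetime(2019, 2, 1), difficulty=0.5, discrimination=0.4),
    Question("q4", datetime(2019, 1, 1), difficulty=0.5, discrimination=0.3),
]
ranked = rank_questions(questions)
cluster_center = ranked[0]   # the question ranked first becomes the cluster center
print(cluster_center.qid)
```

Here q1 has the highest discrimination but a difficulty outside 0.2 to 0.8, so under these assumptions it ranks last, and the tie between q2 and q3 is broken by creation time.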
Step S102: determine the important key characters of the cluster center question, recorded as the first character string, and determine the important key characters of the question to be clustered, recorded as the second character string. Important key characters are characters whose addition, replacement or modification can change the meaning or type of a question.
Take the following mathematics question as an example: Given vectors a = (cos(3x/4), sin(3x/4)) and b = (cos(x/4 + π/3), -sin(x/4 + π/3)), let f(x) = (a + b)^2. (1) Find the analytic expression of f(x) and its intervals of monotonic increase; (2) if x ∈ [-π/6, 5π/6], find the maximum and minimum values of f(x); (3) if f(x) = 5/2, find the value of sin(x - π/6).
In the above question, the word "vector" and the vector symbol "→" are significant markers of the knowledge point being tested. Adding, replacing or deleting important keywords of this kind significantly changes the meaning and classification of the question. The present invention therefore gives such important keywords a larger weight in the weighted edit distance.
Optionally, the present invention constructs the important keyword character library using a term frequency-inverse document frequency (TF-IDF) model. A large number of questions from the same subject (for example, one million questions) serve as the data basis; the TF-IDF model picks out the important keywords from them, forming an important keyword character library that covers essentially all knowledge points of the subject. The first character string in the cluster center question and the second character string in the question to be clustered are then determined according to this library. The first and second character strings serve as the weights of the weighted edit distance. Picking out important keywords with a model over a large number of questions ensures that keyword selection in the clustering method is accurate, which in turn ensures the accuracy of the subsequent similarity calculation.
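A minimal sketch of how such a library might be built and applied is given below. It assumes the questions are already tokenized, uses a plain TF-IDF score (term frequency times log(N / document frequency)), and keeps the `top_k` highest-scoring terms; the function names, the `top_k` cutoff and the rule of concatenating matching tokens into the key string are illustrative assumptions, not details taken from the patent.

```python
import math
from collections import Counter


def build_key_library(tokenized_questions, top_k):
    """Return the top_k terms ranked by their best TF-IDF score in any question."""
    n_docs = len(tokenized_questions)
    doc_freq = Counter()
    for tokens in tokenized_questions:
        doc_freq.update(set(tokens))           # count each term once per question

    best_score = {}
    for tokens in tokenized_questions:
        if not tokens:
            continue
        term_freq = Counter(tokens)
        for term, count in term_freq.items():
            tf = count / len(tokens)
            idf = math.log(n_docs / doc_freq[term])
            best_score[term] = max(best_score.get(term, 0.0), tf * idf)

    # Sort by descending score; break ties alphabetically for determinism.
    ranked = sorted(best_score, key=lambda t: (-best_score[t], t))
    return set(ranked[:top_k])


def key_string(tokens, library):
    """Concatenate, in order, the tokens of one question that are in the library."""
    return "".join(t for t in tokens if t in library)


corpus = [
    ["find", "the", "vector", "sum"],
    ["find", "the", "limit"],
    ["find", "the", "area"],
]
library = build_key_library(corpus, top_k=2)
print(library, key_string(corpus[1], library))
```

Terms such as "find" and "the" occur in every question, so their inverse document frequency is zero and they never enter the library; this is the property that lets the key strings stay much shorter than the full question text.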
Step S103: calculate the weighted edit distance between the first character string and the second character string. The weighted edit distance is the minimum weighted number of operations required to convert between the first character string and the second character string.
Optionally, in the clustering method provided by the present invention, the operations of the weighted edit distance include insertion, deletion and replacement. When the weighted operation count is calculated, a deletion counts as one operation, an insertion counts as one operation, and a replacement counts as two operations. The weights of the weighted edit distance apply to the important keywords in the questions. Because important keywords affect the meaning or type of a question, counting a replacement as two operations increases the influence of keyword replacement on the operation count and improves the accuracy of the subsequent similarity calculation.
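The operation costs above (insertion 1, deletion 1, replacement 2) can be illustrated with the standard dynamic-programming formulation of edit distance. The sketch below is one possible implementation, not necessarily the one used by the inventors; the function name is hypothetical.

```python
def weighted_edit_distance(first: str, second: str) -> int:
    """Minimum weighted operation count needed to turn `first` into `second`.

    Insertion and deletion cost 1 each; replacement costs 2.
    """
    insert_cost, delete_cost, replace_cost = 1, 1, 2

    # prev[j] holds the distance between the processed prefix of `first`
    # and the prefix second[:j]; for the empty prefix this is j insertions.
    prev = [j * insert_cost for j in range(len(second) + 1)]
    for i, ch_a in enumerate(first, start=1):
        curr = [i * delete_cost]               # first[:i] -> "" needs i deletions
        for j, ch_b in enumerate(second, start=1):
            keep_or_replace = prev[j - 1] + (0 if ch_a == ch_b else replace_cost)
            curr.append(min(prev[j] + delete_cost,       # delete ch_a
                            curr[j - 1] + insert_cost,   # insert ch_b
                            keep_or_replace))
        prev = curr
    return prev[-1]


print(weighted_edit_distance("kitten", "sitting"))  # two replacements + one insertion
```

With these costs the distance can never exceed the combined length of the two strings (delete everything, then insert everything), which keeps the similarity r = (sum - dist) / sum between 0 and 1.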
Step S104: calculate the similarity between the question to be clustered and the cluster center question according to the weighted edit distance, wherein the similarity r is calculated as:
r = (sum - dist) / sum, where sum is the total length of the first character string and the second character string, and dist is the weighted edit distance;
Step S105: classify a question to be clustered whose similarity is greater than the preset threshold into the same question class as the cluster center question.
Further, the clustering method also includes: when the similarities of all questions to be clustered are less than or equal to the preset threshold, treating the cluster center question alone as one question class.
Optionally, the preset threshold is 0.8. When the calculated similarity is greater than 0.8, the question to be clustered and the cluster center question are classified into the same question class; when the calculated similarity is less than or equal to 0.8, they are placed in different question classes.
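As a small worked illustration of the formula and the threshold test, the sketch below takes an already computed weighted edit distance as input. The function names are hypothetical, and returning 1.0 for two empty strings is an assumption, since the text does not cover that case.

```python
def similarity(first: str, second: str, dist: int) -> float:
    """r = (sum - dist) / sum, with sum the combined length of both strings."""
    total = len(first) + len(second)
    if total == 0:
        return 1.0   # assumption: two empty key strings are treated as identical
    return (total - dist) / total


def in_same_class(r: float, threshold: float = 0.8) -> bool:
    """Strictly greater than the threshold, as the text specifies."""
    return r > threshold


# One replacement (cost 2) between two 5-character strings: r = (10 - 2) / 10
r_boundary = similarity("abcde", "abcdf", dist=2)
# One deletion (cost 1) between a 10- and a 9-character string: r = (19 - 1) / 19
r_close = similarity("abcdefghij", "abcdefghi", dist=1)
print(r_boundary, in_same_class(r_boundary))
print(r_close, in_same_class(r_close))
```

The first pair lands exactly on 0.8 and is therefore not merged, because the comparison is strict; the second pair, at about 0.947, joins the class of the cluster center question.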
Steps S102 to S105 describe how, once the cluster center question has been selected, the similarity of one question to be clustered is calculated and the question is classified. After steps S102 to S105 have been performed for all questions to be clustered, a question cluster centered on the cluster center question is generated, and the questions in this cluster have a high similarity to the cluster center question.
Further, after a question cluster centered on the cluster center question has been generated, the remaining questions that were not assigned to this cluster can go through the next round of clustering. Optionally, another cluster center question is selected from the remaining questions and steps S102 to S105 are performed again, producing another question cluster. This is repeated until all questions to be clustered have been clustered.
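The round-by-round procedure can be sketched as a simple loop. The sketch assumes the input list is already in ranked order, so that the first remaining question serves as the next cluster center, and it takes the similarity function as a parameter; both choices, and all names, are illustrative assumptions.

```python
def cluster_questions(questions, similarity, threshold=0.8):
    """Repeatedly pick a center and gather every remaining question similar to it.

    `questions` is assumed to be pre-ranked; `similarity(center, other)` is any
    function returning a value in [0, 1].
    Returns a list of (center, members) pairs.
    """
    remaining = list(questions)
    clusters = []
    while remaining:
        center, candidates = remaining[0], remaining[1:]
        members, leftover = [], []
        for question in candidates:
            if similarity(center, question) > threshold:
                members.append(question)     # same question class as the center
            else:
                leftover.append(question)    # waits for a later round
        clusters.append((center, members))   # members may be empty: a class of one
        remaining = leftover
    return clusters


# Toy similarity for demonstration only: questions sharing a first letter match.
def toy_similarity(a, b):
    return 1.0 if a[0] == b[0] else 0.0


result = cluster_questions(["a1", "a2", "b1", "a3", "b2"], toy_similarity)
print(result)
```

Each question is compared only with cluster centers, never with other members, so the number of comparisons depends on the number of clusters rather than on all pairs of questions.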
Around each cluster center question there are questions whose similarity is higher than the preset threshold, forming a similar-question cluster. Depending on the actual question data and the preset threshold, the clusters differ in size, that is, in the number of questions they contain. Fig. 2 is a schematic diagram of question clusters generated using the clustering method provided by the present invention. As shown in Fig. 2, similar-question cluster (a) and similar-question cluster (b) use different preset thresholds. The cluster center question of cluster (a) is A, and that of cluster (b) is H. The numbers in the figure are the similarities between each question and its cluster center question.
It should be noted that, taking similar-question cluster (a) as an example, questions C and P have the same similarity to question A (both 0.9), yet the similarity between C and P is not necessarily high: similarity is not transitive in the clustering method provided by the present invention. The method only calculates the weighted edit distance between each question to be clustered and the cluster center question, and so guarantees only their similarity to the cluster center question; it neither calculates nor guarantees the weighted edit distance or similarity among the questions to be clustered themselves.
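A small constructed example makes the lack of transitivity concrete. The key strings below are invented for illustration (they are not the questions of Fig. 2), and the distance uses the costs stated earlier: insertion 1, deletion 1, replacement 2.

```python
from functools import lru_cache


def weighted_distance(a: str, b: str) -> int:
    """Weighted edit distance with insert = 1, delete = 1, replace = 2."""
    @lru_cache(maxsize=None)
    def d(i: int, j: int) -> int:
        if i == len(a):
            return len(b) - j            # insert the rest of b
        if j == len(b):
            return len(a) - i            # delete the rest of a
        if a[i] == b[j]:
            return d(i + 1, j + 1)       # matching characters cost nothing
        return min(1 + d(i + 1, j),      # delete a[i]
                   1 + d(i, j + 1),      # insert b[j]
                   2 + d(i + 1, j + 1))  # replace a[i] with b[j]
    return d(0, 0)


def r(a: str, b: str) -> float:
    total = len(a) + len(b)
    return (total - weighted_distance(a, b)) / total


center = "abcdefghij"   # hypothetical key string of the cluster center
c = "abcdefgh"          # center with its last two characters removed
p = "cdefghij"          # center with its first two characters removed

print(r(center, c), r(center, p), r(c, p))
```

Both c and p exceed a 0.8 threshold against the center (16/18 each), but their mutual similarity is only 12/16 = 0.75, so they would not be grouped if compared directly with each other.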
Further, before the step of selecting a cluster center question from all questions participating in clustering, the method also includes unifying the question format, which comprises: classifying and parsing htm question files that contain different character formats or formula pictures and converting them into LaTeX question text; and converting the LaTeX question text into a normally readable text format.
The questions participating in clustering contain not only text, figures and punctuation marks. Mathematics questions in particular also contain many complicated formulas, which may come from different formula editors; in addition, individual question suppliers sometimes submit text formats of their own definition that contain many errors and much noise. All of this makes comparison between questions difficult. To make full use of the large number of questions, the present invention first unifies the format of the questions participating in clustering, converting them into a normally readable text format so that questions of all the different formats can be used.
Based on the same inventive concept, the present invention also provides an examination question deduplication method. Fig. 3 is a flow chart of the deduplication method provided by an embodiment of the present invention. As shown in Fig. 3, the method comprises:
Step S301: cluster the questions in the question group to be deduplicated using any examination question clustering method provided by the present invention;
Step S302: delete the questions that belong to the same question class as the cluster center question.
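Given clustering output, step S302 amounts to keeping one representative per class. The sketch below assumes the clusters are supplied as (center, members) pairs, a hypothetical representation chosen for illustration.

```python
def deduplicate(clusters):
    """Keep each cluster center and report the members that are deleted.

    `clusters` is a list of (center, members) pairs.
    Returns (kept, removed).
    """
    kept = [center for center, _members in clusters]
    removed = [q for _center, members in clusters for q in members]
    return kept, removed


# Hypothetical clustering result: A groups C and P, H has no near-duplicates.
kept, removed = deduplicate([("A", ["C", "P"]), ("H", [])])
print(kept, removed)
```

Because the center is the top-ranked question of its class, this choice also keeps the best-evaluated version of each group of near-duplicates.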
Based on the clustering method provided by the present invention, setting the preset threshold used when judging similarity makes it possible to cluster the questions that are highly similar to the cluster center question. Such highly similar questions can almost be regarded as duplicates of the cluster center question; after they are removed, only the cluster center question is retained. The method provided by the present invention can therefore deduplicate large-scale question sets efficiently and accurately.
Based on the same inventive concept, the present invention also provides a kind of clustering system of examination question, Fig. 4 provides for the embodiment of the present invention Examination question clustering system block diagram one, as shown in Figure 4, comprising: cluster centre examination question choose module 11, important key character determine Module 12, weighing edit distance computing module 13, similarity calculation module 14 and examination question classifying module 15;Wherein,
Cluster centre examination question chooses module 11, is connected with important key character determining module 12, in all participations Cluster centre examination question is chosen in the examination question of cluster, and the cluster centre examination question of selection is sent to important key character determining module 12。
Optionally, can according to the creation time of examination question and the quality evaluation of examination question, to all examination questions for participating in cluster into Row is integrated ordered;Then it selects integrated ordered to be primary examination question as cluster centre examination question.Wherein, quality evaluation can be used The difficulty value and examination question discrimination of examination question are measured, and the difficulty value of examination question is average rate of the subject for the examination question, are led to It often needs in 0.2~0.8 this section, including endpoint value, if the higher grade of quality evaluation in this section. Examination question discrimination refers to examination question to the size of the resolution capability of subject's ' Current Knowledge Regarding, and examination question discrimination is higher, then quality The higher grade of evaluation.Integrated ordered to carry out with the creation time and quality evaluation of examination question, wherein quality evaluation is better than creation Time, for example, the then higher examination question sequence of quality evaluation rank is located further forward when examination question creation time is identical.
The important key character determination module 12 is connected to the weighted edit distance calculation module 13. It determines the important key characters of the cluster center question, recorded as the first character string, determines the important key characters of a question to be clustered, recorded as the second character string, and sends both strings to the weighted edit distance calculation module 13. Important key characters are characters whose addition, replacement, or modification can change the meaning or type of a question.
The weighted edit distance calculation module 13 is connected to the similarity calculation module 14. It calculates the weighted edit distance between the first character string and the second character string and sends it to the similarity calculation module 14. The weighted edit distance is the minimum weighted number of operations needed to convert one of the two strings into the other.
Optionally, in the clustering method provided by the invention, the operations of the weighted edit distance are insertion, deletion, and replacement. When counting weighted operations, a deletion counts as one operation, an insertion counts as one operation, and a replacement counts as two operations. The weighting targets the important keywords in a question: because replacing an important keyword affects the meaning or type of the question, a replacement is counted as two operations so that it weighs more heavily in the operation count, which improves the accuracy of the subsequent similarity calculation.
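The operation counts above (insertion 1, deletion 1, replacement 2) can be turned into a standard dynamic-programming edit distance. The sketch below is one possible reading, not the patent's own implementation; the function name and parameter names are illustrative.

```python
def weighted_edit_distance(first: str, second: str,
                           ins_cost: int = 1, del_cost: int = 1,
                           sub_cost: int = 2) -> int:
    """Minimum weighted number of operations turning `first` into `second`."""
    # prev[j] = cost of turning the processed prefix of `first` into second[:j]
    prev = [j * ins_cost for j in range(len(second) + 1)]
    for i, a in enumerate(first, start=1):
        curr = [i * del_cost]  # delete the first i characters to reach ""
        for j, b in enumerate(second, start=1):
            curr.append(min(
                prev[j] + del_cost,                         # delete a
                curr[j - 1] + ins_cost,                     # insert b
                prev[j - 1] + (0 if a == b else sub_cost),  # keep or replace
            ))
        prev = curr
    return prev[-1]
```

With these costs a replacement is exactly as expensive as a deletion followed by an insertion, so the distance never exceeds the sum of the two string lengths, and the similarity r = (sum - dist) / sum stays between 0 and 1.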
The similarity calculation module 14 is connected to the question classification module 15. It calculates the similarity between the question to be clustered and the cluster center question from the weighted edit distance and sends the similarity to the question classification module 15. The similarity r is calculated as r = (sum - dist) / sum, where sum is the sum of the lengths of the first character string and the second character string, and dist is the weighted edit distance.
The question classification module 15 assigns every question to be clustered whose similarity is greater than a preset threshold to the same question class as the cluster center question.
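As a small worked illustration of the similarity formula and the threshold test, the sketch below takes the two string lengths and an already computed weighted edit distance. The threshold value 0.9 and the handling of two empty strings are assumptions; the text only speaks of a preset threshold.

```python
def similarity(first_len: int, second_len: int, dist: int) -> float:
    """r = (sum - dist) / sum, with sum the total length of both strings."""
    total = first_len + second_len
    if total == 0:
        # Assumption: the text does not cover two empty key strings;
        # returning 0.0 keeps such questions out of the cluster.
        return 0.0
    return (total - dist) / total


def same_class(first_len: int, second_len: int, dist: int,
               threshold: float = 0.9) -> bool:
    # Strictly greater than the threshold, as in the description.
    # The default 0.9 is a hypothetical value.
    return similarity(first_len, second_len, dist) > threshold
```

For example, two key strings of length 3 that differ by one replacement have dist = 2 and sum = 6, giving r = 4/6, which is below the hypothetical threshold, so the question would not join the cluster.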
In one embodiment, Fig. 5 is a second block diagram of the clustering system provided by an embodiment of the present invention. As shown in Fig. 5, the clustering system further comprises a format unification module 16, which is connected to both the cluster center question selection module 11 and the important key character determination module 12. It classifies and parses htm question files that contain different character formats or formula pictures and converts them into LaTeX question text; it also converts the LaTeX question text into a text format that can be read normally. The questions participating in clustering contain not only text, figures, and punctuation; mathematics questions in particular also contain many complex formulas. These formulas may come from different equation editors, and question suppliers sometimes submit text in formats of their own definition that contain many errors and much noise, which makes comparing questions difficult. To make full use of a large number of questions, the invention first unifies the format of the questions participating in clustering by converting them into a normally readable text format, so that questions in all the different formats can be used.
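The second half of the format unification step, turning LaTeX question text into plainly readable text, might look roughly like the toy sketch below. It is an assumption-laden illustration: the symbol table, the `(a)/(b)` rendering of fractions, and the function name are all hypothetical, and the recognition of formula pictures and htm parsing are not covered.

```python
import re

# Hypothetical, deliberately tiny symbol table; a real converter needs far more.
_SYMBOLS = {r"\times": "×", r"\div": "÷", r"\leq": "≤", r"\geq": "≥"}
# Matches only fractions whose arguments contain no nested braces.
_FRAC = re.compile(r"\\frac\{([^{}]*)\}\{([^{}]*)\}")


def latex_to_plain(text: str) -> str:
    # Rewrite innermost \frac{a}{b} first and repeat until none is left,
    # so nested fractions are unwrapped from the inside out.
    while True:
        text, count = _FRAC.subn(r"(\1)/(\2)", text)
        if count == 0:
            break
    for command, char in _SYMBOLS.items():
        text = text.replace(command, char)
    # Drop math delimiters and collapse runs of whitespace.
    text = text.replace("$", "")
    return re.sub(r"\s+", " ", text).strip()
```

Normalizing every question to one plain representation before comparison means that the same formula typed in two different editors yields the same characters, which is what the later string comparison relies on.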
The important key character determination module 12 further comprises a character library construction module 121 and a character string determination module 122. The character library construction module 121 constructs an important keyword library using the term frequency-inverse document frequency (TF-IDF) technique; the character string determination module 122 determines the first character string and the second character string according to the important keyword library.
Optionally, the invention constructs the important keyword library with a TF-IDF model. Taking a large number of questions of the same subject as the data basis (for example, one million questions), the TF-IDF model picks out the important keywords, forming a library that covers essentially all knowledge points of the subject. The first character string in the cluster center question and the second character string in the question to be clustered are then determined according to this library, and the two strings serve as the weights of the weighted edit distance. Selecting the important keywords with a model over a large number of questions ensures that the keywords are selected accurately in the clustering method, and thereby ensures the accuracy of the subsequent similarity calculation.
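A minimal sketch of building such a keyword library with TF-IDF is given below. It assumes the questions are already tokenized, keeps the `top_k` highest-scoring tokens of each question, and forms a key string by concatenating a question's library tokens in order of appearance; all three choices, and the function names, are assumptions, since the text does not say how keywords are chosen from the scores.

```python
import math
from collections import Counter
from typing import List, Set


def build_keyword_library(docs: List[List[str]], top_k: int = 2) -> Set[str]:
    """Collect the top_k TF-IDF tokens of every tokenized question."""
    n = len(docs)
    # Document frequency: in how many questions each token occurs.
    df = Counter(tok for doc in docs for tok in set(doc))
    library: Set[str] = set()
    for doc in docs:
        tf = Counter(doc)
        scores = {t: (c / len(doc)) * math.log(n / df[t]) for t, c in tf.items()}
        # Highest score first; ties broken alphabetically for determinism.
        best = sorted(scores, key=lambda t: (-scores[t], t))[:top_k]
        # Tokens occurring in every question score 0 and are left out.
        library.update(t for t in best if scores[t] > 0)
    return library


def key_string(tokens: List[str], library: Set[str]) -> str:
    # Keep only library tokens, in their original order.
    return "".join(t for t in tokens if t in library)
```

Tokens that appear in every question receive an inverse document frequency of zero and never enter the library, which is the property that lets the key strings stay much shorter than the full question text.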
Based on the same inventive concept, the present invention provides a deduplication system for examination questions. Fig. 6 is a block diagram of the deduplication system provided by an embodiment of the present invention. As shown in Fig. 6, the system comprises any one of the clustering systems provided by the invention and further comprises a question deduplication module 17. The question deduplication module 17 is connected to the question classification module 15; it receives the classification results sent by the question classification module 15 and deletes the questions that belong to the same question class as the cluster center question.
It can be seen from the foregoing embodiments that the clustering method, deduplication method, and systems for examination questions provided by the invention achieve at least the following beneficial effects:
(1) The clustering method provided by the invention first selects a cluster center question from all the questions participating in clustering, then uses the important keywords in the cluster center question and in the question to be clustered as weights to calculate the weighted edit distance between the important keywords of the two questions, and from it the similarity between the two questions, thereby realizing cluster analysis. Evaluating similarity with an edit distance computed over shorter character strings simplifies the calculation and allows large sets of questions to be clustered efficiently.
(2) Building on the clustering method provided by the invention, setting the preset threshold used when judging similarity makes it possible to cluster out the questions that are highly similar to the cluster center question. Such questions can almost always be judged to be duplicates of the cluster center question; removing them leaves only the cluster center question, so the method provided by the invention can deduplicate large sets of questions efficiently and accurately.
Although some specific embodiments of the invention have been described in detail by way of example, those skilled in the art should understand that the above examples are for illustration only and are not intended to limit the scope of the invention. Those skilled in the art should also understand that the above embodiments can be modified without departing from the scope and spirit of the invention. The scope of the invention is defined by the appended claims.

Claims (10)

1. A clustering method for examination questions, characterized by comprising:
selecting a cluster center question from all the examination questions participating in clustering;
determining the important key characters of the cluster center question, recorded as a first character string, and determining the important key characters of a question to be clustered, recorded as a second character string, the important key characters being characters whose addition, replacement, or modification can change the meaning or type of a question;
calculating the weighted edit distance between the first character string and the second character string, the weighted edit distance being the minimum weighted number of operations needed to convert one of the first character string and the second character string into the other;
calculating the similarity between the question to be clustered and the cluster center question according to the weighted edit distance, wherein the similarity r is calculated as:
r = (sum - dist) / sum, where sum is the sum of the lengths of the first character string and the second character string, and dist is the weighted edit distance;
assigning the question to be clustered whose similarity is greater than a preset threshold to the same question class as the cluster center question.
2. The clustering method according to claim 1, characterized in that,
before the step of selecting a cluster center question from all the examination questions participating in clustering, the method further comprises:
unifying the format of the questions, which comprises:
classifying and parsing htm question files that contain different character formats or formula pictures, and converting them into LaTeX question text;
converting the LaTeX question text into a text format that can be read normally.
3. The clustering method according to claim 1, characterized in that
the step of selecting a cluster center question from all the examination questions participating in clustering specifically comprises:
ranking all the examination questions participating in clustering according to the creation time and the quality evaluation of the questions;
selecting the question ranked first as the cluster center question.
4. The clustering method according to claim 1, characterized in that
the step of determining the important key characters of the cluster center question, recorded as the first character string, and determining the important key characters of the question to be clustered, recorded as the second character string, comprises:
constructing an important keyword library using a term frequency-inverse document frequency model;
determining the first character string and the second character string according to the important keyword library.
5. The clustering method according to claim 1, characterized in that
the operations of the weighted edit distance comprise insertion, deletion, and replacement, wherein, when counting weighted operations, a deletion counts as one operation, an insertion counts as one operation, and a replacement counts as two operations.
6. A deduplication method for examination questions, characterized by comprising:
clustering the questions in a question group to be deduplicated using the clustering method for examination questions according to any one of claims 1 to 5;
deleting the questions that belong to the same question class as the cluster center question.
7. A clustering system for examination questions, characterized by comprising: a cluster center question selection module, an important key character determination module, a weighted edit distance calculation module, a similarity calculation module, and a question classification module, wherein:
the cluster center question selection module is connected to the important key character determination module and is configured to select a cluster center question from all the examination questions participating in clustering and to send the selected cluster center question to the important key character determination module;
the important key character determination module is connected to the weighted edit distance calculation module and is configured to determine the important key characters of the cluster center question, recorded as a first character string, to determine the important key characters of a question to be clustered, recorded as a second character string, and to send the first character string and the second character string to the weighted edit distance calculation module, the important key characters being characters whose addition, replacement, or modification can change the meaning or type of a question;
the weighted edit distance calculation module is connected to the similarity calculation module and is configured to calculate the weighted edit distance between the first character string and the second character string and to send the weighted edit distance to the similarity calculation module, the weighted edit distance being the minimum weighted number of operations needed to convert one of the first character string and the second character string into the other;
the similarity calculation module is connected to the question classification module and is configured to calculate the similarity between the question to be clustered and the cluster center question according to the weighted edit distance and to send the similarity to the question classification module, wherein the similarity r is calculated as r = (sum - dist) / sum, where sum is the sum of the lengths of the first character string and the second character string, and dist is the weighted edit distance;
the question classification module is configured to assign the question to be clustered whose similarity is greater than a preset threshold to the same question class as the cluster center question.
8. The clustering system according to claim 7, characterized in that
it further comprises a format unification module, the format unification module being connected to both the cluster center question selection module and the important key character determination module and being configured to classify and parse htm question files that contain different character formats or formula pictures and convert them into LaTeX question text, and further configured to convert the LaTeX question text into a text format that can be read normally.
9. The clustering system according to claim 7, characterized in that
the important key character determination module further comprises a character library construction module and a character string determination module, wherein
the character library construction module is configured to construct an important keyword library using the term frequency-inverse document frequency technique;
the character string determination module is configured to determine the first character string and the second character string according to the important keyword library.
10. A deduplication system for examination questions, characterized by comprising the clustering system for examination questions according to any one of claims 7 to 9 and further comprising a question deduplication module, the question deduplication module being connected to the question classification module and being configured to receive the classification results sent by the question classification module and to delete the questions that belong to the same question class as the cluster center question.
CN201910680927.7A 2019-07-26 2019-07-26 A kind of clustering method of examination question, De-weight method and system Pending CN110390019A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910680927.7A CN110390019A (en) 2019-07-26 2019-07-26 A kind of clustering method of examination question, De-weight method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910680927.7A CN110390019A (en) 2019-07-26 2019-07-26 A kind of clustering method of examination question, De-weight method and system

Publications (1)

Publication Number Publication Date
CN110390019A true CN110390019A (en) 2019-10-29

Family

ID=68287599

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910680927.7A Pending CN110390019A (en) 2019-07-26 2019-07-26 A kind of clustering method of examination question, De-weight method and system

Country Status (1)

Country Link
CN (1) CN110390019A (en)

Legal Events

Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (application publication date: 20191029)