CN111488728A - Labeling method, device and storage medium for unstructured test question data - Google Patents
- Publication number
- CN111488728A (application number CN202010169346.XA)
- Authority
- CN
- China
- Prior art keywords
- labeling
- test question
- data
- unstructured
- question data
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/10—Text processing
- G06F40/12—Use of codes for handling textual entities
- G06F40/14—Tree-structured documents
- G06F40/143—Markup, e.g. Standard Generalized Markup Language [SGML] or Document Type Definition [DTD]
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/049—Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
Abstract
The invention discloses a labeling method, device and storage medium for unstructured test question data. The method comprises the following steps: extracting a plurality of unlabeled data items from a test question data set by bootstrap sampling and preprocessing them; inputting the preprocessed unlabeled data into a deep learning network, together with labeled data for correction; feeding the outputs of the deep learning network into at least two base classifiers of different types for ensemble learning, where each base classifier comprises a plurality of weak learners of the same type and a strong learner; and constructing a transition probability matrix from the outputs of all base classifiers, then solving the matrix to generate the labeling result. The invention improves labeling accuracy, effectively solves the labeling of unstructured data such as test papers, and on that basis enables automatic warehousing of unstructured text, saving a large amount of manual labor.
Description
Technical Field
The invention relates to the technical field of education informatization, and in particular to a method, a device and a storage medium for labeling unstructured test question data.
Background
With the rapid development of artificial intelligence, more and more data need to be labeled. For example, for an artificial intelligence program to recognize whether a photo shows a panda, a batch of photos containing pandas must first be collected and each photo labeled manually as panda or not panda; the labeled content and photo data are then fed to the program for deep learning.
Structured data: data that can be represented and stored in two-dimensional form using a relational database. Its general characteristic is that data is organized in rows, each row representing the information of one entity, and every row having the same attributes, such as ID number, name, age and gender.
Unstructured data refers to data whose structure is irregular or incomplete, which has no predefined data model, and which is inconvenient to represent in a two-dimensional database logic table, such as Word documents, plain text, pictures, XML, HTML, various reports, images and audio/video information.
In general, when new unstructured data is obtained, such as a test question in Word format or course-related material in text format, the resource must be entered into a repository, which is mainly done as follows:
First, an experienced teacher composes the relevant test questions and test papers, generally originally in Word. After the questions are composed, the text is entered into the application system from Word, generally by manual operation: the title, answer, analysis and other contents of each question must be typed in, while related attributes such as subject, grade, semester, textbook version and knowledge points are selected at the same time, as shown in fig. 1.
However, this method has the following disadvantages:
(1) Structured test question resources are few, while unstructured test question resources are many. A structured test question resource means that the question stem, answer and analysis are stored separately in a database, so that they can conveniently be called by an application program, an Android or iOS app, or a front-end web page.
(2) Heavy manual labor is error-prone and costly. In the prior art, when a Word or text document is warehoused, the test question resource is copied into the warehousing application program manually and then stored through that application. This requires a lot of manpower, and is inefficient, costly and prone to errors.
(3) In data labeling, existing natural language processing applications mainly tag words; the Stanford tagger, n-gram taggers and the Brill tagger are currently in common use. Such taggers, however, label individual words: which word is a noun, which is a verb, which tokens are numbers, and so on. Test paper text instead requires marking where the question stem is, where the question is, where the options are and where the answer is, and the text is long; no good solution exists for this at present.
Therefore, how to use artificial intelligence technology to quickly convert existing unstructured test question resources into structured test question resources is the technical problem this application seeks to solve.
Disclosure of Invention
The present invention aims to solve at least one of the technical problems in the prior art, and provides a method, a device and a storage medium for labeling unstructured test question data.
An embodiment of the first aspect of the invention provides a method for labeling unstructured test question data, comprising the following steps:
extracting a plurality of unlabeled data items from a test question data set by a bootstrap sampling method, for preprocessing;
inputting the preprocessed plurality of unlabeled data items into a deep learning network, and inputting labeled data into the deep learning network for correction;
respectively inputting output results of the deep learning network into at least two base classifiers of different types for ensemble learning, wherein each base classifier comprises a plurality of weak learners of the same type and a strong learner;
and constructing a transition probability matrix from the output data of all the base classifiers, solving the transition probability matrix, and generating a labeling result.
According to an embodiment of the first aspect of the present invention, the deep learning network is a BiLSTM network.
According to an embodiment of the first aspect of the invention, the base classifiers comprise: a base classifier based on conditional random fields, a base classifier based on a structured support vector machine, and a base classifier based on a maximum-margin Markov network.
According to an embodiment of the first aspect of the invention, the transition probability matrix is solved using the Viterbi algorithm.
According to an embodiment of the first aspect of the invention, the preprocessing comprises: extracting the test question text, segmenting the test question text, and vectorizing the test question text.
An embodiment of the second aspect of the present invention provides a device for labeling unstructured test question data, comprising: at least one control processor and a memory communicatively connected to the at least one control processor; the memory stores instructions executable by the at least one control processor to enable it to perform the method for labeling unstructured test question data described above.
An embodiment of the third aspect of the present invention provides a computer-readable storage medium storing computer-executable instructions for causing a computer to execute the method for labeling unstructured test question data described above.
The labeling method, device and storage medium for unstructured test question data provided by the embodiments of the invention have the following beneficial effects:
The method provides a new combined model that couples a deep learning network with at least two base classifiers of different types; the combination improves labeling accuracy. It effectively solves the labeling of unstructured data such as test papers, and on that basis can automate the warehousing of unstructured text, saving a large amount of manual labor. The method also differs from ordinary word tagging in that its labeling rules are applied directly to the raw data, whereas ordinary word tagging generally records labels in a dictionary format and requires the data to be rearranged and recombined; as a result, the early-stage manual annotation can be done more quickly and efficiently.
Further features and advantages realized by the embodiments of the present disclosure will be set forth in the detailed description or may be learned by the practice of the embodiments.
Drawings
The invention is further described below with reference to the accompanying drawings and examples;
FIG. 1 is a prior art data entry interface;
fig. 2 is a schematic flowchart of a method for labeling unstructured test question data according to an embodiment of the present invention;
fig. 3 is a schematic diagram of a BRNN network structure according to an embodiment of the present invention;
fig. 4 is a data processing flow of a labeling method for unstructured test question data according to an embodiment of the present invention;
FIG. 5 is a schematic diagram of a training network structure of a labeling method for unstructured test question data according to an embodiment of the present invention;
fig. 6 is a schematic structural diagram of a labeling apparatus for unstructured test question data according to an embodiment of the present invention.
Detailed Description
The technical solutions of the embodiments of the present disclosure will be described clearly and completely with reference to the accompanying drawings, and it is to be understood that the described embodiments are only a part of the embodiments of the present disclosure, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments of the disclosure without making any creative effort, shall fall within the protection scope of the disclosure. It should be noted that the features of the embodiments and examples of the present disclosure may be combined with each other without conflict. In addition, the purpose of the drawings is to graphically supplement the description in the written portion of the specification so that a person can intuitively and visually understand each technical feature and the whole technical solution of the present disclosure, but it should not be construed as limiting the scope of the present disclosure.
Referring to fig. 2 and 3, an embodiment of the present invention provides a labeling method for unstructured test question data, including the following steps:
S100, extracting a plurality of unlabeled data items from the test question data set by bootstrap sampling, for preprocessing;
in this embodiment, the data in the test question data set generally includes a test question header part, a picture, a font and a test question stem part, so that the data is preprocessed, where the preprocessing specifically includes: extracting test question texts, segmenting the test question texts and vectorizing the test question texts. The step of extracting the test question text specifically comprises the following steps: after extracting the text through Python-docx, removing related information such as titles, pictures, fonts and the like, and only keeping related parts of the test questions; the test question text segmentation method specifically comprises the following steps: segmenting text contents according to paragraph symbols through a regular rule, and dividing a line with a plurality of answers into a plurality of answers through the regular rule; the test question text vectorization step specifically comprises: and (3) performing a concatenation (concat) operation on the vector of each word on the segmented text to convert a segment of characters into a text vector.
S200, inputting the preprocessed plurality of unlabeled data items into a deep learning network, and inputting the labeled data into the deep learning network for correction;
the deep learning network here can be chosen as a Bi L STM (Bi-directional L ong Short-term memory) network, a Bi L STM network is formed by combining a forward L STM (long and Short term memory network) and a backward L STM, the L STM model can better capture the dependency relationship of longer distance, because L STM can learn which information is memorized and which information is forgotten through a training process, but the output of the next moment can be predicted only according to the timing information of the previous moment by using RNN (recurrent neural network) and L STM, but in some problems, the output of the current moment is not only related to the previous state but also possibly related to the future state, for example, the prediction of a missing word in a sentence needs to be judged not only according to the foregoing, but also needs to consider the content behind it, really does the judgment based on the context.
S300, respectively inputting the output results of the deep learning network into at least two base classifiers of different types for ensemble learning, wherein each base classifier comprises a plurality of weak learners of the same type and a strong learner;
Here, a strong learner can be obtained by integrating weak learners of the same type under a combination strategy; weak learners and strong learners are well known in the art and are not described in detail. The base classifiers may include, but are not limited to: a base classifier based on conditional random fields, a base classifier based on a structured support vector machine, and a base classifier based on a maximum-margin Markov network. At least two of these base classifiers are selected for effective fusion; this embodiment preferably uses all three.
Labeling algorithms based on neural network models have known drawbacks in this field: when the data scale is small, traditional labeling algorithms perform slightly better than neural ones; and when the task has little need for long-distance dependencies, a neural labeling algorithm only adds model complexity without effectively improving performance. The base classifiers likewise have shortcomings, for example: a hidden Markov model depends only on each state and its corresponding observation, and its objective function does not match the prediction function; a conditional random field model solves the label bias problem, but the model becomes correspondingly complex. To effectively mitigate these problems, the method fuses deep learning with the sequence labeling algorithms of at least two classifiers of different types through steps S200 and S300, improving the accuracy of labeling.
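As a rough illustration of the weak-to-strong integration described above, the sketch below shows majority voting within one base classifier and arithmetic-mean fusion across strong learners; the `predict`/`predict_proba` interface is an assumption for illustration, not an API named in the patent.

```python
import numpy as np

def strong_predict(weak_learners, X):
    """Majority vote across weak learners of one type, forming the
    strong learner of that base classifier."""
    votes = np.stack([wl.predict(X) for wl in weak_learners])  # (k, n) int labels
    return np.apply_along_axis(lambda col: np.bincount(col).argmax(), 0, votes)

def combine_probabilities(strong_learners, X):
    """Arithmetic mean of the per-class probabilities of the strong
    learners (CRF-, SVM- and M3N-based in the preferred embodiment);
    the averaged rows later feed the transition probability matrix."""
    probs = np.stack([sl.predict_proba(X) for sl in strong_learners])  # (m, n, K)
    return probs.mean(axis=0)
```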
S400, constructing a transition probability matrix from the output data of all the base classifiers, solving the transition probability matrix, and generating the labeling result.
The transition probability matrix can be solved with a softmax (logistic regression) layer or with the Viterbi algorithm; the Viterbi algorithm, with its lower computational complexity, is preferred.
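A standard Viterbi decoder over a row-stochastic transition matrix might look like the sketch below; the log-space formulation and variable names are ours, not prescribed by the patent.

```python
import numpy as np

def viterbi(emissions, transitions, start):
    """emissions: (T, K) per-segment class probabilities (e.g. the
    arithmetic-mean outputs of the strong learners); transitions: (K, K)
    row-stochastic matrix; start: (K,) initial distribution.
    Returns the most probable label sequence in O(T * K^2) time."""
    T, K = emissions.shape
    log_trans = np.log(transitions + 1e-12)
    delta = np.log(start + 1e-12) + np.log(emissions[0] + 1e-12)
    back = np.zeros((T, K), dtype=int)           # best predecessor per state
    for t in range(1, T):
        cand = delta[:, None] + log_trans        # cand[i, j]: prev i -> cur j
        back[t] = cand.argmax(axis=0)
        delta = cand.max(axis=0) + np.log(emissions[t] + 1e-12)
    path = [int(delta.argmax())]                 # best final state
    for t in range(T - 1, 0, -1):                # follow back-pointers
        path.append(int(back[t, path[-1]]))
    return path[::-1]
```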
The method of this embodiment adopts Bagging (bootstrap aggregating). For example, in step S100, n unlabeled items are first drawn from the test question data set by bootstrap sampling; the n items of one round form one training set, and k rounds of extraction yield k training sets. Because the sampling is random, the training sets differ from one another and are mutually independent. Then, in steps S200 and S300, the training sets are fed into the deep learning network for training and afterwards into the base classifiers for learning. In each round of random sampling, some items of the data set may be drawn repeatedly while others are never drawn; the never-drawn items take no part in fitting the training model and can instead be used to test the model's generalization ability. Each weak learner within a base classifier votes to obtain an optimal value, which safeguards the final labeling quality. Finally, a combination strategy of arithmetic mean is applied to all predicted output values of each strong learner, and a transition probability matrix is generated; all its elements are non-negative, each row sums to 1, and every element is expressed as a probability. Solving the transition probability matrix yields the probability that a given text segment belongs to each category, i.e. the labeling result.
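The bootstrap rounds described above can be sketched as follows; the `train` callable and the data representation are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def bagging_rounds(dataset, k, train):
    """k rounds of bootstrap sampling: each round draws len(dataset) items
    with replacement, trains one model, and records the out-of-bag items."""
    models, oob_sets = [], []
    n = len(dataset)
    for _ in range(k):
        idx = rng.integers(0, n, size=n)         # sample n items with replacement
        oob = np.setdiff1d(np.arange(n), idx)    # items never drawn this round
        models.append(train([dataset[i] for i in idx]))
        oob_sets.append(oob)                     # used to probe generalization
    return models, oob_sets
```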
The method provides a new combined model that couples a deep learning network with at least two base classifiers of different types; the combination improves labeling accuracy.
It effectively solves the labeling of unstructured data such as test papers, and on that basis can automate the warehousing of unstructured text, saving a large amount of manual labor.
The method of this embodiment differs from ordinary word tagging in that the labeling rules can be applied directly to the raw data, whereas ordinary word tagging generally labels in a dictionary format and requires the data to be rearranged and recombined.
Referring to fig. 4 and 5, an embodiment of the present invention provides a labeling method for unstructured test question data.
As shown in fig. 4, the original data is divided into three parts: the first part is unlabeled data drawn by bootstrap sampling; the second part is manually labeled data; the third part is the unlabeled data to be processed. The first and second parts are used for training to generate a model, and the model finally labels the third part (the unstructured text data) automatically.
As shown in FIG. 5, the training model is built by combining the deep learning algorithm BiLSTM with a conditional random field (CRF), a maximum-margin Markov network (M3N), a structured support vector machine (SVM) and the Bagging method, and the Viterbi algorithm is finally used to obtain the multi-classification result, i.e. which category each text segment belongs to.
The specific process is as follows:
First, after vectorization, the unlabeled data is input into the BiLSTM network, and the labeled data is used for correction. The output results of the BiLSTM network are then fed as inputs into the CRF, SVM and M3N weak learners respectively, and the three base classifiers each perform ensemble learning. Because bootstrap sampling is used to prepare the input data for the different weak learners, every randomly drawn sample set is different, so different weak learners are obtained; weak learners of the same type, once integrated under a combination strategy, form the corresponding strong learner. In each round of random sampling, the items that are never drawn take no part in fitting the training model and can be used to test the model's generalization ability. Finally, the combination strategy of arithmetic mean is applied to obtain all predicted output values of each strong learner; these values generate a transition probability matrix whose elements are all non-negative probabilities with each row summing to 1, and solving the matrix with the Viterbi algorithm yields the final multi-classification result, namely the category of each text segment.
The new combined model provided by this method couples a BiLSTM deep learning network with a conditional random field, a maximum-margin Markov network and a structured support vector machine; the combination improves labeling accuracy.
The method effectively solves the labeling of unstructured data such as test papers, and on that basis can automate the warehousing of unstructured text, saving a large amount of manual labor.
The method of this embodiment differs from ordinary word tagging in that the labeling rules can be applied directly to the raw data, whereas ordinary word tagging generally labels in a dictionary format and requires the data to be rearranged and recombined. With this method, the early-stage manual annotation can be done more quickly and efficiently.
The test question labeling rules and a labeling example, applicable to any of the above embodiments, are given below.
I. Labeling rules:
the labeling of test questions and test papers aims at realizing the labeling of each component of the test papers, and the labeling method is different from the common word labeling method in that the labeling can be directly performed on original data through the labeling rule, while the common word labeling method generally uses dictionary format recording and needs to rearrange the data.
Test paper title begins: [SJBT]
Test paper title ends: [/SJBT]
Test paper description begins: [SJSM]
Test paper description ends: [/SJSM]
Question-type description begins: [TX]
Question-type description ends: [/TX]
Question stem begins: [TG]
Question stem ends: [/TG]
Question begins: [WT]
Question ends: [/WT]
Knowledge point description begins: [ZSD]
Knowledge point description ends: [/ZSD]
Choice option 1 begins: [XZ1]
Choice option 1 ends: [/XZ1]
Choice option 2 begins: [XZ2]
Choice option 2 ends: [/XZ2]
..., and so on
Answer begins: [DA]
Answer ends: [/DA]
Analysis begins: [JX]
Analysis ends: [/JX]
Other tags:
Test question begins: [ST]
Test question ends: [/ST]
Question-type tags:
Single-choice question: [DANX/]
Multiple-choice question: [DUOX/]
True-or-false question: [PDT/]
Fill-in-the-blank question: [TKT/]
Short-answer question: [JDT/]
Cloze (gap-filling) question: [WXTK/]
Reading comprehension: [YDLJ/]
II. Labeling example:
(1) Example of original test paper content:
Test questions for grade 3 of a primary school in a certain city:
I. Choice questions.
1. Which of these are odd?
A、4  B、5  C、8
Answer: B
Analysis: a number that cannot be divided exactly by 2 is odd;
(2) Example of the labeled test paper:
[SJBT]Test questions for grade 3 of a primary school in a certain city:[/SJBT]
[TX]I. Choice questions.[/TX]
[ST]
[DANX/]
[ZSD]odd and even numbers[/ZSD]
[WT]1. Which of these are odd?[/WT]
[XZ1]A、4[/XZ1][XZ2]B、5[/XZ2][XZ3]C、8[/XZ3]
[DA]Answer: B[/DA]
[JX]Analysis: a number not divisible by 2 is odd[/JX]
[/ST]
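A sketch of recovering structured fields from text labeled under the above rules is shown below; the tag scheme follows the rules given earlier, while the parsing helper itself is our illustration, not part of the patent.

```python
import re

PAIRED_TAGS = ["SJBT", "SJSM", "TX", "TG", "WT", "ZSD", "DA", "JX", "ST"]

def parse_tagged(text):
    """Return {tag: [contents]} for paired tags like [WT]...[/WT]."""
    fields = {}
    for tag in PAIRED_TAGS:
        pattern = re.compile(r'\[%s\](.*?)\[/%s\]' % (tag, tag), re.S)
        fields[tag] = [m.strip() for m in pattern.findall(text)]
    # numbered option tags [XZ1], [XZ2], ... are matched generically
    fields["XZ"] = [m.strip() for m in
                    re.findall(r'\[XZ\d+\](.*?)\[/XZ\d+\]', text, re.S)]
    return fields

sample = "[WT]1. Which of these are odd?[/WT][XZ1]A、4[/XZ1][XZ2]B、5[/XZ2]"
fields = parse_tagged(sample)
print(fields["WT"])  # ['1. Which of these are odd?']
print(fields["XZ"])  # ['A、4', 'B、5']
```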
Referring to fig. 6, an embodiment of the present invention further provides a labeling device for unstructured test question data, where the labeling device for unstructured test question data may be any type of intelligent terminal, such as a mobile phone, a tablet computer, a personal computer, and the like.
Specifically, the device for labeling unstructured test question data comprises: one or more control processors and memory, one control processor being exemplified in fig. 6. The control processor and the memory may be connected by a bus or other means, as exemplified by the bus connection in fig. 6.
The memory, as a non-transitory computer-readable storage medium, can store non-transitory software programs, non-transitory computer-executable programs and modules, such as the program instructions/modules corresponding to the labeling device for unstructured test question data in the embodiment of the present invention; by running the non-transitory software programs, instructions and modules stored in the memory, the control processor implements the method for labeling unstructured test question data of the above method embodiment.
The memory may include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function; the storage data area may store the generated data. Further, the memory may include high speed random access memory, and may also include non-transitory memory, such as at least one disk storage device, flash memory device, or other non-transitory solid state storage device. In some embodiments, the memory optionally includes a memory remotely located from the control processor, and the remote memories may be connected to the annotating device of the unstructured test question data via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
One or more modules are stored in the memory and, when executed by the one or more control processors, perform the method of annotating unstructured test question data in the above-described method embodiments, e.g., performing the method steps S100-S400 in fig. 2 described above.
Embodiments of the present invention also provide a computer-readable storage medium storing computer-executable instructions, which are executed by one or more control processors, for example, by one of the control processors in fig. 6, and can cause the one or more control processors to execute the method for labeling unstructured test question data in the above method embodiments, for example, to execute the above-described method steps S100-S400 in fig. 2.
The above-described embodiments of the apparatus are merely illustrative, wherein the units illustrated as separate components may or may not be physically separate, i.e. may be located in one place, or may also be distributed over a plurality of network elements. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment.
Through the above description of the embodiments, those skilled in the art can clearly understand that the embodiments can be implemented by software plus a general hardware platform. Those skilled in the art will appreciate that all or part of the processes of the methods of the above embodiments may be implemented by hardware related to instructions of a computer program, which may be stored in a computer readable storage medium, and when executed, may include the processes of the embodiments of the methods described above. The storage medium may be a magnetic disk, an optical disk, a Read Only Memory (ROM), a Random Access Memory (RAM), or the like.
The embodiments of the present invention have been described in detail with reference to the accompanying drawings, but the present invention is not limited to the above embodiments, and various changes can be made within the knowledge of those skilled in the art without departing from the gist of the present invention.
Claims (7)
1. A labeling method for unstructured test question data is characterized by comprising the following steps:
extracting a plurality of unlabeled data items from a test question data set by a bootstrap sampling method, for preprocessing;
inputting the preprocessed plurality of unlabeled data items into a deep learning network, and inputting labeled data into the deep learning network for correction;
respectively inputting output results of the deep learning network into at least two base classifiers of different types for ensemble learning, wherein each base classifier comprises a plurality of weak learners of the same type and a strong learner;
and constructing a transition probability matrix from the output data of all the base classifiers, solving the transition probability matrix, and generating a labeling result.
2. The method for labeling unstructured test question data according to claim 1, characterized in that the deep learning network is a BiLSTM network.
3. The method for labeling unstructured test question data according to claim 1 or 2, characterized in that the base classifiers comprise: a base classifier based on conditional random fields, a base classifier based on a structured support vector machine, and a base classifier based on a maximum-margin Markov network.
4. The method for labeling unstructured test question data according to claim 3, characterized in that the transition probability matrix is solved using the Viterbi algorithm.
5. The method for labeling unstructured test question data according to claim 3, characterized in that the preprocessing comprises: extracting test question texts, segmenting the test question texts and vectorizing the test question texts.
6. An apparatus for labeling unstructured test question data, comprising: at least one control processor and a memory for communicative connection with the at least one control processor; the memory stores instructions executable by the at least one control processor to enable the at least one control processor to perform a method of annotating unstructured test question data according to any one of claims 1 to 5.
7. A computer-readable storage medium storing computer-executable instructions for causing a computer to perform a method of labeling unstructured test question data according to any one of claims 1 to 5.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title
---|---|---|---
CN202010169346.XA | 2020-03-12 | 2020-03-12 | Labeling method, device and storage medium for unstructured test question data
Publications (1)
Publication Number | Publication Date
---|---
CN111488728A | 2020-08-04
Family
ID=71797632
Family Applications (1)
Application Number | Title | Priority Date | Filing Date
---|---|---|---
CN202010169346.XA | CN111488728A (en), pending | 2020-03-12 | 2020-03-12
Country Status (1)
Country | Link
---|---
CN | CN111488728A (en)
Patent Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20180114142A1 (en) * | 2016-10-26 | 2018-04-26 | Swiss Reinsurance Company Ltd. | Data extraction engine for structured, semi-structured and unstructured data with automated labeling and classification of data patterns or data elements therein, and corresponding method thereof |
US10380236B1 (en) * | 2017-09-22 | 2019-08-13 | Amazon Technologies, Inc. | Machine learning system for annotating unstructured text |
CN109408812A (en) * | 2018-09-30 | 2019-03-01 | 北京工业大学 | A method of the sequence labelling joint based on attention mechanism extracts entity relationship |
Non-Patent Citations (7)
Title |
---|
Wu Jiale, "Application of heterogeneous ensemble learners in iris flower classification", China Equipment Engineering |
Wu Xiaohuan; Fu Jinqiang; Zhou Xunhai; Peng Gang; Chen Min; Ruan Jing, "Research on fault prediction methods for production processes based on industrial data", Smart Factory, no. 11 |
Fang Hao et al., "Research on an improved hidden Markov model based on semantic relations", Communications Technology, no. 05, 10 May 2008 |
Zhu Pengfei, "Research on automatic Chinese semantic role labeling based on Bi-LSTM", China Masters' Theses Full-text Database, 15 August 2019, pages 13-14 |
Guo Chonghui et al., "A multi-knowledge-point labeling method for test questions based on ensemble learning", Operations Research and Management Science, no. 02, 25 February 2020 |
Jin Zhigang, "A sentiment analysis model combining deep learning and ensemble learning", Journal of Harbin Institute of Technology |
Tao Jiadong, "A mental workload recognition model based on Bagging and extreme learning machines", Software Guide |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112015903A (en) * | 2020-10-22 | 2020-12-01 | Guangzhou Huaduo Network Technology Co., Ltd. | Question duplication judging method and device, storage medium and computer equipment |
Legal Events
Date | Code | Title | Description
---|---|---|---
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| RJ01 | Rejection of invention patent application after publication | Application publication date: 20200804 |