CN111488728A - Labeling method, device and storage medium for unstructured test question data - Google Patents
- Publication number
- CN111488728A (application number CN202010169346.XA)
- Authority
- CN
- China
- Prior art keywords
- labeling
- test question
- data
- unstructured
- question data
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/10—Text processing
- G06F40/12—Use of codes for handling textual entities
- G06F40/14—Tree-structured documents
- G06F40/143—Markup, e.g. Standard Generalized Markup Language [SGML] or Document Type Definition [DTD]
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/049—Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
Abstract
The invention discloses a labeling method, device and storage medium for unstructured test question data. The method comprises the following steps: extracting a plurality of unlabeled data items from a test question data set by bootstrap sampling and preprocessing them; inputting the preprocessed unlabeled data into a deep learning network, together with labeled data for correction; feeding the outputs of the deep learning network into at least two base classifiers of different types for ensemble learning, where each base classifier comprises a plurality of weak learners of the same type and a strong learner; and constructing a transition probability matrix from the outputs of all base classifiers, then solving the matrix to generate the labeling result. The invention improves labeling accuracy, effectively solves the labeling of unstructured data such as test papers, and on that basis enables automatic warehousing of unstructured text, saving a large amount of manual labor.
Description
Technical Field
The invention relates to the technical field of education informatization, and in particular to a method, a device and a storage medium for labeling unstructured test question data.
Background
With the rapid development of artificial intelligence, more and more data need to be labeled. For example, for an artificial intelligence program to recognize whether a photo shows a panda, a batch of photos containing pandas must first be collected and each photo labeled manually as panda or not panda; the labeled content and photo data are then fed to the program for deep learning.
Structured data: data that can be represented and stored in two-dimensional form using a relational database. Its general characteristic is that data is organized in rows, each row representing the information of one entity, and every row having the same attributes, such as ID number, name, age and gender.
Unstructured data refers to data whose structure is irregular or incomplete, which has no predefined data model, and which is inconvenient to represent in a two-dimensional database logic table, such as Word documents, plain text, pictures, XML, HTML, various reports, images and audio/video information.
In general, when new unstructured data is obtained, such as a test question in Word format or course-related material in text format, the resource must be entered into a repository, which is mainly done as follows:
First, an experienced teacher composes the relevant test questions and test papers, generally originally in Word. After the questions are composed, the text is entered into the application system from Word, generally by manual operation: the title, answer, analysis and other contents of each question must be typed in, while related attributes such as subject, grade, semester, textbook version and knowledge points are selected at the same time, as shown in fig. 1.
However, this method has the following disadvantages:
(1) Structured test question resources are few, while unstructured test question resources are many. A structured test question resource means that the question stem, answer and analysis are stored separately in a database, so that they can conveniently be called by an application program, an Android or iOS app, or a front-end web page.
(2) Heavy manual labor is error-prone and costly. In the prior art, when a Word or text document is warehoused, the test question resource is copied into the warehousing application program manually and then stored through that application. This requires a lot of manpower, and is inefficient, costly and prone to errors.
(3) In data labeling, existing natural language processing applications mainly tag words; the Stanford tagger, n-gram taggers and the Brill tagger are currently in common use. Such taggers, however, label individual words: which word is a noun, which is a verb, which tokens are numbers, and so on. Test paper text instead requires marking where the question stem is, where the question is, where the options are and where the answer is, and the text is long; no good solution exists for this at present.
Therefore, how to use artificial intelligence technology to quickly convert existing unstructured test question resources into structured test question resources is the technical problem this application seeks to solve.
Disclosure of Invention
The present invention aims to solve at least one of the technical problems in the prior art, and provides a method, a device and a storage medium for labeling unstructured test question data.
An embodiment of the first aspect of the invention provides a method for labeling unstructured test question data, comprising the following steps:
extracting a plurality of unlabeled data items from a test question data set by a bootstrap sampling method, for preprocessing;
inputting the preprocessed plurality of unlabeled data items into a deep learning network, and inputting labeled data into the deep learning network for correction;
respectively inputting output results of the deep learning network into at least two base classifiers of different types for ensemble learning, wherein each base classifier comprises a plurality of weak learners of the same type and a strong learner;
and constructing a transition probability matrix from the output data of all the base classifiers, solving the transition probability matrix, and generating a labeling result.
According to an embodiment of the first aspect of the present invention, the deep learning network is a BiLSTM network.
According to an embodiment of the first aspect of the invention, the base classifiers comprise: a base classifier based on conditional random fields, a base classifier based on a structured support vector machine, and a base classifier based on a maximum-margin Markov network.
According to an embodiment of the first aspect of the invention, the transition probability matrix is solved using the Viterbi algorithm.
According to an embodiment of the first aspect of the invention, the preprocessing comprises: extracting the test question text, segmenting the test question text, and vectorizing the test question text.
An embodiment of the second aspect of the present invention provides a device for labeling unstructured test question data, comprising: at least one control processor and a memory communicatively connected to the at least one control processor; the memory stores instructions executable by the at least one control processor to enable it to perform the method for labeling unstructured test question data described above.
An embodiment of the third aspect of the present invention provides a computer-readable storage medium storing computer-executable instructions for causing a computer to execute the method for labeling unstructured test question data described above.
The labeling method, device and storage medium for unstructured test question data provided by the embodiments of the invention have the following beneficial effects:
The method provides a new combined model that couples a deep learning network with at least two base classifiers of different types; the combination improves labeling accuracy. It effectively solves the labeling of unstructured data such as test papers, and on that basis can automate the warehousing of unstructured text, saving a large amount of manual labor. The method also differs from ordinary word tagging in that its labeling rules are applied directly to the raw data, whereas ordinary word tagging generally records labels in a dictionary format and requires the data to be rearranged and recombined; as a result, the early-stage manual annotation can be done more quickly and efficiently.
Further features and advantages realized by the embodiments of the present disclosure will be set forth in the detailed description or may be learned by the practice of the embodiments.
Drawings
The invention is further described below with reference to the accompanying drawings and examples;
FIG. 1 is a prior art data entry interface;
fig. 2 is a schematic flowchart of a method for labeling unstructured test question data according to an embodiment of the present invention;
fig. 3 is a schematic diagram of a BRNN network structure according to an embodiment of the present invention;
fig. 4 is a data processing flow of a labeling method for unstructured test question data according to an embodiment of the present invention;
FIG. 5 is a schematic diagram of a training network structure of a labeling method for unstructured test question data according to an embodiment of the present invention;
fig. 6 is a schematic structural diagram of a labeling apparatus for unstructured test question data according to an embodiment of the present invention.
Detailed Description
The technical solutions of the embodiments of the present disclosure will be described clearly and completely with reference to the accompanying drawings, and it is to be understood that the described embodiments are only a part of the embodiments of the present disclosure, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments of the disclosure without making any creative effort, shall fall within the protection scope of the disclosure. It should be noted that the features of the embodiments and examples of the present disclosure may be combined with each other without conflict. In addition, the purpose of the drawings is to graphically supplement the description in the written portion of the specification so that a person can intuitively and visually understand each technical feature and the whole technical solution of the present disclosure, but it should not be construed as limiting the scope of the present disclosure.
Referring to fig. 2 and 3, an embodiment of the present invention provides a labeling method for unstructured test question data, including the following steps:
S100, extracting a plurality of unlabeled data items from the test question data set by bootstrap sampling, for preprocessing;
in this embodiment, the data in the test question data set generally includes a test question header part, a picture, a font and a test question stem part, so that the data is preprocessed, where the preprocessing specifically includes: extracting test question texts, segmenting the test question texts and vectorizing the test question texts. The step of extracting the test question text specifically comprises the following steps: after extracting the text through Python-docx, removing related information such as titles, pictures, fonts and the like, and only keeping related parts of the test questions; the test question text segmentation method specifically comprises the following steps: segmenting text contents according to paragraph symbols through a regular rule, and dividing a line with a plurality of answers into a plurality of answers through the regular rule; the test question text vectorization step specifically comprises: and (3) performing a concatenation (concat) operation on the vector of each word on the segmented text to convert a segment of characters into a text vector.
S200, inputting the preprocessed plurality of unlabeled data items into a deep learning network, and inputting the labeled data into the deep learning network for correction;
the deep learning network here can be chosen as a Bi L STM (Bi-directional L ong Short-term memory) network, a Bi L STM network is formed by combining a forward L STM (long and Short term memory network) and a backward L STM, the L STM model can better capture the dependency relationship of longer distance, because L STM can learn which information is memorized and which information is forgotten through a training process, but the output of the next moment can be predicted only according to the timing information of the previous moment by using RNN (recurrent neural network) and L STM, but in some problems, the output of the current moment is not only related to the previous state but also possibly related to the future state, for example, the prediction of a missing word in a sentence needs to be judged not only according to the foregoing, but also needs to consider the content behind it, really does the judgment based on the context.
S300, respectively inputting the output results of the deep learning network into at least two base classifiers of different types for ensemble learning, wherein each base classifier comprises a plurality of weak learners of the same type and a strong learner;
Here, a strong learner can be obtained by integrating weak learners of the same type under a combination strategy; weak learners and strong learners are well known in the art and are not described in detail. The base classifiers may include, but are not limited to: a base classifier based on conditional random fields, a base classifier based on a structured support vector machine, and a base classifier based on a maximum-margin Markov network. At least two of these base classifiers are selected for effective fusion; this embodiment preferably uses all three.
Labeling algorithms based on neural network models have known drawbacks in this field: when the data scale is small, traditional labeling algorithms perform slightly better than neural ones; and when the task has little need for long-distance dependencies, a neural labeling algorithm only adds model complexity without effectively improving performance. The base classifiers likewise have shortcomings, for example: a hidden Markov model depends only on each state and its corresponding observation, and its objective function does not match the prediction function; a conditional random field model solves the label bias problem, but the model becomes correspondingly complex. To effectively mitigate these problems, the method fuses deep learning with the sequence labeling algorithms of at least two classifiers of different types through steps S200 and S300, improving the accuracy of labeling.
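As a rough illustration of the weak-to-strong integration described above, the sketch below shows majority voting within one base classifier and arithmetic-mean fusion across strong learners; the `predict`/`predict_proba` interface is an assumption for illustration, not an API named in the patent.

```python
import numpy as np

def strong_predict(weak_learners, X):
    """Majority vote across weak learners of one type, forming the
    strong learner of that base classifier."""
    votes = np.stack([wl.predict(X) for wl in weak_learners])  # (k, n) int labels
    return np.apply_along_axis(lambda col: np.bincount(col).argmax(), 0, votes)

def combine_probabilities(strong_learners, X):
    """Arithmetic mean of the per-class probabilities of the strong
    learners (CRF-, SVM- and M3N-based in the preferred embodiment);
    the averaged rows later feed the transition probability matrix."""
    probs = np.stack([sl.predict_proba(X) for sl in strong_learners])  # (m, n, K)
    return probs.mean(axis=0)
```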
S400, constructing a transition probability matrix from the output data of all the base classifiers, solving the transition probability matrix, and generating the labeling result.
The transition probability matrix can be solved with a softmax (logistic regression) layer or with the Viterbi algorithm; the Viterbi algorithm, with its lower computational complexity, is preferred.
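A standard Viterbi decoder over a row-stochastic transition matrix might look like the sketch below; the log-space formulation and variable names are ours, not prescribed by the patent.

```python
import numpy as np

def viterbi(emissions, transitions, start):
    """emissions: (T, K) per-segment class probabilities (e.g. the
    arithmetic-mean outputs of the strong learners); transitions: (K, K)
    row-stochastic matrix; start: (K,) initial distribution.
    Returns the most probable label sequence in O(T * K^2) time."""
    T, K = emissions.shape
    log_trans = np.log(transitions + 1e-12)
    delta = np.log(start + 1e-12) + np.log(emissions[0] + 1e-12)
    back = np.zeros((T, K), dtype=int)           # best predecessor per state
    for t in range(1, T):
        cand = delta[:, None] + log_trans        # cand[i, j]: prev i -> cur j
        back[t] = cand.argmax(axis=0)
        delta = cand.max(axis=0) + np.log(emissions[t] + 1e-12)
    path = [int(delta.argmax())]                 # best final state
    for t in range(T - 1, 0, -1):                # follow back-pointers
        path.append(int(back[t, path[-1]]))
    return path[::-1]
```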
The method of this embodiment adopts Bagging (bootstrap aggregating). For example, in step S100, n unlabeled items are first drawn from the test question data set by bootstrap sampling; the n items of one round form one training set, and k rounds of extraction yield k training sets. Because the sampling is random, the training sets differ from one another and are mutually independent. Then, in steps S200 and S300, the training sets are fed into the deep learning network for training and afterwards into the base classifiers for learning. In each round of random sampling, some items of the data set may be drawn repeatedly while others are never drawn; the never-drawn items take no part in fitting the training model and can instead be used to test the model's generalization ability. Each weak learner within a base classifier votes to obtain an optimal value, which safeguards the final labeling quality. Finally, a combination strategy of arithmetic mean is applied to all predicted output values of each strong learner, and a transition probability matrix is generated; all its elements are non-negative, each row sums to 1, and every element is expressed as a probability. Solving the transition probability matrix yields the probability that a given text segment belongs to each category, i.e. the labeling result.
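The bootstrap rounds described above can be sketched as follows; the `train` callable and the data representation are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def bagging_rounds(dataset, k, train):
    """k rounds of bootstrap sampling: each round draws len(dataset) items
    with replacement, trains one model, and records the out-of-bag items."""
    models, oob_sets = [], []
    n = len(dataset)
    for _ in range(k):
        idx = rng.integers(0, n, size=n)         # sample n items with replacement
        oob = np.setdiff1d(np.arange(n), idx)    # items never drawn this round
        models.append(train([dataset[i] for i in idx]))
        oob_sets.append(oob)                     # used to probe generalization
    return models, oob_sets
```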
The method provides a new combined model that couples a deep learning network with at least two base classifiers of different types; the combination improves labeling accuracy.
It effectively solves the labeling of unstructured data such as test papers, and on that basis can automate the warehousing of unstructured text, saving a large amount of manual labor.
The method of this embodiment differs from ordinary word tagging in that the labeling rules can be applied directly to the raw data, whereas ordinary word tagging generally labels in a dictionary format and requires the data to be rearranged and recombined.
Referring to fig. 4 and 5, an embodiment of the present invention provides a labeling method for unstructured test question data.
As shown in fig. 4, the original data is divided into three parts: the first part is unlabeled data drawn by bootstrap sampling; the second part is manually labeled data; the third part is the unlabeled data to be processed. The first and second parts are used for training to generate a model, and the model finally labels the third part (the unstructured text data) automatically.
As shown in FIG. 5, the training model is built by combining the deep learning algorithm BiLSTM with a conditional random field (CRF), a maximum-margin Markov network (M3N), a structured support vector machine (SVM) and the Bagging method, and the Viterbi algorithm is finally used to obtain the multi-classification result, i.e. which category each text segment belongs to.
The specific process is as follows:
First, after vectorization, the unlabeled data is input into the BiLSTM network, and the labeled data is used for correction. The output results of the BiLSTM network are then fed as inputs into the CRF, SVM and M3N weak learners respectively, and the three base classifiers each perform ensemble learning. Because bootstrap sampling is used to prepare the input data for the different weak learners, every randomly drawn sample set is different, so different weak learners are obtained; weak learners of the same type, once integrated under a combination strategy, form the corresponding strong learner. In each round of random sampling, the items that are never drawn take no part in fitting the training model and can be used to test the model's generalization ability. Finally, the combination strategy of arithmetic mean is applied to obtain all predicted output values of each strong learner; these values generate a transition probability matrix whose elements are all non-negative probabilities with each row summing to 1, and solving the matrix with the Viterbi algorithm yields the final multi-classification result, namely the category of each text segment.
The new combined model provided by this method couples a BiLSTM deep learning network with a conditional random field, a maximum-margin Markov network and a structured support vector machine; the combination improves labeling accuracy.
The method effectively solves the labeling of unstructured data such as test papers, and on that basis can automate the warehousing of unstructured text, saving a large amount of manual labor.
The method of this embodiment differs from ordinary word tagging in that the labeling rules can be applied directly to the raw data, whereas ordinary word tagging generally labels in a dictionary format and requires the data to be rearranged and recombined. With this method, the early-stage manual annotation can be done more quickly and efficiently.
The test question labeling rules and a labeling example, applicable to any of the above embodiments, are given below.
I. Labeling rules:
the labeling of test questions and test papers aims at realizing the labeling of each component of the test papers, and the labeling method is different from the common word labeling method in that the labeling can be directly performed on original data through the labeling rule, while the common word labeling method generally uses dictionary format recording and needs to rearrange the data.
Test paper title begins: [SJBT]
Test paper title ends: [/SJBT]
Test paper description begins: [SJSM]
Test paper description ends: [/SJSM]
Question-type description begins: [TX]
Question-type description ends: [/TX]
Question stem begins: [TG]
Question stem ends: [/TG]
Question begins: [WT]
Question ends: [/WT]
Knowledge point description begins: [ZSD]
Knowledge point description ends: [/ZSD]
Choice option 1 begins: [XZ1]
Choice option 1 ends: [/XZ1]
Choice option 2 begins: [XZ2]
Choice option 2 ends: [/XZ2]
..., and so on
Answer begins: [DA]
Answer ends: [/DA]
Analysis begins: [JX]
Analysis ends: [/JX]
Other tags:
Test question begins: [ST]
Test question ends: [/ST]
Question-type tags:
Single-choice question: [DANX/]
Multiple-choice question: [DUOX/]
True-or-false question: [PDT/]
Fill-in-the-blank question: [TKT/]
Short-answer question: [JDT/]
Cloze (gap-filling) question: [WXTK/]
Reading comprehension: [YDLJ/]
II. Labeling example:
(1) Example of original test paper content:
Test questions for grade 3 of a primary school in a certain city:
I. Choice questions.
1. Which of these are odd?
A、4  B、5  C、8
Answer: B
Analysis: a number that cannot be divided exactly by 2 is odd;
(2) Example of the labeled test paper:
[SJBT]Test questions for grade 3 of a primary school in a certain city:[/SJBT]
[TX]I. Choice questions.[/TX]
[ST]
[DANX/]
[ZSD]odd and even numbers[/ZSD]
[WT]1. Which of these are odd?[/WT]
[XZ1]A、4[/XZ1][XZ2]B、5[/XZ2][XZ3]C、8[/XZ3]
[DA]Answer: B[/DA]
[JX]Analysis: a number not divisible by 2 is odd[/JX]
[/ST]
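A sketch of recovering structured fields from text labeled under the above rules is shown below; the tag scheme follows the rules given earlier, while the parsing helper itself is our illustration, not part of the patent.

```python
import re

PAIRED_TAGS = ["SJBT", "SJSM", "TX", "TG", "WT", "ZSD", "DA", "JX", "ST"]

def parse_tagged(text):
    """Return {tag: [contents]} for paired tags like [WT]...[/WT]."""
    fields = {}
    for tag in PAIRED_TAGS:
        pattern = re.compile(r'\[%s\](.*?)\[/%s\]' % (tag, tag), re.S)
        fields[tag] = [m.strip() for m in pattern.findall(text)]
    # numbered option tags [XZ1], [XZ2], ... are matched generically
    fields["XZ"] = [m.strip() for m in
                    re.findall(r'\[XZ\d+\](.*?)\[/XZ\d+\]', text, re.S)]
    return fields

sample = "[WT]1. Which of these are odd?[/WT][XZ1]A、4[/XZ1][XZ2]B、5[/XZ2]"
fields = parse_tagged(sample)
print(fields["WT"])  # ['1. Which of these are odd?']
print(fields["XZ"])  # ['A、4', 'B、5']
```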
Referring to fig. 6, an embodiment of the present invention further provides a labeling device for unstructured test question data, where the labeling device for unstructured test question data may be any type of intelligent terminal, such as a mobile phone, a tablet computer, a personal computer, and the like.
Specifically, the device for labeling unstructured test question data comprises: one or more control processors and memory, one control processor being exemplified in fig. 6. The control processor and the memory may be connected by a bus or other means, as exemplified by the bus connection in fig. 6.
The memory, as a non-transitory computer-readable storage medium, can store non-transitory software programs, non-transitory computer-executable programs and modules, such as the program instructions/modules corresponding to the labeling device for unstructured test question data in the embodiment of the present invention; by running the non-transitory software programs, instructions and modules stored in the memory, the control processor implements the method for labeling unstructured test question data of the above method embodiment.
The memory may include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function; the storage data area may store the generated data. Further, the memory may include high speed random access memory, and may also include non-transitory memory, such as at least one disk storage device, flash memory device, or other non-transitory solid state storage device. In some embodiments, the memory optionally includes a memory remotely located from the control processor, and the remote memories may be connected to the annotating device of the unstructured test question data via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
One or more modules are stored in the memory and, when executed by the one or more control processors, perform the method of annotating unstructured test question data in the above-described method embodiments, e.g., performing the method steps S100-S400 in fig. 2 described above.
Embodiments of the present invention also provide a computer-readable storage medium storing computer-executable instructions, which are executed by one or more control processors, for example, by one of the control processors in fig. 6, and can cause the one or more control processors to execute the method for labeling unstructured test question data in the above method embodiments, for example, to execute the above-described method steps S100-S400 in fig. 2.
The above-described embodiments of the apparatus are merely illustrative, wherein the units illustrated as separate components may or may not be physically separate, i.e. may be located in one place, or may also be distributed over a plurality of network elements. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment.
Through the above description of the embodiments, those skilled in the art can clearly understand that the embodiments can be implemented by software plus a general hardware platform. Those skilled in the art will appreciate that all or part of the processes of the methods of the above embodiments may be implemented by hardware related to instructions of a computer program, which may be stored in a computer readable storage medium, and when executed, may include the processes of the embodiments of the methods described above. The storage medium may be a magnetic disk, an optical disk, a Read Only Memory (ROM), a Random Access Memory (RAM), or the like.
The embodiments of the present invention have been described in detail with reference to the accompanying drawings, but the present invention is not limited to the above embodiments, and various changes can be made within the knowledge of those skilled in the art without departing from the gist of the present invention.
Claims (7)
1. A labeling method for unstructured test question data is characterized by comprising the following steps:
extracting a plurality of unlabeled data items from a test question data set by a bootstrap sampling method, for preprocessing;
inputting the preprocessed plurality of unlabeled data items into a deep learning network, and inputting labeled data into the deep learning network for correction;
respectively inputting output results of the deep learning network into at least two base classifiers of different types for ensemble learning, wherein each base classifier comprises a plurality of weak learners of the same type and a strong learner;
and constructing a transition probability matrix from the output data of all the base classifiers, solving the transition probability matrix, and generating a labeling result.
2. The method for labeling unstructured test question data according to claim 1, characterized in that the deep learning network is a BiLSTM network.
3. The method for labeling unstructured test question data according to claim 1 or 2, characterized in that the base classifiers comprise: a base classifier based on conditional random fields, a base classifier based on a structured support vector machine, and a base classifier based on a maximum-margin Markov network.
4. The method for labeling unstructured test question data according to claim 3, characterized in that the transition probability matrix is solved using the Viterbi algorithm.
5. The method for labeling unstructured test question data according to claim 3, characterized in that the preprocessing comprises: extracting test question texts, segmenting the test question texts and vectorizing the test question texts.
6. An apparatus for labeling unstructured test question data, comprising: at least one control processor and a memory for communicative connection with the at least one control processor; the memory stores instructions executable by the at least one control processor to enable the at least one control processor to perform a method of annotating unstructured test question data according to any one of claims 1 to 5.
7. A computer-readable storage medium storing computer-executable instructions for causing a computer to perform a method of labeling unstructured test question data according to any one of claims 1 to 5.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title
---|---|---|---
CN202010169346.XA | 2020-03-12 | 2020-03-12 | Labeling method, device and storage medium for unstructured test question data
Publications (1)
Publication Number | Publication Date
---|---
CN111488728A | 2020-08-04
Family
ID=71797632
Family Applications (1)
Application Number | Title | Priority Date | Filing Date
---|---|---|---
CN202010169346.XA | CN111488728A (en), pending | 2020-03-12 | 2020-03-12
Country Status (1)
Country | Link
---|---
CN | CN111488728A (en)
Patent Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20180114142A1 (en) * | 2016-10-26 | 2018-04-26 | Swiss Reinsurance Company Ltd. | Data extraction engine for structured, semi-structured and unstructured data with automated labeling and classification of data patterns or data elements therein, and corresponding method thereof |
US10380236B1 (en) * | 2017-09-22 | 2019-08-13 | Amazon Technologies, Inc. | Machine learning system for annotating unstructured text |
CN109408812A (en) * | 2018-09-30 | 2019-03-01 | 北京工业大学 | A method of the sequence labelling joint based on attention mechanism extracts entity relationship |
Non-Patent Citations (7)
Title |
---|
Wu Jiale, "Application of heterogeneous ensemble learners in iris flower classification", China Equipment Engineering |
Wu Xiaohuan; Fu Jinqiang; Zhou Xunhai; Peng Gang; Chen Min; Ruan Jing, "Research on fault prediction methods for production processes based on industrial data", Smart Factory, no. 11 |
Fang Hao et al., "Research on an improved hidden Markov model based on semantic relations", Communications Technology, no. 05, 10 May 2008 |
Zhu Pengfei, "Research on automatic Chinese semantic role labeling based on Bi-LSTM", China Masters' Theses Full-text Database, 15 August 2019, pages 13-14 |
Guo Chonghui et al., "A multi-knowledge-point labeling method for test questions based on ensemble learning", Operations Research and Management Science, no. 02, 25 February 2020 |
Jin Zhigang, "A sentiment analysis model combining deep learning and ensemble learning", Journal of Harbin Institute of Technology |
Tao Jiadong, "A mental workload recognition model based on Bagging and extreme learning machines", Software Guide |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112015903A (en) * | 2020-10-22 | 2020-12-01 | Guangzhou Huaduo Network Technology Co., Ltd. | Question duplication judging method and device, storage medium and computer equipment |
Legal Events
Date | Code | Title | Description
---|---|---|---
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| RJ01 | Rejection of invention patent application after publication | Application publication date: 20200804 |