CN112989760A - Method and device for labeling subjects, storage medium and electronic equipment - Google Patents

Method and device for labeling subjects, storage medium and electronic equipment

Info

Publication number
CN112989760A
CN112989760A (application CN201911301442.9A)
Authority
CN
China
Prior art keywords
target
labeled
information
determining
question
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
CN201911301442.9A
Other languages
Chinese (zh)
Inventor
张弦
程辉
邢德阳
董平
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Yiyi Education Information Consulting Co ltd
Original Assignee
Beijing Yiyi Education Information Consulting Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Yiyi Education Information Consulting Co ltd filed Critical Beijing Yiyi Education Information Consulting Co ltd
Priority to CN201911301442.9A priority Critical patent/CN112989760A/en
Publication of CN112989760A publication Critical patent/CN112989760A/en
Withdrawn legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Machine Translation (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The embodiment of the invention provides a method, an apparatus, a storage medium and an electronic device for labeling questions. The scheme labels questions automatically based on their type information, thereby addressing the low efficiency and high cost of manual labeling.

Description

Method and device for labeling subjects, storage medium and electronic equipment
Technical Field
The invention relates to the technical field of document classification, and in particular to a method and an apparatus for tagging questions, a storage medium, and an electronic device.
Background
Generally, questions are organized by tagging them; a tag may indicate, for example, the question type, difficulty level, or knowledge point, so that each question is associated with a given dimension. At present, tagging is done manually. This is accurate, but it is inefficient and requires a large amount of time and labor.
Therefore, how to tag questions in a way that preserves accuracy while improving efficiency and reducing cost is a significant technical problem to be solved by those skilled in the art.
Disclosure of Invention
In view of this, the embodiment of the present invention provides a method for labeling questions, which can ensure accuracy, improve efficiency, and reduce cost.
In order to achieve the above purpose, the embodiments of the present invention provide the following technical solutions:
a method of tagging a topic, comprising:
determining type information of a topic to be labeled, wherein the type information at least comprises a calculation class and an application class;
and determining the target label of the title to be labeled based on the type information of the title to be labeled.
Optionally, when the type information of the to-be-labeled question is the calculation class, determining the target label of the to-be-labeled question based on the type information of the to-be-labeled question, including:
acquiring target text information of the question to be labeled, wherein the target text information at least comprises a question stem and an answer of the question to be labeled;
determining a target arithmetic expression based on the target character information;
analyzing the target arithmetic expression based on a context-free grammar algorithm to obtain a target quadruple;
and determining the target label of the question to be labeled based on the target quadruple, the target arithmetic expression and a preset division rule.
Optionally, the determining a target arithmetic expression based on the target text information includes:
based on a regular expression, carrying out data cleaning on the target character information to obtain de-noised character information;
and extracting information of the de-noised character information to obtain the target arithmetic expression.
Optionally, the determining a target tag of the to-be-tagged question based on the target quadruple, the target arithmetic expression and a preset partition rule includes:
acquiring a maximum operand in the target quadruple;
acquiring bracket information in the target arithmetic expression;
and determining that the label meeting the maximum operand and the bracket information in the preset division rule is the target label of the to-be-labeled question based on the maximum operand and the bracket information.
Optionally, when the type information of the to-be-labeled title is the application class, determining the target label of the to-be-labeled title based on the type information of the to-be-labeled title includes:
acquiring target text information of the subject to be labeled, wherein the target text information at least comprises a subject stem and analysis of the subject to be labeled;
performing word segmentation and stop-word removal on the target text information to obtain a word feature set;
calculating the word frequency-inverse document frequency (TF-IDF) value of each word feature in the word feature set;
training a preset classifier based on the TF-IDF values to obtain a target classifier;
and inputting the questions to be labeled into the target classifier, and outputting the target labels of the questions to be labeled.
An apparatus for tagging a title, comprising:
the system comprises a first determining module, a second determining module and a third determining module, wherein the first determining module is used for determining type information of a topic to be labeled, and the type information at least comprises a calculation class and an application class;
and the second determining module is used for determining the target label of the title to be labeled based on the type information of the title to be labeled.
Optionally, when the type information of the topic to be labeled is the calculation class, the second determining module includes:
a first obtaining unit, configured to obtain target text information of the to-be-labeled question, where the target text information at least includes a question stem and an answer of the to-be-labeled question;
the first determining unit is used for determining a target arithmetic expression based on the target character information;
the first processing unit is used for analyzing the target arithmetic expression based on a context-free grammar algorithm to obtain a target quadruplet;
the second determining unit is used for determining a target label of the to-be-labeled question based on the target quadruple, the target arithmetic expression and a preset division rule;
or, alternatively,
when the type information of the title to be labeled is the application class, the second determining module comprises:
a second obtaining unit, configured to obtain target text information of the to-be-labeled question, where the target text information at least includes a question stem and an analysis of the to-be-labeled question;
the second processing unit is used for performing word segmentation and stop-word removal on the target text information to obtain a word feature set;
the calculation unit is used for calculating the word frequency-inverse document frequency (TF-IDF) value of each word feature in the word feature set;
the third processing unit is used for training a preset classifier based on the TF-IDF values to obtain a target classifier;
and the output unit is used for inputting the to-be-labeled question into the target classifier and outputting the target label of the to-be-labeled question.
Optionally, the first determining unit includes:
the first processing subunit is used for carrying out data cleaning on the target text information based on a regular expression to obtain de-noised text information;
the second processing subunit is used for extracting information from the de-noised text information to obtain the target arithmetic expression;
and/or the second determination unit comprises:
a first obtaining subunit, configured to obtain a maximum operand in the target quadruple;
the second acquisition subunit is used for acquiring the bracket information in the target arithmetic expression;
and the determining subunit is configured to determine, based on the maximum operand and the parenthesis information, that a tag in the preset partitioning rule, which meets the maximum operand and the parenthesis information, is a target tag of the to-be-tagged title.
A storage medium, the storage medium comprising a stored program, wherein when the program runs, a device on which the storage medium is located is controlled to execute any one of the above methods for tagging titles.
An electronic device comprising at least one processor, and at least one memory and a bus connected to the processor; the processor and the memory communicate with each other through the bus; the processor is configured to invoke program instructions in the memory to perform any of the above methods for tagging questions.
Based on the above technical scheme, the embodiment of the invention provides a method, an apparatus, a storage medium and an electronic device for labeling questions. The scheme labels questions automatically based on their type information, thereby addressing the low efficiency and high cost of manual labeling.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art are briefly introduced below. It is obvious that the drawings in the following description show only embodiments of the present invention, and those skilled in the art can obtain other drawings from the provided drawings without creative effort.
FIG. 1 is a schematic flow chart of a method for tagging a title according to an embodiment of the present invention;
FIG. 2 is a schematic flow chart of a method for tagging a title according to an embodiment of the present invention;
FIG. 3 is a schematic flow chart of a method for tagging a title according to an embodiment of the present invention;
FIG. 4 is a schematic flow chart of a method for tagging a title according to an embodiment of the present invention;
FIG. 5 is a schematic flow chart of a method for tagging a title according to an embodiment of the present invention;
FIG. 6 is a schematic structural diagram of an apparatus for labeling a title according to an embodiment of the present invention;
fig. 7 is a hardware architecture diagram of a system for tagging titles according to an embodiment of the present invention.
Detailed Description
Referring to fig. 1, fig. 1 is a schematic flow chart of a method for tagging titles according to an embodiment of the present invention, where the method includes:
and S11, determining the type information of the title to be labeled.
S12, determining the target label of the title to be labeled based on the type information of the title to be labeled.
It should be noted that, in this embodiment, the type information at least includes a calculation class and an application class. In this scheme, questions are labeled automatically based on their type information, which addresses the low labeling efficiency and high cost of manual labeling.
Specifically, because the type information is different, the embodiment of the present invention provides two specific implementation manners for determining the target tag of the topic to be tagged based on the type information of the topic to be tagged, which are as follows:
in a first mode
When the type information of the to-be-labeled question is the calculation class, as shown in fig. 2, determining the target label of the to-be-labeled question based on the type information of the to-be-labeled question includes the steps of:
s21, obtaining target text information of the to-be-labeled question, wherein the target text information at least comprises a question stem and an answer of the to-be-labeled question.
For example, the question stem of the original question (e.g. <p>2\times(99+6)=$__$1__</p>), the answer (210), and other information are taken from the database.
S22, determining a target arithmetic expression based on the target character information;
specifically, as shown in fig. 3, an embodiment of the present invention provides a specific implementation manner for determining a target arithmetic expression based on the target text information, including the steps of:
s31, based on the regular expression, carrying out data cleaning on the target character information to obtain de-noised character information;
and S32, extracting information of the de-noised character information to obtain the target arithmetic expression.
Illustratively, useless information such as the html tags (<p></p>) and spaces in the text is cleaned away with regular expressions and similar techniques, yielding the de-noised question stem (2\times(99+6)=__$1__) and answer (210). The LaTeX operator (\times) is then replaced by the corresponding arithmetic operator (×), the answer (210) is substituted into the blank (__$1__), and the arithmetic expression (2×(99+6)=210) is extracted.
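As an illustrative aside (not part of the original disclosure), this cleaning and extraction step can be sketched in Python as follows; the blank marker $__$1__ and the helper name extract_expression are assumptions drawn from the example above:

import re

def extract_expression(stem: str, answer: str) -> str:
    # Strip HTML tags and whitespace noise.
    text = re.sub(r"<[^>]+>", "", stem)
    text = re.sub(r"\s+", "", text)
    # Map LaTeX operators to plain arithmetic operators.
    text = text.replace(r"\times", "×").replace(r"\div", "÷")
    # Substitute the answer into the blank placeholder and drop leftover '$'.
    text = re.sub(r"\$?__\$?1?__", answer, text)
    return text.replace("$", "")

print(extract_expression(r"<p>2\times(99+6)=$__$1__</p>", "210"))
# prints: 2×(99+6)=210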
S23, analyzing the target arithmetic expression based on a context-free grammar algorithm to obtain a target quadruple;
the arithmetic expressions have been extracted above, and then, for the extracted arithmetic expressions, the latter are parsed by a context-free grammar and stored by a tree of quadruplets, such as: [ '+,99,6, TREE1', 'C'
×,2,TREE1,TREE2','=,TREE2,210,TREE3']。
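As a non-authoritative illustration of this parsing step, the following Python sketch uses a small recursive-descent parser (one possible realization of a context-free grammar for +, -, ×, ÷ and parentheses) to produce the quadruple list shown above; the function name parse_to_quads is hypothetical:

import re

def parse_to_quads(expr: str):
    tokens = re.findall(r"\d+|[()+\-×÷=]", expr)
    quads, pos = [], 0

    def new_node(op, left, right):
        # Each intermediate result becomes a named TREE node in a quadruple.
        name = f"TREE{len(quads) + 1}"
        quads.append(f"{op},{left},{right},{name}")
        return name

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def factor():
        nonlocal pos
        tok = tokens[pos]; pos += 1
        if tok == "(":
            node = add_sub()
            pos += 1          # skip the closing ")"
            return node
        return tok            # a plain number

    def mul_div():
        nonlocal pos
        node = factor()
        while peek() in ("×", "÷"):
            op = tokens[pos]; pos += 1
            node = new_node(op, node, factor())
        return node

    def add_sub():
        nonlocal pos
        node = mul_div()
        while peek() in ("+", "-"):
            op = tokens[pos]; pos += 1
            node = new_node(op, node, mul_div())
        return node

    left = add_sub()
    if peek() == "=":
        pos += 1
        new_node("=", left, add_sub())
    return quads

print(parse_to_quads("2×(99+6)=210"))
# prints: ['+,99,6,TREE1', '×,2,TREE1,TREE2', '=,TREE2,210,TREE3']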
It should be noted that the inventor considers the knowledge points of the calculation (mental arithmetic) class to be fine-grained: addition, for example, must be refined into addition within 10 or within 100. Therefore, when labeling knowledge points, the sequence of calculation steps and the intermediate result of each step must be known.
S24, determining the target label of the to-be-labeled question based on the target quadruple, the target arithmetic expression and a preset division rule.
Specifically, as shown in fig. 4, an embodiment of the present invention provides a specific implementation manner for determining a target tag of a topic to be tagged based on the target quadruple, the target arithmetic expression, and a preset partitioning rule, including:
s41, acquiring the maximum operand in the target quadruple;
s42, acquiring bracket information in the target arithmetic expression;
s43, determining that the label meeting the maximum operand and the bracket information in the preset dividing rule is the target label of the to-be-labeled question based on the maximum operand and the bracket information.
That is, this step interprets the quadruples and the arithmetic expression against the domain background rules and labels the question with a knowledge point. For example, when the above quadruples are resolved there are an addition and a multiplication: the maximum operand of the addition is 99, and the maximum operand of the multiplication is 105 (the intermediate result obtained by resolving TREE1). The arithmetic expression is then analyzed; because it contains brackets, this information is combined with the domain background rules and the question is labeled with the knowledge point "four arithmetic operations with brackets (within 1000)". If the extracted arithmetic expression were 2×(9+6)=30, the question would be labeled with the knowledge point "four arithmetic operations with brackets (within 100)"; if it were 2×9+6=24, the question would be labeled "multiply-add (7-9 in the table)", where "7-9 in the table" means the multiplication falls within the nine-nine multiplication table and its maximum operand is between 7 and 9.
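A minimal sketch of this rule step, assuming a simplified partition rule: the thresholds and label strings below are illustrative placeholders for the actual business rules, and pick_label is a hypothetical helper:

def pick_label(quads, expr):
    # Resolve intermediate TREE results so the max operand reflects real values.
    values, max_operand = {}, 0.0
    for quad in quads:
        op, a, b, name = quad.split(",")
        va = values.get(a, float(a) if a.replace(".", "").isdigit() else 0)
        vb = values.get(b, float(b) if b.replace(".", "").isdigit() else 0)
        if op != "=":
            max_operand = max(max_operand, va, vb)
        values[name] = {"+": va + vb, "-": va - vb,
                        "×": va * vb, "÷": va / vb if vb else 0}.get(op, vb)
    has_brackets = "(" in expr
    if has_brackets and max_operand < 100:
        return "four arithmetic operations with brackets (within 100)"
    if has_brackets and max_operand < 1000:
        return "four arithmetic operations with brackets (within 1000)"
    return "mixed operations (within 1000)"

quads = ['+,99,6,TREE1', '×,2,TREE1,TREE2', '=,TREE2,210,TREE3']
print(pick_label(quads, "2×(99+6)=210"))
# prints: four arithmetic operations with brackets (within 1000)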
Mode two
When the type information of the topic to be labeled is the application class, as shown in fig. 5, an embodiment of the present invention provides a specific implementation manner for determining a target label of the topic to be labeled based on the type information of the topic to be labeled, including:
s51, acquiring target text information of the to-be-labeled question, wherein the target text information at least comprises a question stem and an analysis of the to-be-labeled question;
S52, performing word segmentation and stop-word removal on the target text information to obtain a word feature set;
S53, calculating the word frequency-inverse document frequency (TF-IDF) value of each word feature in the word feature set;
S54, training a preset classifier based on the TF-IDF values to obtain a target classifier;
s55, inputting the to-be-labeled question into the target classifier, and outputting the target label of the to-be-labeled question.
That is, when the type of the question to be labeled is the application class, a text classification method (a form of supervised machine learning) is used. It has two parts: training a classifier, and using the classifier to assign a knowledge point to a new question, as follows:
firstly, training a classifier:
1. data cleaning:
Collect a batch of questions with manually labeled knowledge point labels as a data set, take the text information of each question (question stem, analysis, etc.) from the database, clean out useless information such as html tags, spaces, numbers and operators using regular expressions and similar techniques, and splice the question stem and analysis of each question together to obtain the text information of each question.
for example, the labeled knowledge point label of topic Q1 is "basic formula of travel problem", and the original topic stem + is resolved as: "cheetah runs 30 meters per second, 20 seconds __ $ __ meters. < p > [ hint ] the distance is speed x time. [ p > < p > [ solution ] cheetah runs 30 meters per second, 20 seconds: 30 × 20 ═ 600 (meters). And the text information obtained after cleaning is' cheetah runs meter per second and second runs meter. And prompting the distance speed time. Solving the cheetah to run rice per second, running second: and (4) rice. "; the labeled knowledge point label of topic Q2 is "engineering problem basic formula", and the original topic stem + is analyzed as: "one paving team paves 500 meters for 25 hours. Calculated as such, 8.5 hours of paving __ $ $ __ meters. The work efficiency is work total amount divided by work time; the total work amount is work efficiency × work time. [ p > < p > [ answer ] A paving team paves 500 meters in 25 hours, then paves every hour: 500 ÷ 25 ÷ 20 (meters); calculated as such, 8.5 hours paving: 20 × 8.5 ═ 170 (m). The road is paved for 8.5 hours by 170 meters. </p >, the text information obtained after cleaning is "paving rice when one paving team is used for paving a road. The hours of paving rice were calculated as such. Prompting the working efficiency and the total working time; total work amount work efficiency work time. Solving one paving team to pave rice for one hour, paving the rice for each hour: rice; calculated as such, hourly paving: and (4) rice. And paving the rice when answering. "
2. Data preprocessing:
Perform Chinese word segmentation on the text information of each question in the data set, remove stop words (punctuation marks and function words such as "is", "a", "hint", "answer" and the like), and gather and de-duplicate all remaining words to form the word feature set for text classification. For example, Q1 yields a word sequence such as "cheetah runs meters per second seconds runs meters distance speed time cheetah runs meters per second seconds meters", Q2 yields "paving team hours paves road meters work efficiency total work work time paving team hours paves meters per hour meters answer paving meters", and the combined, de-duplicated word feature set contains 19 word features such as "cheetah", "run", "meter", "second", "distance", "speed", "time", "pave", "team", "hour", "work", "efficiency" and so on. Each question is then expressed with a vector space model to obtain its vectorized representation: each column of the vector represents one word feature in the word feature set obtained above, and its value is the TF-IDF (word frequency times inverse document frequency) value of the question for that feature. TF-IDF(t, d) = TF(t, d) × IDF(t), where t denotes a word and d denotes a document (in this example, one question text). TF(t, d) is the number of times word t appears in document d divided by the total number of words in document d. For example, the word "cheetah" appears 2 times in Q1, whose total number of words is 17, and 0 times in Q2, whose total number of words is 39; so the TF (word frequency) of "cheetah" is 2/17 ≈ 0.118 in Q1 and 0/39 = 0 in Q2. IDF(t) = log[n/DF(t)] + 1, where n denotes the size of the document set; in this example there are only the two questions Q1 and Q2, so n = 2. DF(t) is the number of documents containing the word t: the word "cheetah" appears only in Q1, so its DF value is 1, while the word "meter" appears in both Q1 and Q2, so its DF value is 2. Following this algorithm, a TF-IDF value is computed with respect to Q1 and Q2 for each of the 19 words in the word feature set, giving two 19-dimensional vectors that represent the two questions Q1 and Q2.
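For illustration, the TF-IDF computation described above can be sketched as follows; the toy word lists stand in for the real segmented, stop-word-filtered question texts, and the formulas follow the ones stated in this paragraph (TF = count / total words, IDF = log(n / DF) + 1):

import math
from collections import Counter

def tfidf_vectors(docs):
    # docs: list of questions, each already segmented into a list of words.
    vocab = sorted({w for doc in docs for w in doc})
    n = len(docs)
    df = {w: sum(1 for doc in docs if w in doc) for w in vocab}
    vectors = []
    for doc in docs:
        counts, total = Counter(doc), len(doc)
        vectors.append([(counts[w] / total) * (math.log(n / df[w]) + 1)
                        for w in vocab])
    return vocab, vectors

q1 = ["cheetah", "run", "meter", "second", "run", "meter",
      "distance", "speed", "time", "cheetah"]
q2 = ["pave", "team", "hour", "pave", "meter", "work",
      "efficiency", "total", "work", "time"]
vocab, vectors = tfidf_vectors([q1, q2])
print(vocab)          # the word feature set (one column per word feature)
print(vectors[0])     # TF-IDF vector representing question Q1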
3. Model training:
The vector calculated for each question is used as X, the question's knowledge point label as y, and (X, y) as a training sample. The processed data are used to train an SGDClassifier model (a classification model trained by stochastic gradient descent) with the modified_huber loss function. For each input X, the SGDClassifier computes the probability that the input belongs to each knowledge point label and takes the label with the highest probability as its output; if there is an error (expressed by the loss function) between the model output and the actual label, the parameters of the SGDClassifier are adjusted during training in the direction that reduces this error. Training finally yields a text classifier.
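Assuming the scikit-learn library (the text names SGDClassifier and the modified_huber loss but does not fix the surrounding tooling), training might look like the following sketch; the sample texts and labels are placeholders:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import make_pipeline

# Cleaned, space-joined question texts (X) and their manual labels (y).
texts = ["cheetah run meter second distance speed time",
         "pave team hour meter work efficiency total work time"]
labels = ["basic formula of travel problems",
          "basic formula of engineering problems"]

# TF-IDF vectorization followed by an SGD-trained linear classifier.
classifier = make_pipeline(TfidfVectorizer(),
                           SGDClassifier(loss="modified_huber"))
classifier.fit(texts, labels)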
Secondly, marking knowledge points on the new questions by using a classifier:
Take the text information (question stem, analysis, etc.) of a new question whose knowledge point label is unknown from the database, and carry out the same data cleaning and data preprocessing as in the training part to obtain the vectorized representation of the question.
Take this vector as the input of the text classifier model obtained in the training part and predict the result (the model calculates the probability that the input belongs to each knowledge point label and takes the label with the highest probability as its prediction); this result is used as the knowledge point label of the question.
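Continuing the training sketch above, prediction on a new, cleaned question could look like this; with the modified_huber loss, predict_proba is available, so the label with the highest probability can be read off directly:

# Vectorize the cleaned new question with the same pipeline and predict.
new_question = ["team pave meter hour work efficiency time"]
probabilities = classifier.predict_proba(new_question)[0]
best = probabilities.argmax()
print(classifier.classes_[best], probabilities[best])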
In this way, the embodiment of the invention determines the target label based on the type information of the question to be labeled: for calculation-class questions the arithmetic expression is parsed with a context-free grammar and the intermediate steps are stored as a quadruple tree, while application-class questions are labeled using a text classification approach.
In addition, on the basis of the above embodiments, an embodiment of the present invention further provides an apparatus for tagging titles, as shown in fig. 6, including:
the first determining module 61 is configured to determine type information of a topic to be labeled, where the type information at least includes a calculation class and an application class;
and a second determining module 62, configured to determine a target tag of the topic to be tagged based on the type information of the topic to be tagged.
When the type information of the to-be-labeled question is the calculation class, the second determining module may include:
a first obtaining unit, configured to obtain target text information of the to-be-labeled question, where the target text information at least includes a question stem and an answer of the to-be-labeled question;
the first determining unit is used for determining a target arithmetic expression based on the target character information;
the first processing unit is used for analyzing the target arithmetic expression based on a context-free grammar algorithm to obtain a target quadruplet;
the second determining unit is used for determining a target label of the to-be-labeled question based on the target quadruple, the target arithmetic expression and a preset division rule;
or, alternatively,
when the type information of the to-be-labeled title is the application class, the second determining module may include:
a second obtaining unit, configured to obtain target text information of the to-be-labeled question, where the target text information at least includes a question stem and an analysis of the to-be-labeled question;
the second processing unit is used for performing word segmentation and stop-word removal on the target text information to obtain a word feature set;
the calculation unit is used for calculating the word frequency-inverse document frequency (TF-IDF) value of each word feature in the word feature set;
the third processing unit is used for training a preset classifier based on the TF-IDF values to obtain a target classifier;
and the output unit is used for inputting the to-be-labeled question into the target classifier and outputting the target label of the to-be-labeled question.
In addition, the first determination unit may include:
the first processing subunit is used for carrying out data cleaning on the target text information based on a regular expression to obtain de-noised text information;
the second processing subunit is used for extracting information from the de-noised text information to obtain the target arithmetic expression;
and/or the second determination unit comprises:
a first obtaining subunit, configured to obtain a maximum operand in the target quadruple;
the second acquisition subunit is used for acquiring the bracket information in the target arithmetic expression;
and the determining subunit is configured to determine, based on the maximum operand and the parenthesis information, that a tag in the preset partitioning rule, which meets the maximum operand and the parenthesis information, is a target tag of the to-be-tagged title.
The title labeling device comprises a processor and a memory, wherein the first determining module, the second determining module and the like are stored in the memory as program units, and the processor executes the program units stored in the memory to realize corresponding functions.
The processor comprises a kernel, and the kernel calls the corresponding program unit from the memory. One or more kernels may be provided, and by adjusting kernel parameters the problems of low labeling efficiency and high cost caused by manual labeling are addressed.
An embodiment of the present invention provides a storage medium, on which a program is stored, where the program, when executed by a processor, implements the method for tagging topics.
The embodiment of the invention provides a processor, which is used for running a program, wherein the method for tagging titles is executed when the program runs.
An embodiment of the present invention provides an apparatus. As shown in fig. 7, the apparatus includes at least one processor 71, at least one memory 72 connected to the processor, and a bus 73; the processor and the memory communicate with each other through the bus; the processor is used for calling the program instructions in the memory to execute the method for tagging questions. The device here may be a server, a PC, a PAD, a mobile phone, and the like.
The present application further provides a computer program product adapted to perform a program for initializing the following method steps when executed on a data processing device:
a method of tagging a topic, comprising:
determining type information of a topic to be labeled, wherein the type information at least comprises a calculation class and an application class;
and determining the target label of the title to be labeled based on the type information of the title to be labeled.
Optionally, when the type information of the to-be-labeled question is the calculation class, determining the target label of the to-be-labeled question based on the type information of the to-be-labeled question, including:
acquiring target text information of the question to be labeled, wherein the target text information at least comprises a question stem and an answer of the question to be labeled;
determining a target arithmetic expression based on the target character information;
analyzing the target arithmetic expression based on a context-free grammar algorithm to obtain a target quadruple;
and determining the target label of the question to be labeled based on the target quadruple, the target arithmetic expression and a preset division rule.
Optionally, the determining a target arithmetic expression based on the target text information includes:
based on a regular expression, carrying out data cleaning on the target character information to obtain de-noised character information;
and extracting information of the de-noised character information to obtain the target arithmetic expression.
Optionally, the determining a target tag of the to-be-tagged question based on the target quadruple, the target arithmetic expression and a preset partition rule includes:
acquiring a maximum operand in the target quadruple;
acquiring bracket information in the target arithmetic expression;
and determining that the label meeting the maximum operand and the bracket information in the preset division rule is the target label of the to-be-labeled question based on the maximum operand and the bracket information.
Optionally, when the type information of the to-be-labeled title is the application class, determining the target label of the to-be-labeled title based on the type information of the to-be-labeled title includes:
acquiring target text information of the subject to be labeled, wherein the target text information at least comprises a subject stem and analysis of the subject to be labeled;
performing word segmentation and stop-word removal on the target text information to obtain a word feature set;
calculating the word frequency-inverse document frequency (TF-IDF) value of each word feature in the word feature set;
training a preset classifier based on the TF-IDF values to obtain a target classifier;
and inputting the questions to be labeled into the target classifier, and outputting the target labels of the questions to be labeled.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
In a typical configuration, a device includes one or more processors (CPUs), memory, and a bus. The device may also include input/output interfaces, network interfaces, and the like.
The memory may include volatile memory in a computer readable medium, Random Access Memory (RAM) and/or nonvolatile memory such as Read Only Memory (ROM) or flash memory (flash RAM), and the memory includes at least one memory chip. The memory is an example of a computer-readable medium.
Computer-readable media include permanent and non-permanent, removable and non-removable media, and may implement information storage by any method or technology. The information may be computer readable instructions, data structures, modules of a program, or other data. Examples of computer storage media include, but are not limited to, phase change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read only memory (ROM), electrically erasable programmable read only memory (EEPROM), flash memory or other memory technology, compact disc read only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer readable media do not include transitory computer readable media such as modulated data signals and carrier waves.
It should also be noted that the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a(n) ..." does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises the element.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The above are merely examples of the present application and are not intended to limit the present application. Various modifications and changes may occur to those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present application should be included in the scope of the claims of the present application.

Claims (10)

1. A method for tagging a topic, comprising:
determining type information of a topic to be labeled, wherein the type information at least comprises a calculation class and an application class;
and determining the target label of the title to be labeled based on the type information of the title to be labeled.
2. The method according to claim 1, wherein when the type information of the topic to be labeled is the calculation class, the determining a target label of the topic to be labeled based on the type information of the topic to be labeled comprises:
acquiring target text information of the question to be labeled, wherein the target text information at least comprises a question stem and an answer of the question to be labeled;
determining a target arithmetic expression based on the target character information;
analyzing the target arithmetic expression based on a context-free grammar algorithm to obtain a target quadruple;
and determining the target label of the question to be labeled based on the target quadruple, the target arithmetic expression and a preset division rule.
3. The method for tagging a topic according to claim 2, wherein the determining a target arithmetic expression based on the target textual information comprises:
based on a regular expression, carrying out data cleaning on the target character information to obtain de-noised character information;
and extracting information of the de-noised character information to obtain the target arithmetic expression.
4. The method for tagging topics according to claim 2, wherein the determining the target tag of the topic to be tagged based on the target quadruplet, the target arithmetic expression and a preset partitioning rule comprises:
acquiring a maximum operand in the target quadruple;
acquiring bracket information in the target arithmetic expression;
and determining that the label meeting the maximum operand and the bracket information in the preset division rule is the target label of the to-be-labeled question based on the maximum operand and the bracket information.
5. The method according to claim 1, wherein when the type information of the topic to be labeled is the application class, determining the target label of the topic to be labeled based on the type information of the topic to be labeled comprises:
acquiring target text information of the subject to be labeled, wherein the target text information at least comprises a subject stem and analysis of the subject to be labeled;
performing word segmentation and stop-word removal on the target text information to obtain a word feature set;
calculating the word frequency-inverse document frequency (TF-IDF) value of each word feature in the word feature set;
training a preset classifier based on the TF-IDF values to obtain a target classifier;
and inputting the questions to be labeled into the target classifier, and outputting the target labels of the questions to be labeled.
6. An apparatus for tagging a title, comprising:
the system comprises a first determining module, a second determining module and a third determining module, wherein the first determining module is used for determining type information of a topic to be labeled, and the type information at least comprises a calculation class and an application class;
and the second determining module is used for determining the target label of the title to be labeled based on the type information of the title to be labeled.
7. The apparatus for labeling topics according to claim 6, wherein when the type information of the topic to be labeled is the calculation class, the second determining module comprises:
a first obtaining unit, configured to obtain target text information of the to-be-labeled question, where the target text information at least includes a question stem and an answer of the to-be-labeled question;
the first determining unit is used for determining a target arithmetic expression based on the target character information;
the first processing unit is used for analyzing the target arithmetic expression based on a context-free grammar algorithm to obtain a target quadruplet;
the second determining unit is used for determining a target label of the to-be-labeled question based on the target quadruple, the target arithmetic expression and a preset division rule;
or, alternatively,
when the type information of the title to be labeled is the application class, the second determining module comprises:
a second obtaining unit, configured to obtain target text information of the to-be-labeled question, where the target text information at least includes a question stem and an analysis of the to-be-labeled question;
the second processing unit is used for performing word segmentation and stop-word removal on the target text information to obtain a word feature set;
the calculation unit is used for calculating the word frequency-inverse document frequency (TF-IDF) value of each word feature in the word feature set;
the third processing unit is used for training a preset classifier based on the TF-IDF values to obtain a target classifier;
and the output unit is used for inputting the to-be-labeled question into the target classifier and outputting the target label of the to-be-labeled question.
8. The apparatus for labeling items according to claim 6, wherein said first determining unit comprises:
the first processing subunit is used for carrying out data cleaning on the target text information based on a regular expression to obtain de-noised text information;
the second processing subunit is used for extracting information from the de-noised text information to obtain the target arithmetic expression;
and/or the second determination unit comprises:
a first obtaining subunit, configured to obtain a maximum operand in the target quadruple;
the second acquisition subunit is used for acquiring the bracket information in the target arithmetic expression;
and the determining subunit is configured to determine, based on the maximum operand and the parenthesis information, that a tag in the preset partitioning rule, which meets the maximum operand and the parenthesis information, is a target tag of the to-be-tagged title.
9. A storage medium comprising a stored program, wherein the program, when executed, controls an apparatus on which the storage medium is located to perform the method for tagging titles according to any one of claims 1 to 5.
10. An electronic device comprising at least one processor, and at least one memory and a bus connected to the processor; the processor and the memory communicate with each other through the bus; the processor is configured to call program instructions in the memory to perform the method of tagging topics according to any one of claims 1 to 5.
CN201911301442.9A 2019-12-17 2019-12-17 Method and device for labeling subjects, storage medium and electronic equipment Withdrawn CN112989760A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911301442.9A CN112989760A (en) 2019-12-17 2019-12-17 Method and device for labeling subjects, storage medium and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911301442.9A CN112989760A (en) 2019-12-17 2019-12-17 Method and device for labeling subjects, storage medium and electronic equipment

Publications (1)

Publication Number Publication Date
CN112989760A true CN112989760A (en) 2021-06-18

Family

ID=76342214

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911301442.9A Withdrawn CN112989760A (en) 2019-12-17 2019-12-17 Method and device for labeling subjects, storage medium and electronic equipment

Country Status (1)

Country Link
CN (1) CN112989760A (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106250475A (en) * 2016-07-29 2016-12-21 广东小天才科技有限公司 The method for pushing of a kind of script and device
CN106599054A (en) * 2016-11-16 2017-04-26 福建天泉教育科技有限公司 Method and system for title classification and push
CN107301165A (en) * 2016-04-14 2017-10-27 科大讯飞股份有限公司 A kind of item difficulty analysis method and system
CN107894999A (en) * 2017-10-27 2018-04-10 成都准星云学科技有限公司 Towards the topic type automatic classification method and system based on thinking of solving a problem of elementary mathematics
KR20180073440A (en) * 2016-12-22 2018-07-02 주식회사 에스비앤컴퍼니 Apparatus and method for extracting similar test problem using recognition of test paper
CN108363699A (en) * 2018-03-21 2018-08-03 浙江大学城市学院 A kind of netizen's school work mood analysis method based on Baidu's mhkc

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107301165A (en) * 2016-04-14 2017-10-27 科大讯飞股份有限公司 A kind of item difficulty analysis method and system
CN106250475A (en) * 2016-07-29 2016-12-21 广东小天才科技有限公司 The method for pushing of a kind of script and device
CN106599054A (en) * 2016-11-16 2017-04-26 福建天泉教育科技有限公司 Method and system for title classification and push
KR20180073440A (en) * 2016-12-22 2018-07-02 주식회사 에스비앤컴퍼니 Apparatus and method for extracting similar test problem using recognition of test paper
CN107894999A (en) * 2017-10-27 2018-04-10 成都准星云学科技有限公司 Towards the topic type automatic classification method and system based on thinking of solving a problem of elementary mathematics
CN108363699A (en) * 2018-03-21 2018-08-03 浙江大学城市学院 A kind of netizen's school work mood analysis method based on Baidu's mhkc

Similar Documents

Publication Publication Date Title
US10984032B2 (en) Relation extraction using co-training with distant supervision
CA3052527C (en) Target document template generation
US10902326B2 (en) Relation extraction using co-training with distant supervision
CN105389307A (en) Statement intention category identification method and apparatus
US10824812B2 (en) Method and apparatus for informative training repository building in sentiment analysis model learning and customization
CN106682192A (en) Method and device for training answer intention classification model based on search keywords
CN106970912A (en) Chinese sentence similarity calculating method, computing device and computer-readable storage medium
US10956469B2 (en) System and method for metadata correlation using natural language processing
CN107908616B (en) Method and device for predicting trend words
CN103106211B (en) Emotion recognition method and emotion recognition device for customer consultation texts
CN109597983A (en) A kind of spelling error correction method and device
CN111782793A (en) Intelligent customer service processing method, system and equipment
JP2011198203A (en) Document classifying device, document classifying method, program, and storage medium
US9563847B2 (en) Apparatus and method for building and using inference engines based on representations of data that preserve relationships between objects
CN116542800A (en) Intelligent financial statement analysis system based on cloud AI technology
Zhong et al. Natural language processing for systems engineering: automatic generation of systems modelling language diagrams
CN104881446A (en) Searching method and searching device
CN106033444B (en) Text content clustering method and device
US20200110834A1 (en) Dynamic Linguistic Assessment and Measurement
CN111881264B (en) Method and electronic equipment for searching long text in question-answering task in open field
CN116720489A (en) Page filling method and device, electronic equipment and computer readable storage medium
CN113742450B (en) Method, device, electronic equipment and storage medium for user data grade falling label
CN112989760A (en) Method and device for labeling subjects, storage medium and electronic equipment
Naik et al. An adaptable scheme to enhance the sentiment classification of Telugu language
CN115470790A (en) Method and device for identifying named entities in file

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
WW01 Invention patent application withdrawn after publication
WW01 Invention patent application withdrawn after publication

Application publication date: 20210618