CN112989760A - Method and device for labeling subjects, storage medium and electronic equipment - Google Patents

Method and device for labeling subjects, storage medium and electronic equipment

Info

Publication number
CN112989760A
CN112989760A (application CN201911301442.9A)
Authority
CN
China
Prior art keywords
target
labeled
information
determining
question
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
CN201911301442.9A
Other languages
Chinese (zh)
Inventor
张弦
程辉
邢德阳
董平
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Yiyi Education Information Consulting Co ltd
Original Assignee
Beijing Yiyi Education Information Consulting Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Yiyi Education Information Consulting Co ltd filed Critical Beijing Yiyi Education Information Consulting Co ltd
Priority to CN201911301442.9A priority Critical patent/CN112989760A/en
Publication of CN112989760A publication Critical patent/CN112989760A/en
Withdrawn legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Machine Translation (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The embodiment of the invention provides a method, an apparatus, a storage medium and an electronic device for labeling questions. The scheme labels questions automatically based on their type information, thereby addressing the low efficiency and high cost of manual labeling.

Description

Method and device for labeling subjects, storage medium and electronic equipment
Technical Field
The invention relates to the technical field of document classification, and in particular to a method and an apparatus for tagging questions, a storage medium, and an electronic device.
Background
Generally, questions are organized by tagging them; a tag may indicate, for example, the question type, difficulty level, or knowledge point, so that each question is associated with a given dimension. At present, tagging is done manually. This is accurate, but it is inefficient and requires a large amount of time and labor.
Therefore, how to tag questions in a way that preserves accuracy while improving efficiency and reducing cost is a significant technical problem to be solved by those skilled in the art.
Disclosure of Invention
In view of this, the embodiment of the present invention provides a method for labeling questions, which can ensure accuracy, improve efficiency, and reduce cost.
In order to achieve the above purpose, the embodiments of the present invention provide the following technical solutions:
a method of tagging a topic, comprising:
determining type information of a topic to be labeled, wherein the type information at least comprises a calculation class and an application class;
and determining the target label of the title to be labeled based on the type information of the title to be labeled.
Optionally, when the type information of the to-be-labeled question is the calculation class, determining the target label of the to-be-labeled question based on the type information of the to-be-labeled question, including:
acquiring target text information of the question to be labeled, wherein the target text information at least comprises a question stem and an answer of the question to be labeled;
determining a target arithmetic expression based on the target character information;
analyzing the target arithmetic expression based on a context-free grammar algorithm to obtain a target quadruple;
and determining the target label of the question to be labeled based on the target quadruple, the target arithmetic expression and a preset division rule.
Optionally, the determining a target arithmetic expression based on the target text information includes:
based on a regular expression, carrying out data cleaning on the target character information to obtain de-noised character information;
and extracting information of the de-noised character information to obtain the target arithmetic expression.
Optionally, the determining a target tag of the to-be-tagged question based on the target quadruple, the target arithmetic expression and a preset partition rule includes:
acquiring a maximum operand in the target quadruple;
acquiring bracket information in the target arithmetic expression;
and determining that the label meeting the maximum operand and the bracket information in the preset division rule is the target label of the to-be-labeled question based on the maximum operand and the bracket information.
Optionally, when the type information of the to-be-labeled title is the application class, determining the target label of the to-be-labeled title based on the type information of the to-be-labeled title includes:
acquiring target text information of the subject to be labeled, wherein the target text information at least comprises a subject stem and analysis of the subject to be labeled;
performing word segmentation and stop-word removal on the target text information to obtain a word feature set;
calculating the word frequency-inverse document frequency (TF-IDF) value of each word feature in the word feature set;
training a preset classifier based on the TF-IDF values to obtain a target classifier;
and inputting the questions to be labeled into the target classifier, and outputting the target labels of the questions to be labeled.
An apparatus for tagging a title, comprising:
the system comprises a first determining module, a second determining module and a third determining module, wherein the first determining module is used for determining type information of a topic to be labeled, and the type information at least comprises a calculation class and an application class;
and the second determining module is used for determining the target label of the title to be labeled based on the type information of the title to be labeled.
Optionally, when the type information of the topic to be labeled is the calculation class, the second determining module includes:
a first obtaining unit, configured to obtain target text information of the to-be-labeled question, where the target text information at least includes a question stem and an answer of the to-be-labeled question;
the first determining unit is used for determining a target arithmetic expression based on the target character information;
the first processing unit is used for analyzing the target arithmetic expression based on a context-free grammar algorithm to obtain a target quadruplet;
the second determining unit is used for determining a target label of the to-be-labeled question based on the target quadruple, the target arithmetic expression and a preset division rule;
or, alternatively,
when the type information of the title to be labeled is the application class, the second determining module comprises:
a second obtaining unit, configured to obtain target text information of the to-be-labeled question, where the target text information at least includes a question stem and an analysis of the to-be-labeled question;
the second processing unit is used for performing word segmentation and stop-word removal on the target text information to obtain a word feature set;
the calculation unit is used for calculating the word frequency-inverse document frequency (TF-IDF) value of each word feature in the word feature set;
the third processing unit is used for training a preset classifier based on the TF-IDF values to obtain a target classifier;
and the output unit is used for inputting the to-be-labeled question into the target classifier and outputting the target label of the to-be-labeled question.
Optionally, the first determining unit includes:
the first processing subunit is used for carrying out data cleaning on the target text information based on a regular expression to obtain de-noised text information;
the second processing subunit is used for extracting information from the de-noised text information to obtain the target arithmetic expression;
and/or the second determination unit comprises:
a first obtaining subunit, configured to obtain a maximum operand in the target quadruple;
the second acquisition subunit is used for acquiring the bracket information in the target arithmetic expression;
and the determining subunit is configured to determine, based on the maximum operand and the parenthesis information, that a tag in the preset partitioning rule, which meets the maximum operand and the parenthesis information, is a target tag of the to-be-tagged title.
A storage medium, the storage medium comprising a stored program, wherein when the program runs, a device on which the storage medium is located is controlled to execute any one of the above methods for tagging titles.
An electronic device comprising at least one processor, and at least one memory and a bus connected to the processor; the processor and the memory communicate with each other through the bus; the processor is configured to invoke program instructions in the memory to perform any of the above methods for tagging questions.
Based on the above technical scheme, the embodiment of the invention provides a method, an apparatus, a storage medium and an electronic device for labeling questions. The scheme labels questions automatically based on their type information, thereby addressing the low efficiency and high cost of manual labeling.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art are briefly introduced below. It is obvious that the drawings in the following description show only embodiments of the present invention, and those skilled in the art can obtain other drawings from the provided drawings without creative effort.
FIG. 1 is a schematic flow chart of a method for tagging a title according to an embodiment of the present invention;
FIG. 2 is a schematic flow chart of a method for tagging a title according to an embodiment of the present invention;
FIG. 3 is a schematic flow chart of a method for tagging a title according to an embodiment of the present invention;
FIG. 4 is a schematic flow chart of a method for tagging a title according to an embodiment of the present invention;
FIG. 5 is a schematic flow chart of a method for tagging a title according to an embodiment of the present invention;
FIG. 6 is a schematic structural diagram of an apparatus for labeling a title according to an embodiment of the present invention;
fig. 7 is a hardware architecture diagram of a system for tagging titles according to an embodiment of the present invention.
Detailed Description
Referring to fig. 1, fig. 1 is a schematic flow chart of a method for tagging titles according to an embodiment of the present invention, where the method includes:
and S11, determining the type information of the title to be labeled.
S12, determining the target label of the title to be labeled based on the type information of the title to be labeled.
It should be noted that, in this embodiment, the type information at least includes a calculation class and an application class. In this scheme, questions are labeled automatically based on their type information, which addresses the low labeling efficiency and high cost of manual labeling.
Specifically, because the type information is different, the embodiment of the present invention provides two specific implementation manners for determining the target tag of the topic to be tagged based on the type information of the topic to be tagged, which are as follows:
in a first mode
When the type information of the to-be-labeled question is the calculation class, as shown in fig. 2, determining the target label of the to-be-labeled question based on the type information of the to-be-labeled question includes the steps of:
s21, obtaining target text information of the to-be-labeled question, wherein the target text information at least comprises a question stem and an answer of the to-be-labeled question.
For example, the question stem of the original question (e.g. <p>2\times(99+6)=$__$1__</p>), the answer (210), and other information are taken from the database.
S22, determining a target arithmetic expression based on the target character information;
specifically, as shown in fig. 3, an embodiment of the present invention provides a specific implementation manner for determining a target arithmetic expression based on the target text information, including the steps of:
s31, based on the regular expression, carrying out data cleaning on the target character information to obtain de-noised character information;
and S32, extracting information of the de-noised character information to obtain the target arithmetic expression.
Illustratively, useless information such as the html tags (<p></p>) and spaces in the text is cleaned away with regular expressions and similar techniques, yielding the de-noised question stem (2\times(99+6)=__$1__) and answer (210). The LaTeX operator (\times) is then replaced by the corresponding arithmetic operator (×), the answer (210) is substituted into the blank (__$1__), and the arithmetic expression (2×(99+6)=210) is extracted.
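As an illustrative aside (not part of the original disclosure), this cleaning and extraction step can be sketched in Python as follows; the blank marker $__$1__ and the helper name extract_expression are assumptions drawn from the example above:

import re

def extract_expression(stem: str, answer: str) -> str:
    # Strip HTML tags and whitespace noise.
    text = re.sub(r"<[^>]+>", "", stem)
    text = re.sub(r"\s+", "", text)
    # Map LaTeX operators to plain arithmetic operators.
    text = text.replace(r"\times", "×").replace(r"\div", "÷")
    # Substitute the answer into the blank placeholder and drop leftover '$'.
    text = re.sub(r"\$?__\$?1?__", answer, text)
    return text.replace("$", "")

print(extract_expression(r"<p>2\times(99+6)=$__$1__</p>", "210"))
# prints: 2×(99+6)=210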
S23, analyzing the target arithmetic expression based on a context-free grammar algorithm to obtain a target quadruple;
the arithmetic expressions have been extracted above, and then, for the extracted arithmetic expressions, the latter are parsed by a context-free grammar and stored by a tree of quadruplets, such as: [ '+,99,6, TREE1', 'C'
×,2,TREE1,TREE2','=,TREE2,210,TREE3']。
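As a non-authoritative illustration of this parsing step, the following Python sketch uses a small recursive-descent parser (one possible realization of a context-free grammar for +, -, ×, ÷ and parentheses) to produce the quadruple list shown above; the function name parse_to_quads is hypothetical:

import re

def parse_to_quads(expr: str):
    tokens = re.findall(r"\d+|[()+\-×÷=]", expr)
    quads, pos = [], 0

    def new_node(op, left, right):
        # Each intermediate result becomes a named TREE node in a quadruple.
        name = f"TREE{len(quads) + 1}"
        quads.append(f"{op},{left},{right},{name}")
        return name

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def factor():
        nonlocal pos
        tok = tokens[pos]; pos += 1
        if tok == "(":
            node = add_sub()
            pos += 1          # skip the closing ")"
            return node
        return tok            # a plain number

    def mul_div():
        nonlocal pos
        node = factor()
        while peek() in ("×", "÷"):
            op = tokens[pos]; pos += 1
            node = new_node(op, node, factor())
        return node

    def add_sub():
        nonlocal pos
        node = mul_div()
        while peek() in ("+", "-"):
            op = tokens[pos]; pos += 1
            node = new_node(op, node, mul_div())
        return node

    left = add_sub()
    if peek() == "=":
        pos += 1
        new_node("=", left, add_sub())
    return quads

print(parse_to_quads("2×(99+6)=210"))
# prints: ['+,99,6,TREE1', '×,2,TREE1,TREE2', '=,TREE2,210,TREE3']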
It should be noted that the inventor considers the knowledge points of the calculation (mental arithmetic) class to be fine-grained: addition, for example, must be refined into addition within 10 or within 100. Therefore, when labeling knowledge points, the sequence of calculation steps and the intermediate result of each step must be known.
S24, determining the target label of the to-be-labeled question based on the target quadruple, the target arithmetic expression and a preset division rule.
Specifically, as shown in fig. 4, an embodiment of the present invention provides a specific implementation manner for determining a target tag of a topic to be tagged based on the target quadruple, the target arithmetic expression, and a preset partitioning rule, including:
s41, acquiring the maximum operand in the target quadruple;
s42, acquiring bracket information in the target arithmetic expression;
s43, determining that the label meeting the maximum operand and the bracket information in the preset dividing rule is the target label of the to-be-labeled question based on the maximum operand and the bracket information.
That is, this step interprets the quadruples and the arithmetic expression against the domain background rules and labels the question with a knowledge point. For example, when the above quadruples are resolved there are an addition and a multiplication: the maximum operand of the addition is 99, and the maximum operand of the multiplication is 105 (the intermediate result obtained by resolving TREE1). The arithmetic expression is then analyzed; because it contains brackets, this information is combined with the domain background rules and the question is labeled with the knowledge point "four arithmetic operations with brackets (within 1000)". If the extracted arithmetic expression were 2×(9+6)=30, the question would be labeled with the knowledge point "four arithmetic operations with brackets (within 100)"; if it were 2×9+6=24, the question would be labeled "multiply-add (7-9 in the table)", where "7-9 in the table" means the multiplication falls within the nine-nine multiplication table and its maximum operand is between 7 and 9.
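A minimal sketch of this rule step, assuming a simplified partition rule: the thresholds and label strings below are illustrative placeholders for the actual business rules, and pick_label is a hypothetical helper:

def pick_label(quads, expr):
    # Resolve intermediate TREE results so the max operand reflects real values.
    values, max_operand = {}, 0.0
    for quad in quads:
        op, a, b, name = quad.split(",")
        va = values.get(a, float(a) if a.replace(".", "").isdigit() else 0)
        vb = values.get(b, float(b) if b.replace(".", "").isdigit() else 0)
        if op != "=":
            max_operand = max(max_operand, va, vb)
        values[name] = {"+": va + vb, "-": va - vb,
                        "×": va * vb, "÷": va / vb if vb else 0}.get(op, vb)
    has_brackets = "(" in expr
    if has_brackets and max_operand < 100:
        return "four arithmetic operations with brackets (within 100)"
    if has_brackets and max_operand < 1000:
        return "four arithmetic operations with brackets (within 1000)"
    return "mixed operations (within 1000)"

quads = ['+,99,6,TREE1', '×,2,TREE1,TREE2', '=,TREE2,210,TREE3']
print(pick_label(quads, "2×(99+6)=210"))
# prints: four arithmetic operations with brackets (within 1000)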
Mode two
When the type information of the topic to be labeled is the application class, as shown in fig. 5, an embodiment of the present invention provides a specific implementation manner for determining a target label of the topic to be labeled based on the type information of the topic to be labeled, including:
s51, acquiring target text information of the to-be-labeled question, wherein the target text information at least comprises a question stem and an analysis of the to-be-labeled question;
S52, performing word segmentation and stop-word removal on the target text information to obtain a word feature set;
S53, calculating the word frequency-inverse document frequency (TF-IDF) value of each word feature in the word feature set;
S54, training a preset classifier based on the TF-IDF values to obtain a target classifier;
s55, inputting the to-be-labeled question into the target classifier, and outputting the target label of the to-be-labeled question.
That is, when the type of the question to be labeled is the application class, a text classification method (a form of supervised machine learning) is used. It has two parts: training a classifier, and using the classifier to assign a knowledge point to a new question, as follows:
firstly, training a classifier:
1. data cleaning:
Collect a batch of questions with manually labeled knowledge point labels as a data set, take the text information of each question (question stem, analysis, etc.) from the database, clean out useless information such as html tags, spaces, numbers and operators using regular expressions and similar techniques, and splice the question stem and analysis of each question together to obtain the text information of each question.
for example, the labeled knowledge point label of topic Q1 is "basic formula of travel problem", and the original topic stem + is resolved as: "cheetah runs 30 meters per second, 20 seconds __ $ __ meters. < p > [ hint ] the distance is speed x time. [ p > < p > [ solution ] cheetah runs 30 meters per second, 20 seconds: 30 × 20 ═ 600 (meters). And the text information obtained after cleaning is' cheetah runs meter per second and second runs meter. And prompting the distance speed time. Solving the cheetah to run rice per second, running second: and (4) rice. "; the labeled knowledge point label of topic Q2 is "engineering problem basic formula", and the original topic stem + is analyzed as: "one paving team paves 500 meters for 25 hours. Calculated as such, 8.5 hours of paving __ $ $ __ meters. The work efficiency is work total amount divided by work time; the total work amount is work efficiency × work time. [ p > < p > [ answer ] A paving team paves 500 meters in 25 hours, then paves every hour: 500 ÷ 25 ÷ 20 (meters); calculated as such, 8.5 hours paving: 20 × 8.5 ═ 170 (m). The road is paved for 8.5 hours by 170 meters. </p >, the text information obtained after cleaning is "paving rice when one paving team is used for paving a road. The hours of paving rice were calculated as such. Prompting the working efficiency and the total working time; total work amount work efficiency work time. Solving one paving team to pave rice for one hour, paving the rice for each hour: rice; calculated as such, hourly paving: and (4) rice. And paving the rice when answering. "
2. Data preprocessing:
Perform Chinese word segmentation on the text information of each question in the data set, remove stop words (punctuation marks and function words such as "is", "a", "hint", "answer" and the like), and gather and de-duplicate all remaining words to form the word feature set for text classification. For example, Q1 yields a word sequence such as "cheetah runs meters per second seconds runs meters distance speed time cheetah runs meters per second seconds meters", Q2 yields "paving team hours paves road meters work efficiency total work work time paving team hours paves meters per hour meters answer paving meters", and the combined, de-duplicated word feature set contains 19 word features such as "cheetah", "run", "meter", "second", "distance", "speed", "time", "pave", "team", "hour", "work", "efficiency" and so on. Each question is then expressed with a vector space model to obtain its vectorized representation: each column of the vector represents one word feature in the word feature set obtained above, and its value is the TF-IDF (word frequency times inverse document frequency) value of the question for that feature. TF-IDF(t, d) = TF(t, d) × IDF(t), where t denotes a word and d denotes a document (in this example, one question text). TF(t, d) is the number of times word t appears in document d divided by the total number of words in document d. For example, the word "cheetah" appears 2 times in Q1, whose total number of words is 17, and 0 times in Q2, whose total number of words is 39; so the TF (word frequency) of "cheetah" is 2/17 ≈ 0.118 in Q1 and 0/39 = 0 in Q2. IDF(t) = log[n/DF(t)] + 1, where n denotes the size of the document set; in this example there are only the two questions Q1 and Q2, so n = 2. DF(t) is the number of documents containing the word t: the word "cheetah" appears only in Q1, so its DF value is 1, while the word "meter" appears in both Q1 and Q2, so its DF value is 2. Following this algorithm, a TF-IDF value is computed with respect to Q1 and Q2 for each of the 19 words in the word feature set, giving two 19-dimensional vectors that represent the two questions Q1 and Q2.
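For illustration, the TF-IDF computation described above can be sketched as follows; the toy word lists stand in for the real segmented, stop-word-filtered question texts, and the formulas follow the ones stated in this paragraph (TF = count / total words, IDF = log(n / DF) + 1):

import math
from collections import Counter

def tfidf_vectors(docs):
    # docs: list of questions, each already segmented into a list of words.
    vocab = sorted({w for doc in docs for w in doc})
    n = len(docs)
    df = {w: sum(1 for doc in docs if w in doc) for w in vocab}
    vectors = []
    for doc in docs:
        counts, total = Counter(doc), len(doc)
        vectors.append([(counts[w] / total) * (math.log(n / df[w]) + 1)
                        for w in vocab])
    return vocab, vectors

q1 = ["cheetah", "run", "meter", "second", "run", "meter",
      "distance", "speed", "time", "cheetah"]
q2 = ["pave", "team", "hour", "pave", "meter", "work",
      "efficiency", "total", "work", "time"]
vocab, vectors = tfidf_vectors([q1, q2])
print(vocab)          # the word feature set (one column per word feature)
print(vectors[0])     # TF-IDF vector representing question Q1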
3. Model training:
The vector calculated for each question is used as X, the question's knowledge point label as y, and (X, y) as a training sample. The processed data are used to train an SGDClassifier model (a classification model trained by stochastic gradient descent) with the modified_huber loss function. For each input X, the SGDClassifier computes the probability that the input belongs to each knowledge point label and takes the label with the highest probability as its output; if there is an error (expressed by the loss function) between the model output and the actual label, the parameters of the SGDClassifier are adjusted during training in the direction that reduces this error. Training finally yields a text classifier.
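Assuming the scikit-learn library (the text names SGDClassifier and the modified_huber loss but does not fix the surrounding tooling), training might look like the following sketch; the sample texts and labels are placeholders:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import make_pipeline

# Cleaned, space-joined question texts (X) and their manual labels (y).
texts = ["cheetah run meter second distance speed time",
         "pave team hour meter work efficiency total work time"]
labels = ["basic formula of travel problems",
          "basic formula of engineering problems"]

# TF-IDF vectorization followed by an SGD-trained linear classifier.
classifier = make_pipeline(TfidfVectorizer(),
                           SGDClassifier(loss="modified_huber"))
classifier.fit(texts, labels)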
Secondly, marking knowledge points on the new questions by using a classifier:
Take the text information (question stem, analysis, etc.) of a new question whose knowledge point label is unknown from the database, and carry out the same data cleaning and data preprocessing as in the training part to obtain the vectorized representation of the question.
Take this vector as the input of the text classifier model obtained in the training part and predict the result (the model calculates the probability that the input belongs to each knowledge point label and takes the label with the highest probability as its prediction); this result is used as the knowledge point label of the question.
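Continuing the training sketch above, prediction on a new, cleaned question could look like this; with the modified_huber loss, predict_proba is available, so the label with the highest probability can be read off directly:

# Vectorize the cleaned new question with the same pipeline and predict.
new_question = ["team pave meter hour work efficiency time"]
probabilities = classifier.predict_proba(new_question)[0]
best = probabilities.argmax()
print(classifier.classes_[best], probabilities[best])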
In this way, the embodiment of the invention determines the target label based on the type information of the question to be labeled: for calculation-class questions the arithmetic expression is parsed with a context-free grammar and the intermediate steps are stored as a quadruple tree, while application-class questions are labeled using a text classification approach.
In addition, on the basis of the above embodiments, an embodiment of the present invention further provides an apparatus for tagging titles, as shown in fig. 6, including:
the first determining module 61 is configured to determine type information of a topic to be labeled, where the type information at least includes a calculation class and an application class;
and a second determining module 62, configured to determine a target tag of the topic to be tagged based on the type information of the topic to be tagged.
When the type information of the to-be-labeled question is the calculation class, the second determining module may include:
a first obtaining unit, configured to obtain target text information of the to-be-labeled question, where the target text information at least includes a question stem and an answer of the to-be-labeled question;
the first determining unit is used for determining a target arithmetic expression based on the target character information;
the first processing unit is used for analyzing the target arithmetic expression based on a context-free grammar algorithm to obtain a target quadruplet;
the second determining unit is used for determining a target label of the to-be-labeled question based on the target quadruple, the target arithmetic expression and a preset division rule;
or, alternatively,
when the type information of the to-be-labeled title is the application class, the second determining module may include:
a second obtaining unit, configured to obtain target text information of the to-be-labeled question, where the target text information at least includes a question stem and an analysis of the to-be-labeled question;
the second processing unit is used for performing word segmentation and stop-word removal on the target text information to obtain a word feature set;
the calculation unit is used for calculating the word frequency-inverse document frequency (TF-IDF) value of each word feature in the word feature set;
the third processing unit is used for training a preset classifier based on the TF-IDF values to obtain a target classifier;
and the output unit is used for inputting the to-be-labeled question into the target classifier and outputting the target label of the to-be-labeled question.
In addition, the first determination unit may include:
the first processing subunit is used for carrying out data cleaning on the target text information based on a regular expression to obtain de-noised text information;
the second processing subunit is used for extracting information from the de-noised text information to obtain the target arithmetic expression;
and/or the second determination unit comprises:
a first obtaining subunit, configured to obtain a maximum operand in the target quadruple;
the second acquisition subunit is used for acquiring the bracket information in the target arithmetic expression;
and the determining subunit is configured to determine, based on the maximum operand and the parenthesis information, that a tag in the preset partitioning rule, which meets the maximum operand and the parenthesis information, is a target tag of the to-be-tagged title.
The title labeling device comprises a processor and a memory, wherein the first determining module, the second determining module and the like are stored in the memory as program units, and the processor executes the program units stored in the memory to realize corresponding functions.
The processor comprises a kernel, and the kernel calls the corresponding program unit from the memory. One or more kernels may be provided, and by adjusting kernel parameters the problems of low labeling efficiency and high cost caused by manual labeling are addressed.
An embodiment of the present invention provides a storage medium, on which a program is stored, where the program, when executed by a processor, implements the method for tagging topics.
The embodiment of the invention provides a processor, which is used for running a program, wherein the method for tagging titles is executed when the program runs.
An embodiment of the present invention provides an apparatus. As shown in fig. 7, the apparatus includes at least one processor 71, at least one memory 72 connected to the processor, and a bus 73; the processor and the memory communicate with each other through the bus; the processor is used for calling the program instructions in the memory to execute the method for tagging questions. The device here may be a server, a PC, a PAD, a mobile phone, and the like.
The present application further provides a computer program product adapted to perform a program for initializing the following method steps when executed on a data processing device:
a method of tagging a topic, comprising:
determining type information of a topic to be labeled, wherein the type information at least comprises a calculation class and an application class;
and determining the target label of the title to be labeled based on the type information of the title to be labeled.
Optionally, when the type information of the to-be-labeled question is the calculation class, determining the target label of the to-be-labeled question based on the type information of the to-be-labeled question, including:
acquiring target text information of the question to be labeled, wherein the target text information at least comprises a question stem and an answer of the question to be labeled;
determining a target arithmetic expression based on the target character information;
analyzing the target arithmetic expression based on a context-free grammar algorithm to obtain a target quadruple;
and determining the target label of the question to be labeled based on the target quadruple, the target arithmetic expression and a preset division rule.
Optionally, the determining a target arithmetic expression based on the target text information includes:
based on a regular expression, carrying out data cleaning on the target character information to obtain de-noised character information;
and extracting information of the de-noised character information to obtain the target arithmetic expression.
Optionally, the determining a target tag of the to-be-tagged question based on the target quadruple, the target arithmetic expression and a preset partition rule includes:
acquiring a maximum operand in the target quadruple;
acquiring bracket information in the target arithmetic expression;
and determining that the label meeting the maximum operand and the bracket information in the preset division rule is the target label of the to-be-labeled question based on the maximum operand and the bracket information.
Optionally, when the type information of the to-be-labeled title is the application class, determining the target label of the to-be-labeled title based on the type information of the to-be-labeled title includes:
acquiring target text information of the subject to be labeled, wherein the target text information at least comprises a subject stem and analysis of the subject to be labeled;
performing word segmentation and stop-word removal on the target text information to obtain a word feature set;
calculating the word frequency-inverse document frequency (TF-IDF) value of each word feature in the word feature set;
training a preset classifier based on the TF-IDF values to obtain a target classifier;
and inputting the questions to be labeled into the target classifier, and outputting the target labels of the questions to be labeled.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
In a typical configuration, a device includes one or more processors (CPUs), memory, and a bus. The device may also include input/output interfaces, network interfaces, and the like.
The memory may include volatile memory in a computer readable medium, Random Access Memory (RAM) and/or nonvolatile memory such as Read Only Memory (ROM) or flash memory (flash RAM), and the memory includes at least one memory chip. The memory is an example of a computer-readable medium.
Computer-readable media include permanent and non-permanent, removable and non-removable media, and may implement information storage by any method or technology. The information may be computer readable instructions, data structures, modules of a program, or other data. Examples of computer storage media include, but are not limited to, phase change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read only memory (ROM), electrically erasable programmable read only memory (EEPROM), flash memory or other memory technology, compact disc read only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer readable media do not include transitory computer readable media such as modulated data signals and carrier waves.
It should also be noted that the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a(n) ..." does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises the element.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The above are merely examples of the present application and are not intended to limit the present application. Various modifications and changes may occur to those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present application should be included in the scope of the claims of the present application.

Claims (10)

1. A method for tagging a topic, comprising:
determining type information of a topic to be labeled, wherein the type information at least comprises a calculation class and an application class;
and determining the target label of the title to be labeled based on the type information of the title to be labeled.
2. The method according to claim 1, wherein when the type information of the topic to be labeled is the calculation class, the determining a target label of the topic to be labeled based on the type information of the topic to be labeled comprises:
acquiring target text information of the question to be labeled, wherein the target text information at least comprises a question stem and an answer of the question to be labeled;
determining a target arithmetic expression based on the target character information;
analyzing the target arithmetic expression based on a context-free grammar algorithm to obtain a target quadruple;
and determining the target label of the question to be labeled based on the target quadruple, the target arithmetic expression and a preset division rule.
3. The method for tagging a topic according to claim 2, wherein the determining a target arithmetic expression based on the target textual information comprises:
based on a regular expression, carrying out data cleaning on the target character information to obtain de-noised character information;
and extracting information of the de-noised character information to obtain the target arithmetic expression.
4. The method for tagging topics according to claim 2, wherein the determining the target tag of the topic to be tagged based on the target quadruplet, the target arithmetic expression and a preset partitioning rule comprises:
acquiring a maximum operand in the target quadruple;
acquiring bracket information in the target arithmetic expression;
and determining that the label meeting the maximum operand and the bracket information in the preset division rule is the target label of the to-be-labeled question based on the maximum operand and the bracket information.
5. The method according to claim 1, wherein when the type information of the topic to be labeled is the application class, determining the target label of the topic to be labeled based on the type information of the topic to be labeled comprises:
acquiring target text information of the subject to be labeled, wherein the target text information at least comprises a subject stem and analysis of the subject to be labeled;
performing word segmentation and stop-word removal on the target text information to obtain a word feature set;
calculating the word frequency-inverse document frequency (TF-IDF) value of each word feature in the word feature set;
training a preset classifier based on the TF-IDF values to obtain a target classifier;
and inputting the questions to be labeled into the target classifier, and outputting the target labels of the questions to be labeled.
6. An apparatus for tagging a title, comprising:
the system comprises a first determining module, a second determining module and a third determining module, wherein the first determining module is used for determining type information of a topic to be labeled, and the type information at least comprises a calculation class and an application class;
and the second determining module is used for determining the target label of the title to be labeled based on the type information of the title to be labeled.
7. The apparatus for labeling topics according to claim 6, wherein when the type information of the topic to be labeled is the calculation class, the second determining module comprises:
a first obtaining unit, configured to obtain target text information of the to-be-labeled question, where the target text information at least includes a question stem and an answer of the to-be-labeled question;
the first determining unit is used for determining a target arithmetic expression based on the target character information;
the first processing unit is used for analyzing the target arithmetic expression based on a context-free grammar algorithm to obtain a target quadruplet;
the second determining unit is used for determining a target label of the to-be-labeled question based on the target quadruple, the target arithmetic expression and a preset division rule;
or, alternatively,
when the type information of the title to be labeled is the application class, the second determining module comprises:
a second obtaining unit, configured to obtain target text information of the to-be-labeled question, where the target text information at least includes a question stem and an analysis of the to-be-labeled question;
the second processing unit is used for performing word segmentation and stop-word removal on the target text information to obtain a word feature set;
the calculation unit is used for calculating the word frequency-inverse document frequency (TF-IDF) value of each word feature in the word feature set;
the third processing unit is used for training a preset classifier based on the TF-IDF values to obtain a target classifier;
and the output unit is used for inputting the to-be-labeled question into the target classifier and outputting the target label of the to-be-labeled question.
8. The apparatus for labeling items according to claim 6, wherein said first determining unit comprises:
the first processing subunit is used for carrying out data cleaning on the target text information based on a regular expression to obtain de-noised text information;
the second processing subunit is used for extracting information from the de-noised text information to obtain the target arithmetic expression;
and/or the second determination unit comprises:
a first obtaining subunit, configured to obtain a maximum operand in the target quadruple;
the second acquisition subunit is used for acquiring the bracket information in the target arithmetic expression;
and the determining subunit is configured to determine, based on the maximum operand and the parenthesis information, that a tag in the preset partitioning rule, which meets the maximum operand and the parenthesis information, is a target tag of the to-be-tagged title.
9. A storage medium comprising a stored program, wherein the program, when executed, controls an apparatus on which the storage medium is located to perform the method for tagging titles according to any one of claims 1 to 5.
10. An electronic device comprising at least one processor, and at least one memory and a bus connected to the processor; the processor and the memory communicate with each other through the bus; the processor is configured to call program instructions in the memory to perform the method of tagging topics according to any one of claims 1 to 5.
CN201911301442.9A 2019-12-17 2019-12-17 Method and device for labeling subjects, storage medium and electronic equipment Withdrawn CN112989760A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911301442.9A CN112989760A (en) 2019-12-17 2019-12-17 Method and device for labeling subjects, storage medium and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911301442.9A CN112989760A (en) 2019-12-17 2019-12-17 Method and device for labeling subjects, storage medium and electronic equipment

Publications (1)

Publication Number Publication Date
CN112989760A true CN112989760A (en) 2021-06-18

Family

ID=76342214

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911301442.9A Withdrawn CN112989760A (en) 2019-12-17 2019-12-17 Method and device for labeling subjects, storage medium and electronic equipment

Country Status (1)

Country Link
CN (1) CN112989760A (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106250475A (en) * 2016-07-29 2016-12-21 广东小天才科技有限公司 The method for pushing of a kind of script and device
CN106599054A (en) * 2016-11-16 2017-04-26 福建天泉教育科技有限公司 Method and system for title classification and push
CN107301165A (en) * 2016-04-14 2017-10-27 科大讯飞股份有限公司 A kind of item difficulty analysis method and system
CN107894999A (en) * 2017-10-27 2018-04-10 成都准星云学科技有限公司 Towards the topic type automatic classification method and system based on thinking of solving a problem of elementary mathematics
KR20180073440A (en) * 2016-12-22 2018-07-02 주식회사 에스비앤컴퍼니 Apparatus and method for extracting similar test problem using recognition of test paper
CN108363699A (en) * 2018-03-21 2018-08-03 浙江大学城市学院 A kind of netizen's school work mood analysis method based on Baidu's mhkc

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107301165A (en) * 2016-04-14 2017-10-27 科大讯飞股份有限公司 A kind of item difficulty analysis method and system
CN106250475A (en) * 2016-07-29 2016-12-21 广东小天才科技有限公司 The method for pushing of a kind of script and device
CN106599054A (en) * 2016-11-16 2017-04-26 福建天泉教育科技有限公司 Method and system for title classification and push
KR20180073440A (en) * 2016-12-22 2018-07-02 주식회사 에스비앤컴퍼니 Apparatus and method for extracting similar test problem using recognition of test paper
CN107894999A (en) * 2017-10-27 2018-04-10 成都准星云学科技有限公司 Towards the topic type automatic classification method and system based on thinking of solving a problem of elementary mathematics
CN108363699A (en) * 2018-03-21 2018-08-03 浙江大学城市学院 A kind of netizen's school work mood analysis method based on Baidu's mhkc

Similar Documents

Publication Publication Date Title
US10984032B2 (en) Relation extraction using co-training with distant supervision
CA3052527C (en) Target document template generation
US10902326B2 (en) Relation extraction using co-training with distant supervision
CN105389307A (en) Statement intention category identification method and apparatus
US10824812B2 (en) Method and apparatus for informative training repository building in sentiment analysis model learning and customization
CN106682192A (en) Method and device for training answer intention classification model based on search keywords
CN106970912A (en) Chinese sentence similarity calculating method, computing device and computer-readable storage medium
US10956469B2 (en) System and method for metadata correlation using natural language processing
CN107908616B (en) Method and device for predicting trend words
CN103106211B (en) Emotion recognition method and emotion recognition device for customer consultation texts
CN109597983A (en) A kind of spelling error correction method and device
CN111782793A (en) Intelligent customer service processing method, system and equipment
JP2011198203A (en) Document classifying device, document classifying method, program, and storage medium
US9563847B2 (en) Apparatus and method for building and using inference engines based on representations of data that preserve relationships between objects
CN116542800A (en) Intelligent financial statement analysis system based on cloud AI technology
Zhong et al. Natural language processing for systems engineering: automatic generation of systems modelling language diagrams
CN104881446A (en) Searching method and searching device
CN106033444B (en) Text content clustering method and device
US20200110834A1 (en) Dynamic Linguistic Assessment and Measurement
CN111881264B (en) Method and electronic equipment for searching long text in question-answering task in open field
CN116720489A (en) Page filling method and device, electronic equipment and computer readable storage medium
CN113742450B (en) Method, device, electronic equipment and storage medium for user data grade falling label
CN112989760A (en) Method and device for labeling subjects, storage medium and electronic equipment
Naik et al. An adaptable scheme to enhance the sentiment classification of Telugu language
CN115470790A (en) Method and device for identifying named entities in file

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
WW01 Invention patent application withdrawn after publication
WW01 Invention patent application withdrawn after publication

Application publication date: 20210618