WO2023108991A1 - Model training method and apparatus, knowledge classification method and apparatus, and device and medium - Google Patents

Info

Publication number
WO2023108991A1
Authority
WO
WIPO (PCT)
Prior art keywords
data
answer
option
knowledge
vector
Prior art date
Application number
PCT/CN2022/090718
Other languages
French (fr)
Chinese (zh)
Inventor
舒畅
陈又新
Original Assignee
平安科技(深圳)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 平安科技(深圳)有限公司
Publication of WO2023108991A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36 Creation of semantic tools, e.g. ontology or thesauri
    • G06F16/367 Ontology
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning

Definitions

  • the present application relates to the technical field of machine learning, and in particular to a model training method and apparatus, a knowledge classification method and apparatus, a device, and a medium.
  • machine reading comprehension technology can be used to give answers to questions.
  • Machine reading comprehension is a technology that enables machines to understand natural language text and, given questions and documents, produce the corresponding answers. This technology can be applied in many fields such as text question answering, information extraction in knowledge graphs and event graphs, and dialogue systems.
  • the embodiment of the present application proposes a training method for a knowledge classification model, and the training method for the knowledge classification model includes:
  • the original annotation data includes question stem data, option data and answer data;
  • the preset pre-training model is trained according to the topic data to obtain a knowledge classification model; wherein, the knowledge classification model is used to perform knowledge classification processing on the target topic to obtain the type of knowledge points.
  • the embodiment of the present application proposes a knowledge classification method for multiple-choice questions, and the knowledge classification method for multiple-choice questions includes:
  • the multiple-choice data includes question stem data
  • inputting the question stem representation vector into a knowledge classification model; wherein, the knowledge classification model is obtained by training according to the method described in the first aspect above;
  • Knowledge classification processing is performed according to the feature vector information to obtain knowledge point types.
  • the embodiment of the present application proposes a training device for a knowledge classification model, and the training device for the knowledge classification model includes:
  • the original data acquisition module is used to obtain the original annotation data;
  • the original annotation data includes question stem data, option data and answer data;
  • a question stem coding module configured to encode the question stem data to obtain a question stem representation vector
  • the option answer encoding module is used to encode the option data and answer data according to the preset knowledge graph to obtain option attribute values and answer attribute values;
  • a word segmentation and splicing module used to perform word segmentation and splicing processing on the option attribute value and the answer attribute value to obtain an option answer representation vector
  • a vector splicing module configured to splice the question stem representation vector and the option answer representation vector to obtain topic data
  • the classification model training module is used to train the preset pre-training model according to the topic data to obtain the knowledge classification model; wherein, the knowledge classification model is used to perform knowledge classification processing on the target topic to obtain the type of knowledge points.
  • the embodiment of the present application proposes a knowledge classification device for multiple-choice questions, and the knowledge classification device for multiple-choice questions includes:
  • the multiple-choice data acquisition module is used to obtain multiple-choice data to be classified; wherein, the multiple-choice data includes question stem data, option data and answer data;
  • a data input module configured to input the data of the multiple-choice questions into the knowledge classification model; wherein, the knowledge classification model is trained according to the method described in the first aspect above;
  • a feature extraction module configured to perform feature extraction on the multiple-choice question data through the knowledge classification model to obtain feature vector information
  • the knowledge classification module is configured to perform knowledge classification processing according to the feature vector information to obtain knowledge point types.
  • the embodiment of the present application proposes a computer device, including:
  • the program is stored in the memory, and the processor executes the at least one program to implement a knowledge classification model training method or a knowledge classification method for multiple-choice questions;
  • the knowledge classification model training method includes: obtaining original labeling data; wherein, the original labeling data includes question stem data, option data and answer data; encoding the question stem data to obtain a question stem representation vector; encoding the option data and answer data according to a preset knowledge graph to obtain option attribute values and answer attribute values; performing word segmentation and splicing processing on the option attribute values and the answer attribute values to obtain an option answer representation vector; splicing the question stem representation vector and the option answer representation vector to obtain topic data; training the preset pre-training model according to the topic data to obtain a knowledge classification model; wherein, the knowledge classification model is used to perform knowledge classification processing on the target topic to obtain knowledge point types.
  • the knowledge classification method for multiple-choice questions includes: obtaining multiple-choice question data to be classified; wherein, the multiple-choice question data includes question stem data; encoding the question stem data to obtain question stem representation vectors; The stem representation vector is input to the knowledge classification model; wherein, the knowledge classification model is obtained by training according to the above-mentioned knowledge classification model training method; the feature extraction is performed on the question stem data through the knowledge classification model to obtain feature vector information; Knowledge classification processing is performed according to the feature vector information to obtain knowledge point types.
  • the embodiment of the present application provides a storage medium, the storage medium is a computer-readable storage medium, and the computer-readable storage medium stores computer-executable instructions, and the computer-executable instructions are used to make a computer execute a method for training a knowledge classification model or a method for classifying knowledge for multiple-choice questions;
  • wherein the method for training the knowledge classification model includes: obtaining original label data; wherein the original label data includes question stem data, option data and answer data; encoding the question stem data to obtain a question stem representation vector; encoding the option data and answer data according to a preset knowledge graph to obtain option attribute values and answer attribute values; performing word segmentation and splicing processing on the option attribute values and the answer attribute values to obtain an option answer representation vector; splicing the question stem representation vector and the option answer representation vector to obtain topic data; training a preset pre-training model according to the topic data to obtain a knowledge classification model; wherein, the knowledge classification model is used to perform knowledge classification processing on target topics to obtain knowledge point types.
  • the knowledge classification method for multiple-choice questions includes: obtaining multiple-choice question data to be classified; wherein, the multiple-choice question data includes question stem data; encoding the question stem data to obtain question stem representation vectors; The stem representation vector is input to the knowledge classification model; wherein, the knowledge classification model is obtained by training according to the above-mentioned knowledge classification model training method; the feature extraction is performed on the question stem data through the knowledge classification model to obtain feature vector information; Knowledge classification processing is performed according to the feature vector information to obtain knowledge point types.
  • the training method of the knowledge classification model proposed in the embodiment of the present application yields a model that can be used to perform knowledge classification processing on the target topic and obtain knowledge point types that meet requirements, which can improve the accuracy and efficiency of knowledge classification.
  • FIG. 1 is a flowchart of a training method for a knowledge classification model provided by an embodiment of the present disclosure
  • Fig. 2 is the flowchart of step 102 in Fig. 1;
  • Fig. 3 is a partial flowchart of the training method of the knowledge classification model provided by another embodiment
  • Fig. 4 is the flowchart of step 103 in Fig. 1;
  • Fig. 5 is a flowchart of step 104 in Fig. 1;
  • FIG. 6 is a flow chart of a knowledge classification method for multiple-choice questions provided by an embodiment of the present disclosure
  • FIG. 7 is a functional block diagram of a training device for a knowledge classification model provided by an embodiment of the present disclosure.
  • Fig. 8 is a functional block diagram of a knowledge classification device for multiple-choice questions provided by an embodiment of the present disclosure.
  • FIG. 9 is a schematic diagram of a hardware structure of a computer device provided by an embodiment of the present disclosure.
  • Artificial Intelligence (AI): a new technical science that studies and develops theories, methods, technologies and application systems for simulating, extending and expanding human intelligence. Artificial intelligence is a branch of computer science; it attempts to understand the essence of intelligence and to produce a new kind of intelligent machine that can respond in a manner similar to human intelligence. Research in this field includes robotics, language recognition, image recognition, natural language processing, and expert systems. Artificial intelligence can simulate the information process of human consciousness and thinking. Artificial intelligence is also the theory, method, technology and application system that uses digital computers or machines controlled by digital computers to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge and use knowledge to obtain the best results.
  • Natural Language Processing (NLP): NLP uses computers to process, understand and use human languages (such as Chinese and English). It is a branch of artificial intelligence and an interdisciplinary subject between computer science and linguistics, also known as computational linguistics. Natural language processing includes syntax analysis, semantic analysis, text understanding, etc. It is often used in technical fields such as machine translation, handwritten and printed character recognition, speech recognition and text-to-speech conversion, information retrieval, information extraction and filtering, text classification and clustering, public opinion analysis and opinion mining. It involves data mining, machine learning, knowledge acquisition, knowledge engineering and artificial intelligence research related to language processing, as well as linguistics research related to language computing.
  • Knowledge Graph: it combines the theories and methods of applied mathematics, graphics, information visualization technology, information science and other disciplines with methods such as metrological citation analysis and co-occurrence analysis, and uses a visual graph to display the subject visually.
  • the main goal of the knowledge graph is to describe the various entities and concepts that exist in the real world, as well as the strong relationships between them. Relationships are used to describe the association between two entities.
  • Entity: something that is distinguishable and exists independently, such as a certain person, a certain city, a certain plant, or a certain commodity. Everything in the world is made up of concrete things, which are referred to as entities. Entities are the most basic elements in knowledge graphs, and different entities have different relationships.
  • Semantic class (concept): a collection of entities with the same characteristics, such as countries, nations, books, computers, etc. Concepts mainly refer to collections, categories, object types, and types of things, such as people, geography, etc.
  • Relationship: there is a certain relationship between entities, between different concepts, and between concepts and entities.
  • a relation is formalized as a function that maps k graph nodes (entities, semantic classes, attribute values) to a Boolean value.
  • Attribute (value): an edge pointing from an entity to its attribute value. Different attribute types correspond to different types of edges.
  • an attribute mainly refers to a property that an object may have. For example: "area", "population", "capital" are several different attributes.
  • the attribute value mainly refers to the value of the specified attribute of the object, such as 9.6 million square kilometers.
  • Triple: a triple ({E, R}) is a general representation of a knowledge graph; the basic forms of a triple mainly include (entity 1-relationship-entity 2) and (entity-attribute-attribute value), among others.
  • Each entity (the extension of the concept) can be identified by a globally unique ID, each attribute-value pair (AVP) can be used to describe the intrinsic characteristics of the entity, and a relationship can be used to connect two entities and describe the connection between them.
  • for example, Beijing is an entity, population is an attribute, and 20.693 million is an attribute value.
  • Beijing-population-20.693 million constitutes an example triple of the form (entity-attribute-attribute value).
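  • The triple forms described above can be illustrated with a minimal Python sketch. The `Triple` type, the `TRIPLES` set and the helper names are assumptions introduced for illustration only and are not part of the application; the example values (Beijing, clause) are taken from the text.

```python
from typing import NamedTuple, Optional


class Triple(NamedTuple):
    head: str      # entity
    relation: str  # relation name or attribute name
    tail: str      # second entity or attribute value


# Example triples from the text; storing them in a set is an assumption.
TRIPLES = {
    Triple("Beijing", "population", "20.693 million"),  # entity-attribute-attribute value
    Triple("clause", "include", "attributive clause"),  # entity 1-relationship-entity 2
}


def holds(head: str, relation: str, tail: str) -> bool:
    """A relation viewed as a function from graph nodes to a Boolean value."""
    return Triple(head, relation, tail) in TRIPLES


def attribute_value(entity: str, attribute: str) -> Optional[str]:
    """Return the value stored for (entity, attribute), or None if absent."""
    for triple in TRIPLES:
        if triple.head == entity and triple.relation == attribute:
            return triple.tail
    return None
```

  • With these definitions, `holds("Beijing", "population", "20.693 million")` is true, and `attribute_value("Beijing", "area")` returns `None` because no such triple was stored.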
  • Token: a token is the basic unit of indexing and represents each indexed character. If a field is tokenized, the field has passed through an analyzer that converts its content into a string of tokens. During tokenization, the analyzer applies any transformation logic (such as removing stop words like "a" or "the", performing stemming, or converting all case-insensitive text to lowercase) to extract the text content that should be indexed.
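  • As a rough illustration of the tokenization just described, the sketch below lowercases the text, splits it into word tokens and removes stop words. The stop-word list and the regular expression are assumptions chosen for the example; stemming is omitted, and a production analyzer would be considerably more elaborate.

```python
import re
from typing import List

# Hypothetical stop-word list; real analyzers use much larger ones.
STOP_WORDS = {"a", "an", "the"}


def tokenize(text: str) -> List[str]:
    """Lowercase the text, split it into word tokens, and drop stop words."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    return [word for word in words if word not in STOP_WORDS]


tokens = tokenize("The cat sat on a mat")
```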
  • BERT (Bidirectional Encoder Representations from Transformers): the BERT model further increases the generalization ability of the word vector model, fully describes character-level, word-level, sentence-level and even inter-sentence relationship features, and is built on the Transformer.
  • the embodiments of the present application may acquire and process relevant data based on artificial intelligence (AI) technology.
  • artificial intelligence is the theory, method, technology and application system that uses digital computers or machines controlled by digital computers to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge and use knowledge to obtain the best results.
  • Artificial intelligence basic technologies generally include technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technology, operation/interaction systems, and mechatronics.
  • Artificial intelligence software technology mainly includes computer vision technology, robotics technology, biometrics technology, speech processing technology, natural language processing technology, and machine learning/deep learning.
  • an embodiment of the present disclosure provides a training method for a knowledge classification model, a knowledge classification method for multiple-choice questions, a training device for a knowledge classification model, a knowledge classification device for multiple-choice questions, a computer device, and a storage medium, which can improve the accuracy and efficiency with which the model classifies knowledge.
  • the knowledge classification model training method, knowledge classification method for multiple-choice questions, training device for knowledge classification model, knowledge classification device for multiple-choice questions, computer equipment, and storage media provided by the embodiments of the present disclosure will be specifically described through the following embodiments.
  • the training method of the knowledge classification model in the embodiment of the present disclosure will be specifically described through the following embodiments.
  • the training method of the knowledge classification model provided by the embodiment of the present disclosure relates to the technical field of machine learning.
  • the training method of the knowledge classification model provided by the embodiments of the present disclosure may be applied to a terminal, may also be applied to a server, and may also be software running on the terminal or the server.
  • the terminal can be a smart phone, a tablet computer, a notebook computer, a desktop computer, or a smart watch;
  • the server can be configured as an independent physical server, as a server cluster or distributed system composed of multiple physical servers, or as a cloud server that provides basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communications, middleware services, domain name services, security services, CDN, and big data and artificial intelligence platforms;
  • the software can be an application that implements the training method of the knowledge classification model, etc., but it is not limited to the above forms.
  • FIG. 1 is an optional flow chart of a method for training a knowledge classification model provided by an embodiment of the present disclosure.
  • the method in FIG. 1 may include, but is not limited to, steps 101 to 106.
  • Step 101: obtaining original annotation data;
  • the original annotation data includes question stem data, option data and answer data;
  • Step 102: encoding the question stem data to obtain a question stem representation vector;
  • Step 103: encoding the option data and answer data according to the preset knowledge graph to obtain option attribute values and answer attribute values;
  • Step 104: performing word segmentation and splicing processing on the option attribute values and the answer attribute values to obtain the option answer representation vector;
  • Step 105: splicing the question stem representation vector and the option answer representation vector to obtain topic data;
  • Step 106: training the preset pre-training model according to the topic data to obtain a knowledge classification model; wherein, the knowledge classification model is used to perform knowledge classification processing on the target topic to obtain the type of knowledge points.
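  • The data flow of steps 102 to 105 can be sketched as a small Python function. This is only an outline under stated assumptions: the function name `build_topic_data`, the three callables passed in, and the use of token lists in place of real representation vectors are all hypothetical, and step 106 (training the pre-training model) is deliberately left out.

```python
from typing import Callable, List, Sequence, Tuple

Vector = List[str]  # token list standing in for a representation vector


def build_topic_data(
    stem: str,
    options: Sequence[str],
    answer: str,
    encode_stem: Callable[[str], Vector],
    encode_with_graph: Callable[[Sequence[str], str], Tuple[List[str], str]],
    segment_and_splice: Callable[[List[str], str], Vector],
) -> Vector:
    stem_vec = encode_stem(stem)                                       # step 102
    option_values, answer_value = encode_with_graph(options, answer)   # step 103
    option_answer_vec = segment_and_splice(option_values, answer_value)  # step 104
    return stem_vec + option_answer_vec                                # step 105


# Toy stand-ins for the three encoders, for illustration only.
topic = build_topic_data(
    "which i bought last year",
    ["adverbial clause", "attributive clause"],
    "attributive clause",
    encode_stem=str.split,
    encode_with_graph=lambda opts, ans: (list(opts), ans),
    segment_and_splice=lambda vals, ans: " ".join(vals + [ans]).split(),
)
```

  • Passing the encoders in as parameters keeps the sketch independent of any particular tokenizer or knowledge graph implementation.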
  • in step 101 of an application scenario, a certain amount of original labeling data needs to be obtained, for example, 1 million pieces of original labeling data.
  • the original labeling data may be manually labeled topic data.
  • the label of the original labeling data is the type of knowledge point investigated by the topic.
  • for example, a topic labeled [attributive clause] investigates the attributive clause knowledge point, and a topic labeled [adverbial clause] investigates the adverbial clause knowledge point.
  • the 1 million pieces of labeled data are used to train the model, so that tens of millions or even more English questions can be classified automatically at the cost of labeling only 1 million pieces of data.
  • the original annotation data is the question stem data, option data and answer data of English multiple-choice questions.
  • the question stem data is encoded to obtain the question stem representation vector, and the option data and answer data in the original annotation data are encoded according to the preset knowledge graph to obtain the option attribute values and the answer attribute values. The option attribute values and the answer attribute values are then subjected to word segmentation and splicing processing to obtain the option answer representation vector, and the question stem representation vector and the option answer representation vector are spliced to obtain the topic data. Finally, the preset pre-training model is trained according to the topic data to obtain a knowledge classification model, which can be used to perform knowledge classification processing on the target topic to obtain the knowledge point type. The knowledge classification model obtained in the embodiments of the present disclosure can improve the accuracy and efficiency of knowledge classification.
  • the question stem data is encoded to obtain the question stem representation vector, specifically including:
  • Step 201 preprocessing the question stem data to obtain a preliminary question stem sequence
  • Step 202 perform word segmentation processing on the preliminary question stem sequence to obtain a question stem representation vector.
  • step 201 includes:
  • the English content of the question stem data is converted into lowercase; for example, if the English content of the question stem data is "I lOVE YOU", all of it is converted into lowercase, and the obtained preliminary question stem sequence is: i love you.
  • step 201 also includes:
  • the English abbreviated content of the question stem data is restored to its English full form, and the preliminary question stem sequence is obtained.
  • for example, after the abbreviation "I'm" is restored to its full form, the obtained preliminary question stem sequence is: i am.
  • step 202 word segmentation processing is performed on the preliminary question stem sequence to obtain a question stem representation vector, specifically including:
  • the preliminary stem sequence is:
  • the stem representation vector obtained after tokenizing i am playing is:
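  • Steps 201 and 202 can be sketched as follows. This is an illustrative outline only, not the tokenizer output referred to above: the abbreviation table is hypothetical apart from the "I'm" example taken from the text, and a plain whitespace split is assumed here in place of whatever word segmentation the pre-training model actually uses.

```python
import re
from typing import List

# Hypothetical abbreviation table; only "i'm" comes from the text.
CONTRACTIONS = {"i'm": "i am", "don't": "do not"}


def preprocess_stem(stem: str) -> str:
    """Step 201: lowercase the stem and restore abbreviations to full forms."""
    text = stem.lower()
    for short, full in CONTRACTIONS.items():
        text = re.sub(r"\b" + re.escape(short) + r"\b", full, text)
    return text


def segment_stem(preliminary: str) -> List[str]:
    """Step 202: split the preliminary sequence into tokens (whitespace split)."""
    return preliminary.split()


stem_tokens = segment_stem(preprocess_stem("I'm playing"))
```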
  • the training method of the knowledge classification model also includes: building a knowledge graph, which may specifically include but is not limited to steps 301 to 303:
  • Step 301 acquiring preset knowledge points
  • Step 302 constructing a first triplet and a second triplet according to preset knowledge points
  • Step 303 constructing a knowledge graph based on the first triple and the second triple; wherein, the first triple includes the first knowledge entity, relationship, and second knowledge entity, and the second triple includes the second knowledge entity , attribute, attribute value.
  • in step 301 of some embodiments, technical means such as a web crawler may be used to crawl relevant data such as preset knowledge points; relevant data may also be obtained from a preset database.
  • the preset knowledge points are preset English knowledge points, such as English test points in the English online education scenario.
  • the principle of constructing the English knowledge graph is: a first triplet and a second triplet are constructed for each of the preset knowledge points, wherein the first triplet includes a first knowledge entity, a relation, and a second knowledge entity, and the second triplet includes the second knowledge entity, an attribute, and an attribute value.
  • the association between the first knowledge entity and the second knowledge entity is established through an undirected edge.
  • regarding the first triplet: if there is a relationship between two knowledge nodes, the two nodes are connected by an undirected edge. A knowledge node is called an entity, and the undirected edge represents the relationship between the two knowledge nodes; in the embodiment of the present disclosure, the two knowledge nodes correspond to the first knowledge entity and the second knowledge entity.
  • regarding the second triplet: the second knowledge entity represents the name of the corresponding English knowledge point, so the second triplet consists of the name of the English knowledge point, an attribute of that knowledge point, and the attribute value corresponding to the attribute.
  • for example, the first triple can be expressed as: clause-include-attributive clause, or as: clause-include-adverbial clause; where [clause] is an English knowledge point that includes the two knowledge points [attributive clause] and [adverbial clause], and the relationship between them is containment.
  • the second triple can be expressed as: attributive clause-grade-grade 8, or attributive clause-relative word-which. The [attributive clause] has an attribute [grade] whose attribute value is [Grade 8], which means that the [attributive clause] is a knowledge point of [Grade 8]. The [attributive clause] also has an attribute [relative word], whose attribute value is which.
  • with the knowledge graph, the composition structure of English knowledge points and the inspection points of English knowledge points can be clearly known; in addition, the sum of edges between two knowledge points can be calculated to judge whether the two knowledge points are similar knowledge points, which can be done with reference to related technologies and is not limited in the embodiments of the present disclosure.
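  • Steps 301 to 303 can be illustrated with a minimal Python sketch that builds an undirected graph from the example triples in the text. The function names are hypothetical, and interpreting "the sum of edges between two knowledge points" as the number of edges on the shortest path is an assumption made for this example; the application leaves the similarity measure to related technologies.

```python
from collections import defaultdict, deque

# First triples (entity-relation-entity) and second triples
# (entity-attribute-attribute value), using the examples from the text.
FIRST_TRIPLES = [
    ("clause", "include", "attributive clause"),
    ("clause", "include", "adverbial clause"),
]
SECOND_TRIPLES = [
    ("attributive clause", "grade", "Grade 8"),
    ("attributive clause", "relative word", "which"),
]


def build_graph(first, second):
    """Return (undirected adjacency between entities, attributes per entity)."""
    neighbors = defaultdict(set)
    for head, _relation, tail in first:
        neighbors[head].add(tail)  # undirected edge: store both directions
        neighbors[tail].add(head)
    attributes = defaultdict(dict)
    for entity, attribute, value in second:
        attributes[entity][attribute] = value
    return neighbors, attributes


def edge_distance(neighbors, start, goal):
    """Number of edges on the shortest path between two entities, or None."""
    queue, seen = deque([(start, 0)]), {start}
    while queue:
        node, dist = queue.popleft()
        if node == goal:
            return dist
        for nxt in neighbors.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None


neighbors, attributes = build_graph(FIRST_TRIPLES, SECOND_TRIPLES)
```

  • In this toy graph, [attributive clause] and [adverbial clause] are two edges apart because both hang off [clause]; a small distance of this kind could serve as one signal that two knowledge points are related.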
  • in step 103 of some embodiments, the preset knowledge graph includes a first triplet and multiple second triplets, and encoding the option data and answer data according to the preset knowledge graph to obtain option attribute values and answer attribute values may specifically include, but is not limited to:
  • Step 401 Encoding option data according to the first triplet and multiple second triplets to obtain option attribute values; wherein, the option attribute value includes attribute values of multiple second triplets;
  • Step 402 Encode the answer data according to the first triplet and one of the second triplets to obtain the answer attribute value; wherein, the answer attribute value is one of the multiple attribute values in the option attribute value .
  • the embodiment of the present disclosure introduces the knowledge information of the knowledge graph into the encoding stage of the options and answers.
  • the options and answers of a question are mapped to knowledge entities using the information in the first triplet and the second triplets of the knowledge graph.
  • take an English multiple-choice question as an example.
  • a sentence containing clause content is given in the question stem data: My house, which I bought last year, has got a stylish garden.
  • the question stem data it is required to judge the clause type of the clause "which I bought last year”.
  • the option data consists of four options A, B, C and D, where option A is an adverbial clause, option B is a subject clause, option C is an attributive clause, and option D is a predicative clause.
  • The first triplet of the knowledge graph is expressed as: clause-contains-attributive clause.
  • The second triplet is: attributive clause-relative word-which.
  • The "which" in the clause "which I bought last year" is a relative word, so the type of the corresponding clause is "attributive clause", which is what the second triplet expresses: attributive clause-relative word-which.
  • The answer for the type of the clause "which I bought last year" is therefore that the clause is an attributive clause, which corresponds to the first triplet: clause-contains-attributive clause.
  • The option data is encoded according to the first triplet and multiple second triplets, and the resulting option attribute values are: adverbial clause, subject clause, attributive clause, and predicative clause.
  • The answer data is encoded according to the first triplet and one of the second triplets, and the resulting answer attribute value is: attributive clause (that is, the attributive clause in the option attribute value). In this application scenario, the English knowledge point being examined is the identification of attributive clauses.
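The triplet-based encoding in the example above can be illustrated with a minimal sketch. It assumes the knowledge graph is a plain list of (head, relation, tail) tuples; the `TRIPLETS` contents and the helper names `encode_options` and `encode_answer` are hypothetical and only mirror the clause example, they are not taken from the disclosure.

```python
# Hypothetical toy knowledge graph stored as (head, relation, tail) triplets.
# "clause-contains-X" plays the role of the first triplet; the
# "X-relative word-which" entry plays the role of a second triplet.
TRIPLETS = [
    ("clause", "contains", "adverbial clause"),
    ("clause", "contains", "subject clause"),
    ("clause", "contains", "attributive clause"),
    ("clause", "contains", "predicative clause"),
    ("attributive clause", "relative word", "which"),
]


def encode_options(triplets, head="clause", relation="contains"):
    """Collect every attribute value linked to `head` by `relation`."""
    return [t for h, r, t in triplets if h == head and r == relation]


def encode_answer(triplets, cue_word, option_values):
    """Return the option attribute value that a triplet links to `cue_word`."""
    for h, _relation, t in triplets:
        if t == cue_word and h in option_values:
            return h
    return None  # no triplet connects the cue word to any option


option_values = encode_options(TRIPLETS)
answer_value = encode_answer(TRIPLETS, "which", option_values)
print(option_values)
print(answer_value)
```

In this sketch the answer attribute value is always drawn from the option attribute values, which matches the constraint stated in step 402.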
  • In step 104 of some embodiments, performing word segmentation and splicing on the option attribute value and the answer attribute value to obtain the option answer representation vector may include, but is not limited to:
  • Step 501: Perform word vectorization on the option attribute value and the answer attribute value to obtain the word-vectorized option attribute value and answer attribute value;
  • Step 502: Concatenate the word-vectorized option attribute value and answer attribute value to obtain the option answer representation vector.
  • The knowledge words corresponding to the option attribute value and the answer attribute value are vectorized into a vector token for the option attribute value and a vector token for the answer attribute value, and the two vector tokens are then spliced to obtain the option answer representation vector.
  • The option attribute value and the answer attribute value can also be spliced first to obtain an option answer attribute value, which is then vectorized into a vector token corresponding to the option answer, that is, the option answer representation vector.
  • The option attribute value is a sentence sequence A,
  • the answer attribute value is a sentence B,
  • and the two sentences A and B are concatenated into the option answer representation vector.
  • The option answer representation vector can be a sequence of length 320. If it is shorter than 320, it is zero-padded; and because the option attribute value may be very long, truncation may be needed, cutting off the tail of the longer sentence each time until the length of the whole option answer representation vector is 320.
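The padding and truncation rule just described can be sketched as follows. This is an illustration only: it assumes the two sentences are already lists of integer token ids, and the names `truncate_pair`, `build_option_answer_ids`, and `PAD_ID` are hypothetical. Only the length 320 and the "cut the tail of the longer sentence" rule come from the text.

```python
MAX_LEN = 320  # target sequence length stated in the text
PAD_ID = 0     # zero-filling value (assumed to be a reserved padding id)


def truncate_pair(tokens_a, tokens_b, max_len=MAX_LEN):
    """Drop one token at a time from the tail of the longer sequence."""
    a, b = list(tokens_a), list(tokens_b)
    while len(a) + len(b) > max_len:
        longer = a if len(a) >= len(b) else b
        longer.pop()
    return a, b


def build_option_answer_ids(option_ids, answer_ids, max_len=MAX_LEN, pad_id=PAD_ID):
    """Concatenate sentence A (options) and sentence B (answer), then zero-pad."""
    a, b = truncate_pair(option_ids, answer_ids, max_len)
    seq = a + b
    return seq + [pad_id] * (max_len - len(seq))


short = build_option_answer_ids([11, 12], [13])
long = build_option_answer_ids(list(range(1, 401)), [7, 8, 9])
print(len(short), len(long))
```

Truncating the longer side first keeps the short answer sentence intact when the option sentence is the one that overflows.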
  • The question stem data gives a clause and requires judging its clause type.
  • The options are A, B, C, and D.
  • Option A is an adverbial clause,
  • option B is a subject clause,
  • option C is an attributive clause,
  • and option D is a predicative clause.
  • The answer data corresponds to: attributive clause. That is, the option attribute value includes the adverbial clause, the subject clause, the attributive clause, and the predicative clause,
  • and the answer attribute value is the attributive clause. Therefore, the option answer representation vector obtained after word segmentation and splicing of the option attribute value and the answer attribute value is expressed as [adverbial clause, subject clause, attributive clause, predicative clause, attributive clause].
  • Vector splicing of the question stem representation vector and the option answer representation vector to obtain the question data may include, but is not limited to:
  • splicing the question stem representation vector and the option answer representation vector through separators to obtain the question data.
  • The separators can be a pair of placeholders: a first placeholder [CLS] and a second placeholder [SEP], where the first placeholder [CLS] marks the beginning of the sequence and the second placeholder [SEP] marks the end of the sequence.
  • CLS: classifier token.
  • SEP: sentence separator, which is also a special token that can be used to separate two sentences.
  • Splicing the question stem representation vector and the option answer representation vector through the separators to obtain the question data includes:
  • placing the question stem representation vector between the first placeholder and the second placeholder, placing the second placeholder between the question stem representation vector and the option answer representation vector, and splicing the question stem representation vector and the option answer representation vector to obtain the question data.
  • The question data takes the form: [<CLS>, question stem representation vector, <SEP>, option answer representation vector].
  • For example, the question stem representation vector is: [i, am, play, ing],
  • and the option answer representation vector is: [adverbial clause, subject clause, attributive clause, predicative clause, attributive clause].
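As a small illustration of the layout above, the sketch below assembles the sequence from the two example vectors. It works on plain string tokens rather than real embeddings, and the function name `build_question_tokens` is an assumption made for this example.

```python
CLS = "[CLS]"  # first placeholder: start of the sequence
SEP = "[SEP]"  # second placeholder: separates stem from options/answer


def build_question_tokens(stem_tokens, option_answer_tokens):
    """Lay out the sequence as [CLS] + stem + [SEP] + option/answer tokens."""
    return [CLS] + list(stem_tokens) + [SEP] + list(option_answer_tokens)


stem = ["i", "am", "play", "ing"]
option_answer = [
    "adverbial clause",
    "subject clause",
    "attributive clause",
    "predicative clause",
    "attributive clause",  # the answer attribute value, appended last
]
question_tokens = build_question_tokens(stem, option_answer)
print(question_tokens)
```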
  • The preset pre-trained model can be a BERT model. Specifically, the question data obtained in step 105 is used as the input of the BERT model, and the BERT model is trained to obtain the knowledge classification model,
  • whose basic framework is the BERT model. The knowledge classification model is used to predict the knowledge type of the target question. Specifically, the knowledge classification model includes a softmax classifier: it obtains the feature vector information corresponding to <CLS> from the input question data, and passing this feature vector through the softmax classifier predicts the knowledge type of the target question.
  • The target question is a question input into the knowledge classification model. For example, it may be a multiple-choice question; more specifically, for English multiple-choice questions, the target question may be a multiple-choice question examining attributive clauses.
  • Each token has three embeddings: a token embedding, a position embedding, and a segment embedding. The token embedding is the vector representation of the token obtained by pre-training the model on the corpus;
  • the position embedding encodes the position index of the current token in the sequence;
  • the segment embedding marks whether the token belongs to sentence A or sentence B in the sequence, where the segment embedding of a token belonging to sentence A is 0 and that of a token belonging to sentence B is 1.
  • The token embedding, position embedding, and segment embedding are combined to form the embedding of each token, the embedding of the entire sequence is input into the multi-layer bidirectional Transformer encoder, and from the last hidden layer
  • the vector corresponding to the first token (namely [CLS]) is taken as the aggregate representation of the entire sentence, that is, the vector representation of the entire option sequence.
  • The knowledge type of the question can then be predicted by passing this representation of the question data through the softmax classifier.
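The input bookkeeping and the classification head described above can be sketched in plain Python. This is a simplified illustration, not the disclosed model: the Transformer encoder is omitted, the [CLS] vector and the linear weights are hand-picked toy values, assigning the separator itself to sentence A is an assumed convention, and all function names are hypothetical.

```python
import math


def segment_and_position_ids(tokens, sep="[SEP]"):
    """Segment id is 0 up to and including the first separator, 1 afterwards."""
    segments, positions, current = [], [], 0
    for index, token in enumerate(tokens):
        segments.append(current)
        positions.append(index)
        if token == sep and current == 0:
            current = 1  # everything after the first [SEP] is sentence B
    return segments, positions


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify(cls_vector, weights, bias, labels):
    """Linear layer followed by softmax over the [CLS] vector."""
    logits = [
        sum(w * x for w, x in zip(row, cls_vector)) + b
        for row, b in zip(weights, bias)
    ]
    probs = softmax(logits)
    best = max(range(len(labels)), key=lambda i: probs[i])
    return labels[best], probs


tokens = ["[CLS]", "i", "am", "[SEP]", "attributive", "clause"]
segments, positions = segment_and_position_ids(tokens)

# Toy [CLS] vector and weights standing in for a trained encoder and head.
label, probs = classify(
    cls_vector=[1.0, 0.0],
    weights=[[2.0, 0.0], [0.0, 2.0]],
    bias=[0.0, 0.0],
    labels=["attributive clause", "adverbial clause"],
)
print(segments, positions, label)
```

In a real system the [CLS] vector would come from the last hidden layer of the fine-tuned encoder, and the weights of the linear layer would be learned during training.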
  • In the training method of the knowledge classification model provided by the embodiments of the present disclosure, the original annotation data is obtained and the question stem data in it is encoded to obtain the question stem representation vector; the option data and answer data in the original annotation data are encoded according to the preset knowledge graph to obtain the option attribute value and the answer attribute value; the option attribute value and the answer attribute value are subjected to word segmentation and splicing to obtain the option answer representation vector; the question stem representation vector and the option answer representation vector are spliced to obtain the question data; and finally the preset pre-trained model is trained on the question data to obtain the knowledge classification model.
  • This knowledge classification model can be used to perform knowledge classification on the target question to obtain the knowledge point type.
  • The knowledge classification model obtained in the embodiments of the present disclosure can improve the accuracy and efficiency of knowledge classification.
  • For example, when English multiple-choice questions are classified, the model can automatically identify the knowledge points that the questions examine.
  • The technical solutions of the embodiments of the present disclosure can improve the accuracy and efficiency of knowledge classification, and by introducing the knowledge graph encoding information (triplet information) of the options and answers, the knowledge type of a question can be predicted more accurately. With a fixed labeling cost, new questions can be classified more efficiently.
  • The embodiment of the present disclosure also provides a knowledge classification method for multiple-choice questions.
  • The knowledge classification method for multiple-choice questions provided by the embodiment of the present disclosure relates to the technical field of machine learning.
  • The knowledge classification method for multiple-choice questions provided by the embodiments of the present disclosure can be applied to a terminal or to a server, and can also be software running on the terminal or the server.
  • The terminal can be a smart phone, a tablet computer, a notebook computer, a desktop computer, or a smart watch;
  • the server can be configured as an independent physical server, as a server cluster or distributed system composed of multiple physical servers,
  • or as a cloud server that provides basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communications, middleware services, domain name services, security services, CDN, and big data and artificial intelligence platforms;
  • the software can be an application that implements the knowledge classification method for multiple-choice questions, but is not limited to the above forms.
  • Fig. 6 is an optional flow chart of the knowledge classification method for multiple-choice questions provided by the embodiment of the present disclosure.
  • The method in Fig. 6 may include, but is not limited to, steps 601 to 604:
  • Step 601: Obtain multiple-choice question data to be classified, where the multiple-choice question data includes question stem data, option data, and answer data;
  • Step 602: Input the multiple-choice question data into the knowledge classification model, where the knowledge classification model is obtained through training according to the method of the first aspect above;
  • Step 603: Perform feature extraction on the multiple-choice question data through the knowledge classification model to obtain feature vector information;
  • Step 604: Perform knowledge classification processing according to the feature vector information to obtain the knowledge point type.
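Steps 601 to 604 can be read as a simple pipeline. The sketch below wires them together with caller-supplied functions; the dictionary keys and the parameter names `encode`, `extract_cls`, and `classify_cls` are hypothetical stand-ins for the trained components, which the sketch does not implement.

```python
def classify_question(question, encode, extract_cls, classify_cls):
    """Run steps 601-604 using caller-supplied (hypothetical) components."""
    # Step 601: the input carries a stem, options and an answer, but no label.
    required = ("stem", "options", "answer")
    missing = [key for key in required if key not in question]
    if missing:
        raise ValueError(f"missing fields: {missing}")
    model_input = encode(question)         # Step 602: build the model input
    cls_vector = extract_cls(model_input)  # Step 603: feature extraction
    return classify_cls(cls_vector)        # Step 604: knowledge classification


# Toy stand-ins so the pipeline can be exercised end to end.
demo_question = {
    "stem": "My house, which I bought last year, has got a stylish garden.",
    "options": ["adverbial clause", "subject clause",
                "attributive clause", "predicative clause"],
    "answer": "attributive clause",
}
result = classify_question(
    demo_question,
    encode=lambda q: ["[CLS]", q["stem"], "[SEP]", *q["options"], q["answer"]],
    extract_cls=lambda tokens: [float(len(tokens))],
    classify_cls=lambda vec: "attributive clause" if vec[0] > 0 else "unknown",
)
print(result)
```

Passing the components in as arguments keeps the flow testable without a trained model, at the cost of hiding what a real encoder and classifier would do.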
  • The multiple-choice question data to be classified includes question stem data, option data, and answer data.
  • The multiple-choice question data differs from the original annotation data: the original annotation data includes knowledge point types, whereas the multiple-choice question data does not.
  • Target questions include the multiple-choice question data to be classified.
  • The knowledge classification model includes a softmax classifier.
  • In the knowledge classification method for multiple-choice questions, feature extraction is performed on the multiple-choice question data through the knowledge classification model to obtain the feature vector information corresponding to <CLS>; the obtained feature vector information includes the question stem representation vector and the option answer representation vector.
  • The question stem representation vector is the same as in the above training method of the knowledge classification model: it is set between the first placeholder <CLS> and the second placeholder <SEP>, so the question stem representation vector can also be said to include the first placeholder <CLS>.
  • As in the above training method, the second placeholder <SEP> is set between the question stem representation vector and the option answer representation vector, so the option answer representation vector can also be said to include the second placeholder <SEP>.
  • In step 604 of some embodiments, the feature vector information corresponding to <CLS> obtained in step 603 is passed through the softmax classifier, which performs classification processing on it and thereby predicts the knowledge type of the question.
  • An embodiment of the present disclosure also provides a training device for a knowledge classification model, which can implement the above training method for the knowledge classification model.
  • The training device for the knowledge classification model includes: an original data acquisition module, used to obtain original annotation data,
  • where the original annotation data includes question stem data, option data, and answer data;
  • a question stem encoding module, used to encode the question stem data to obtain the question stem representation vector;
  • an option answer encoding module, used to encode the option data and the answer data according to the preset knowledge graph to obtain the option attribute value and the answer attribute value;
  • a word segmentation and splicing module, used to perform word segmentation and splicing on the option attribute value and the answer attribute value to obtain the option answer representation vector;
  • a vector splicing module, used to splice the question stem representation vector and the option answer representation vector to obtain the question data;
  • a classification model training module, used to train a preset pre-trained model according to the question data to obtain the knowledge classification model.
  • The training device for the knowledge classification model in the embodiment of the present disclosure is used to execute the training method for the knowledge classification model in the above embodiment; its specific processing is the same as that of the training method in the above embodiment and is not repeated here.
  • An embodiment of the present disclosure also provides a knowledge classification device for multiple-choice questions, which can implement the above knowledge classification method for multiple-choice questions.
  • The knowledge classification device for multiple-choice questions includes: a multiple-choice question data acquisition module, used to obtain multiple-choice question data, where the multiple-choice question data includes question stem data, option data, and answer data; a data input module, used to input the multiple-choice question data into the knowledge classification model, where the knowledge classification model is trained according to the training method of the knowledge classification model of the above first aspect;
  • a feature extraction module, used to perform feature extraction on the multiple-choice question data through the knowledge classification model to obtain feature vector information; and a knowledge classification module, used to perform knowledge classification processing according to the feature vector information to obtain the knowledge point type.
  • The knowledge classification device for multiple-choice questions in the embodiment of the present disclosure is used to implement the knowledge classification method for multiple-choice questions in the above embodiments; its specific processing is the same as that of the method in the above embodiments and is not repeated here.
  • An embodiment of the present disclosure also provides a computer device, including:
  • the program is stored in the memory, and the processor executes the at least one program to implement the above training method for the knowledge classification model or the knowledge classification method for multiple-choice questions of the present disclosure.
  • The computer device may be any intelligent terminal, including a mobile phone, a tablet computer, a personal digital assistant (PDA), a vehicle-mounted computer, and the like.
  • FIG. 9 illustrates the hardware structure of a computer device in another embodiment, and the computer device includes:
  • the processor 701, which can be implemented by a general-purpose CPU (Central Processing Unit), a microprocessor, an application-specific integrated circuit (ASIC), or one or more integrated circuits, and is used to execute related programs to realize the technical solutions provided by the embodiments of the present disclosure;
  • the memory 702, which may be implemented in the form of a ROM (Read-Only Memory), a static storage device, a dynamic storage device, or a RAM (Random Access Memory).
  • The memory 702 can store operating systems and other application programs. When the technical solutions provided by the embodiments of this specification are implemented through software or firmware, the relevant program code is stored in the memory 702 and called by the processor 701 to execute the training method for the knowledge classification model or the knowledge classification method for multiple-choice questions of the embodiments of the present disclosure.
  • The knowledge classification method for multiple-choice questions includes: obtaining multiple-choice question data to be classified, where the multiple-choice question data includes question stem data; encoding the question stem data to obtain a question stem representation vector; inputting the question stem representation vector into the knowledge classification model, where the knowledge classification model is obtained by training according to the above training method for the knowledge classification model; performing feature extraction on the question stem data through the knowledge classification model to obtain feature vector information; and performing knowledge classification processing according to the feature vector information to obtain the knowledge point type.
  • the input/output interface 703, used to realize information input and output;
  • the communication interface 704, used to realize communication between this device and other devices, where the communication can be wired (such as USB or network cable) or wireless (such as a mobile network, Wi-Fi, or Bluetooth); and
  • the processor 701, the memory 702, the input/output interface 703, and the communication interface 704 are connected to each other within the device through the bus 705.
  • An embodiment of the present disclosure also provides a storage medium, which is a computer-readable storage medium; the computer-readable storage medium may be non-volatile or volatile.
  • The computer-readable storage medium stores computer-executable instructions, which are used to make a computer execute the above training method for the knowledge classification model or the knowledge classification method for multiple-choice questions. The training method for the knowledge classification model includes: obtaining original annotation data, where the original annotation data includes question stem data, option data, and answer data; encoding the question stem data to obtain the question stem representation vector; encoding the option data and answer data according to the preset knowledge graph to obtain the option attribute value and the answer attribute value; performing word segmentation and splicing on the option attribute value and the answer attribute value to obtain the option answer representation vector; splicing the question stem representation vector and the option answer representation vector to obtain the question data;
  • and training the preset pre-trained model according to the question data to obtain the knowledge classification model, where the knowledge classification model is used to perform knowledge classification processing on the target question to obtain the knowledge point type.
  • The knowledge classification method for multiple-choice questions includes: obtaining multiple-choice question data to be classified, where the multiple-choice question data includes question stem data; encoding the question stem data to obtain a question stem representation vector; inputting the question stem representation vector into the knowledge classification model, where the knowledge classification model is obtained by training according to the above training method for the knowledge classification model; performing feature extraction on the question stem data through the knowledge classification model to obtain feature vector information; and performing knowledge classification processing according to the feature vector information to obtain the knowledge point type.
  • The training method for the knowledge classification model, the knowledge classification method for multiple-choice questions, the training device for the knowledge classification model, the knowledge classification device for multiple-choice questions, the computer device, and the storage medium proposed by the embodiments of the present disclosure obtain the original annotation data;
  • the question stem data in the original annotation data is encoded to obtain the question stem representation vector;
  • the option data and answer data in the original annotation data are encoded according to the preset knowledge graph to obtain the option attribute value and the answer attribute value; the option attribute value and the answer attribute value are then subjected to word segmentation and splicing to obtain the option answer representation vector; the question stem representation vector and the option answer representation vector are spliced to obtain the question data;
  • and finally the preset pre-trained model is trained according to the question data to obtain the knowledge classification model.
  • The knowledge classification model can be used to perform knowledge classification processing on the target question to obtain the knowledge point type.
  • The knowledge classification model obtained in the embodiments of the present disclosure can improve the accuracy and efficiency of knowledge classification.
  • The memory can be used to store non-transitory software programs and non-transitory computer-executable programs.
  • The memory may include high-speed random access memory, and may also include non-transitory memory, such as at least one magnetic disk storage device, flash memory device, or other non-transitory solid-state storage device.
  • The memory optionally includes memory located remotely from the processor, and such remote memory may be connected to the processor via a network. Examples of such networks include, but are not limited to, the Internet, intranets, local area networks, mobile communication networks, and combinations thereof.
  • The technical solutions shown in FIGS. 1-6 do not limit the embodiments of the present disclosure, which may include more or fewer steps than shown, combine certain steps, or use different steps.
  • The device embodiments described above are only illustrative. The units described as separate components may or may not be physically separated; that is, they may be located in one place or distributed across multiple network units. Some or all of the modules can be selected according to actual needs to achieve the purpose of the solution of this embodiment.
  • The units described as separate components may or may not be physically separated, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of this embodiment.
  • The functional units in each embodiment of the present application may be integrated into one processing unit, each unit may exist separately physically, or two or more units may be integrated into one unit.
  • The above integrated units can be implemented in the form of hardware or in the form of software functional units.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Software Systems (AREA)
  • Databases & Information Systems (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Animal Behavior & Ethology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Medical Informatics (AREA)
  • Evolutionary Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Health & Medical Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present application belongs to the technical field of machine learning. Provided are a model training method and apparatus, a knowledge classification method and apparatus, and a device and a medium. The model training method comprises: acquiring original annotation data, wherein the original annotation data comprises question stem data, choice data and answer data; encoding the question stem data to obtain a question stem representation vector; encoding the choice data and the answer data according to a preset knowledge graph, so as to obtain a choice attribute value and an answer attribute value; performing word segmentation and splicing processing on the choice attribute value and the answer attribute value, so as to obtain a choice answer representation vector; performing vector splicing on the question stem representation vector and the choice answer representation vector, so as to obtain question data; and training a preset pre-training model according to the question data, so as to obtain a knowledge classification model, wherein the knowledge classification model is used for performing knowledge classification on a target question. The knowledge classification model obtained in the embodiments of the present disclosure can improve the accuracy and efficiency of knowledge classification.

Description

Model training method, knowledge classification method, apparatus, device, and medium
This application claims priority to the Chinese patent application No. 202111536048.0, filed with the China Patent Office on December 15, 2021 and entitled "Model training method, knowledge classification method, apparatus, device, and medium", the entire content of which is incorporated herein by reference.
Technical Field
The present application relates to the technical field of machine learning, and in particular to a model training method, a knowledge classification method, an apparatus, a device, and a medium.
Background
With the development of artificial intelligence technology, many kinds of data can now be processed by technical solutions based on artificial intelligence; for example, machine reading comprehension can be used to give answers to questions. Machine reading comprehension is a technology that enables a machine to understand natural language text and, given a question and a document, produce the corresponding answer. It can be applied in many fields, such as text question answering, information extraction for knowledge graphs and event graphs, and dialogue systems.
Technical Problem
The following are the technical problems of the prior art that the inventors are aware of:
In some application scenarios, technical solutions for classifying knowledge are lacking. For example, in English online education, questions that examine English knowledge points need to be classified so that questions on the same knowledge point can be grouped together for targeted training of users. The number of English questions is very large, and new questions are developed every year; classifying each question manually is labor-intensive, inefficient, and error-prone.
Technical Solution
In a first aspect, an embodiment of the present application provides a training method for a knowledge classification model, the training method including:
obtaining original annotation data, the original annotation data including question stem data, option data, and answer data;
encoding the question stem data to obtain a question stem representation vector;
encoding the option data and the answer data according to a preset knowledge graph to obtain option attribute values and answer attribute values;
performing word segmentation and concatenation on the option attribute values and the answer attribute values to obtain an option-answer representation vector;
concatenating the question stem representation vector and the option-answer representation vector to obtain question data;
training a preset pre-trained model according to the question data to obtain the knowledge classification model, wherein the knowledge classification model is used to perform knowledge classification on a target question to obtain a knowledge point type.
In a second aspect, an embodiment of the present application provides a knowledge classification method for multiple-choice questions, the knowledge classification method including:
obtaining multiple-choice question data to be classified, wherein the multiple-choice question data includes question stem data;
encoding the question stem data to obtain a question stem representation vector;
inputting the question stem representation vector into a knowledge classification model, wherein the knowledge classification model is trained according to the method of the first aspect;
performing feature extraction on the question stem data through the knowledge classification model to obtain feature vector information;
performing knowledge classification according to the feature vector information to obtain a knowledge point type.
In a third aspect, an embodiment of the present application provides a training apparatus for a knowledge classification model, the training apparatus including:
an original data acquisition module, configured to obtain original annotation data, the original annotation data including question stem data, option data, and answer data;
a question stem encoding module, configured to encode the question stem data to obtain a question stem representation vector;
an option-answer encoding module, configured to encode the option data and the answer data according to a preset knowledge graph to obtain option attribute values and answer attribute values;
a word segmentation and concatenation module, configured to perform word segmentation and concatenation on the option attribute values and the answer attribute values to obtain an option-answer representation vector;
a vector concatenation module, configured to concatenate the question stem representation vector and the option-answer representation vector to obtain question data;
a classification model training module, configured to train a preset pre-trained model according to the question data to obtain the knowledge classification model, wherein the knowledge classification model is used to perform knowledge classification on a target question to obtain a knowledge point type.
In a fourth aspect, an embodiment of the present application provides a knowledge classification apparatus for multiple-choice questions, the knowledge classification apparatus including:
a multiple-choice question data acquisition module, configured to obtain multiple-choice question data to be classified, wherein the multiple-choice question data includes question stem data, option data, and answer data;
a data input module, configured to input the multiple-choice question data into a knowledge classification model, wherein the knowledge classification model is trained according to the method of the first aspect;
a feature extraction module, configured to perform feature extraction on the multiple-choice question data through the knowledge classification model to obtain feature vector information;
a knowledge classification module, configured to perform knowledge classification according to the feature vector information to obtain a knowledge point type.
In a fifth aspect, an embodiment of the present application provides a computer device, including:
at least one memory;
at least one processor;
at least one program;
wherein the program is stored in the memory, and the processor executes the at least one program to implement a training method for a knowledge classification model or a knowledge classification method for multiple-choice questions. The training method for the knowledge classification model includes: obtaining original annotation data, wherein the original annotation data includes question stem data, option data, and answer data; encoding the question stem data to obtain a question stem representation vector; encoding the option data and the answer data according to a preset knowledge graph to obtain option attribute values and answer attribute values; performing word segmentation and concatenation on the option attribute values and the answer attribute values to obtain an option-answer representation vector; concatenating the question stem representation vector and the option-answer representation vector to obtain question data; and training a preset pre-trained model according to the question data to obtain the knowledge classification model, wherein the knowledge classification model is used to perform knowledge classification on a target question to obtain a knowledge point type.
The knowledge classification method for multiple-choice questions includes: obtaining multiple-choice question data to be classified, wherein the multiple-choice question data includes question stem data; encoding the question stem data to obtain a question stem representation vector; inputting the question stem representation vector into a knowledge classification model, wherein the knowledge classification model is trained according to the above training method for the knowledge classification model; performing feature extraction on the question stem data through the knowledge classification model to obtain feature vector information; and performing knowledge classification according to the feature vector information to obtain a knowledge point type.
In a sixth aspect, an embodiment of the present application provides a storage medium, which is a computer-readable storage medium storing computer-executable instructions, the computer-executable instructions being used to cause a computer to execute a training method for a knowledge classification model or a knowledge classification method for multiple-choice questions. The training method for the knowledge classification model includes: obtaining original annotation data, wherein the original annotation data includes question stem data, option data, and answer data; encoding the question stem data to obtain a question stem representation vector; encoding the option data and the answer data according to a preset knowledge graph to obtain option attribute values and answer attribute values; performing word segmentation and concatenation on the option attribute values and the answer attribute values to obtain an option-answer representation vector; concatenating the question stem representation vector and the option-answer representation vector to obtain question data; and training a preset pre-trained model according to the question data to obtain the knowledge classification model, wherein the knowledge classification model is used to perform knowledge classification on a target question to obtain a knowledge point type.
The knowledge classification method for multiple-choice questions includes: obtaining multiple-choice question data to be classified, wherein the multiple-choice question data includes question stem data; encoding the question stem data to obtain a question stem representation vector; inputting the question stem representation vector into a knowledge classification model, wherein the knowledge classification model is trained according to the above training method for the knowledge classification model; performing feature extraction on the question stem data through the knowledge classification model to obtain feature vector information; and performing knowledge classification according to the feature vector information to obtain a knowledge point type.
Beneficial Effects
With the training method for a knowledge classification model, the knowledge classification method for multiple-choice questions, the training apparatus for a knowledge classification model, the knowledge classification apparatus for multiple-choice questions, the computer device, and the storage medium provided by the embodiments of the present application, the trained knowledge classification model can perform knowledge classification on a target question to obtain a knowledge point type that meets the requirement, which can improve the accuracy and efficiency of knowledge classification.
Description of Drawings
The accompanying drawings provide a further understanding of the technical solution of the present application and constitute a part of the specification. Together with the embodiments of the present application, they serve to explain the technical solution of the present application and do not limit it.
FIG. 1 is a flowchart of a training method for a knowledge classification model provided by an embodiment of the present disclosure;
FIG. 2 is a flowchart of step 102 in FIG. 1;
FIG. 3 is a partial flowchart of a training method for a knowledge classification model provided by another embodiment;
FIG. 4 is a flowchart of step 103 in FIG. 1;
FIG. 5 is a flowchart of step 104 in FIG. 1;
FIG. 6 is a flowchart of a knowledge classification method for multiple-choice questions provided by an embodiment of the present disclosure;
FIG. 7 is a functional block diagram of a training apparatus for a knowledge classification model provided by an embodiment of the present disclosure;
FIG. 8 is a functional block diagram of a knowledge classification apparatus for multiple-choice questions provided by an embodiment of the present disclosure;
FIG. 9 is a schematic diagram of the hardware structure of a computer device provided by an embodiment of the present disclosure.
Embodiments of the Invention
To make the purpose, technical solution, and advantages of the present application clearer, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described here are only used to explain the present application and are not intended to limit it.
It should be noted that although functional modules are divided in the apparatus diagrams and a logical order is shown in the flowcharts, in some cases the steps shown or described may be performed with a module division different from that in the apparatus, or in an order different from that in the flowcharts. The terms "first", "second", and the like in the description, the claims, and the above drawings are used to distinguish similar objects and are not necessarily used to describe a particular order or sequence.
Unless otherwise defined, all technical and scientific terms used herein have the same meaning as commonly understood by those skilled in the technical field to which the present application belongs. The terms used herein are only for the purpose of describing the embodiments of the present application and are not intended to limit the present application.
First, several terms used in the present application are explained:
Artificial intelligence (AI): a technical science that studies and develops theories, methods, technologies, and application systems for simulating, extending, and expanding human intelligence. Artificial intelligence is a branch of computer science that attempts to understand the essence of intelligence and to produce intelligent machines that can respond in a manner similar to human intelligence. Research in this field includes robotics, speech recognition, image recognition, natural language processing, and expert systems. Artificial intelligence can simulate the information processes of human consciousness and thinking. It also refers to the theories, methods, technologies, and application systems that use digital computers, or machines controlled by digital computers, to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use knowledge to obtain the best results.
Natural language processing (NLP): NLP uses computers to process, understand, and use human languages (such as Chinese and English). It is a branch of artificial intelligence and an interdisciplinary field between computer science and linguistics, often also called computational linguistics. Natural language processing includes syntactic analysis, semantic analysis, and discourse understanding. It is commonly used in machine translation, handwritten and printed character recognition, speech recognition and text-to-speech conversion, information retrieval, information extraction and filtering, text classification and clustering, and public opinion analysis and opinion mining. It involves data mining, machine learning, knowledge acquisition, knowledge engineering, and artificial intelligence research related to language processing, as well as linguistic research related to language computing.
Knowledge graph: a modern theory for multidisciplinary integration that combines the theories and methods of applied mathematics, graphics, information visualization, information science, and other disciplines with methods such as bibliometric citation analysis and co-occurrence analysis, and uses visual graphs to depict the core structure, development history, frontier fields, and overall knowledge architecture of a discipline. The main goal of a knowledge graph is to describe the entities and concepts that exist in the real world and the strong relations among them; a relation describes the association between two entities.
Entity: something that is distinguishable and exists independently, such as a particular person, city, plant, or commodity. Everything in the world is composed of concrete things, which are referred to as entities. Entities are the most basic elements of a knowledge graph, and different relations exist between different entities.
Concept: a set of entities of a certain type.
Semantic class (concept): a set of entities that share the same characteristics, such as countries, ethnic groups, books, or computers. Concepts mainly refer to sets, categories, object types, and kinds of things, such as people or geography.
Relation: a certain interrelation that exists between entities, between different concepts, or between a concept and an entity. A relation is formalized as a function that maps k points to a Boolean value. In a knowledge graph, a relation is a function that maps k graph nodes (entities, semantic classes, attribute values) to a Boolean value.
Attribute (value): the value of a specified attribute of an entity; an attribute edge points from an entity to its attribute value, and different attribute types correspond to different types of edges. For example, "area", "population", and "capital" are different attributes, and an attribute value is the value of such an attribute for a given object, for example 9.6 million square kilometers.
Triple: a triple ({E,R}) is a general way of representing a knowledge graph. The basic forms of a triple are (entity 1, relation, entity 2) and (entity, attribute, attribute value). Each entity (the extension of a concept) can be identified by a globally unique ID, each attribute-value pair (AVP) can be used to describe an intrinsic characteristic of an entity, and a relation can be used to connect two entities and describe the association between them. For example, in a knowledge graph, China is an entity and Beijing is an entity, so China-capital-Beijing is an example of an (entity, relation, entity) triple. Beijing is an entity, population is an attribute, and 20.693 million is an attribute value, so Beijing-population-20.693 million is an example of an (entity, attribute, attribute value) triple.
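The two triple forms described above can be sketched as a small data structure. This is only an illustration: the `Triple` type and the `lookup` helper are hypothetical names introduced here, and a real knowledge graph would normally key entities by unique IDs rather than by strings.

```python
from typing import List, NamedTuple


class Triple(NamedTuple):
    """One knowledge-graph edge (hypothetical helper type)."""
    head: str      # entity the edge starts from
    relation: str  # relation name, or attribute name
    tail: str      # second entity, or attribute value


# The two examples from the text.
TRIPLES = [
    Triple("China", "capital", "Beijing"),              # (entity, relation, entity)
    Triple("Beijing", "population", "20.693 million"),  # (entity, attribute, attribute value)
]


def lookup(head: str, relation: str) -> List[str]:
    """Return every tail stored for the given head and relation."""
    return [t.tail for t in TRIPLES if t.head == head and t.relation == relation]


print(lookup("China", "capital"))
print(lookup("Beijing", "population"))
```

Storing both forms in the same three-field shape is a simplification; it is what lets one lookup function serve relations and attributes alike.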
Token: a token is the basic unit of indexing and represents each indexed character. If a field is tokenized, the field has passed through an analyzer that converts its content into a sequence of tokens. During tokenization, the analyzer extracts the text content that should be indexed while applying its transformation logic (for example, removing stop words such as "a" or "the", performing stemming, and converting all case-insensitive text to lowercase).
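A minimal sketch of such an analyzer is shown below. The `analyze` function and the `STOP_WORDS` set are assumptions made for illustration; the sketch covers only lowercasing and stop-word removal, and stemming is left out.

```python
import re
from typing import List

# Illustrative stop-word list; the text names only "a" and "the".
STOP_WORDS = {"a", "the"}


def analyze(text: str) -> List[str]:
    """Lowercase the text, split it into word tokens, and drop stop words."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    return [w for w in words if w not in STOP_WORDS]


print(analyze("The house has a lovely garden"))
```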
BERT (Bidirectional Encoder Representations from Transformers) model: the BERT model further improves the generalization ability of word vector models and captures character-level, word-level, sentence-level, and even inter-sentence relationship features. It is built on the Transformer.
Large-scale pre-trained models such as BERT have achieved strong results in natural language processing tasks and are widely recognized in the industry. However, these models usually have a very large number of parameters (for example, BERT-base has 110 million parameters and BERT-large has 340 million), which poses major challenges for fine-tuning and online deployment. The sheer number of parameters makes these models slow to fine-tune and deploy and expensive to compute, and imposes significant latency and capacity limits on real-time applications, so model compression is of great importance.
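To give a rough sense of scale for the parameter counts above, the following sketch estimates the memory needed just to hold the weights. It assumes 32-bit floating-point storage (4 bytes per parameter), which is an assumption made here and not something the source states; optimizer state and activations would add to these figures.

```python
BYTES_PER_FP32 = 4  # assumption: weights stored as 32-bit floats


def fp32_weight_megabytes(num_params: int) -> float:
    """Approximate size of the weights alone, in megabytes (10**6 bytes)."""
    return num_params * BYTES_PER_FP32 / 1e6


print(fp32_weight_megabytes(110_000_000))  # BERT-base parameter count from the text
print(fp32_weight_megabytes(340_000_000))  # BERT-large parameter count from the text
```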
The embodiments of the present application may acquire and process relevant data based on artificial intelligence technology. Artificial intelligence (AI) refers to the theories, methods, technologies, and application systems that use digital computers, or machines controlled by digital computers, to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use knowledge to obtain the best results.
Basic artificial intelligence technologies generally include sensors, dedicated AI chips, cloud computing, distributed storage, big data processing, operation/interaction systems, and mechatronics. Artificial intelligence software technologies mainly include computer vision, robotics, biometrics, speech processing, natural language processing, and machine learning/deep learning.
In some application scenarios, such as online English education, questions that test English knowledge points need to be classified so that questions on the same knowledge point can be grouped together and users can be given targeted training. The number of English questions is very large, and new questions are developed every year; classifying each question manually involves a heavy workload, is inefficient, and is error-prone.
Based on this, the embodiments of the present disclosure provide a training method for a knowledge classification model, a knowledge classification method for multiple-choice questions, a training apparatus for a knowledge classification model, a knowledge classification apparatus for multiple-choice questions, a computer device, and a storage medium, which can improve the accuracy and efficiency with which the model classifies knowledge.
The training method for a knowledge classification model, the knowledge classification method for multiple-choice questions, the training apparatus for a knowledge classification model, the knowledge classification apparatus for multiple-choice questions, the computer device, and the storage medium provided by the embodiments of the present disclosure are described in detail through the following embodiments. The training method for the knowledge classification model is described first.
The training method for a knowledge classification model provided by the embodiments of the present disclosure relates to the technical field of machine learning. The method may be applied in a terminal or on a server side, or may be software running in a terminal or on a server side. In some embodiments, the terminal may be a smartphone, a tablet computer, a notebook computer, a desktop computer, a smart watch, or the like. The server side may be configured as an independent physical server, as a server cluster or distributed system composed of multiple physical servers, or as a cloud server that provides basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communications, middleware services, domain name services, security services, CDN, and big data and artificial intelligence platforms. The software may be an application that implements the training method for the knowledge classification model, but is not limited to the above forms.
FIG. 1 is an optional flowchart of the training method for a knowledge classification model provided by an embodiment of the present disclosure. The method in FIG. 1 may include, but is not limited to, steps 101 to 106.
Step 101: obtain original annotation data; the original annotation data includes question stem data, option data, and answer data;
Step 102: encode the question stem data to obtain a question stem representation vector;
Step 103: encode the option data and the answer data according to a preset knowledge graph to obtain option attribute values and answer attribute values;
Step 104: perform word segmentation and concatenation on the option attribute values and the answer attribute values to obtain an option-answer representation vector;
Step 105: concatenate the question stem representation vector and the option-answer representation vector to obtain question data;
Step 106: train a preset pre-trained model according to the question data to obtain a knowledge classification model, wherein the knowledge classification model is used to perform knowledge classification on a target question to obtain a knowledge point type.
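Steps 101 to 105 can be sketched in simplified form as follows. This is a toy illustration under several assumptions that do not come from the source: the `TOY_KG` dictionary stands in for the preset knowledge graph, token lists stand in for representation vectors, whitespace splitting stands in for word segmentation, and the `[SEP]` marker between the two parts is a convention borrowed from BERT-style inputs. Step 106 (training the pre-trained model) is not shown.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class LabeledQuestion:
    """One item of original annotation data (step 101)."""
    stem: str
    options: List[str]
    answer: str
    knowledge_point: str  # the label, e.g. "attributive clause"


# Hypothetical stand-in for the preset knowledge graph: entity -> attribute value.
TOY_KG = {
    "attributive clause": "clause that modifies a noun",
    "adverbial clause": "clause that modifies a verb",
}


def encode_stem(stem: str) -> List[str]:
    """Step 102 (simplified): lowercase and split the stem into tokens."""
    return stem.lower().split()


def kg_attribute(text: str) -> str:
    """Step 103 (simplified): look up an attribute value, falling back to the text itself."""
    key = text.lower()
    return TOY_KG.get(key, key)


def encode_options_answer(options: List[str], answer: str) -> List[str]:
    """Step 104 (simplified): segment each attribute value and concatenate the tokens."""
    tokens: List[str] = []
    for text in options + [answer]:
        tokens.extend(kg_attribute(text).split())
    return tokens


def build_question_input(q: LabeledQuestion) -> List[str]:
    """Step 105 (simplified): concatenate the stem part and the option-answer part."""
    return encode_stem(q.stem) + ["[SEP]"] + encode_options_answer(q.options, q.answer)


q = LabeledQuestion(
    stem="My house HAS a garden",
    options=["Attributive clause"],
    answer="Attributive clause",
    knowledge_point="attributive clause",
)
print(build_question_input(q))
```

In a fuller implementation the token lists would be mapped to IDs or embeddings before being passed, together with `knowledge_point` as the label, to the pre-trained model for fine-tuning.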
具体地,在一应用场景的步骤101中,需获取一定数量的原始标注数据,例如100万条原始标注数据,该原始标注数据可以是经过人工标注好的题目数据,该原始标注数据中标注有题目考察的知识点类型,即原始标注数据的的标签为知识点类型,例如,考察【定语从句】的知识点类型为定语从句,考察【状语从句】的知识点类型为状语从句。本实施例,利用标注好的100万条数据进行训练模型,从而只需要100万条数据的成本,就可以对几千万甚至更多道英语题目进行自动分类了。Specifically, in step 101 of an application scenario, it is necessary to obtain a certain amount of original labeling data, for example, 1 million pieces of original labeling data. The original labeling data may be manually labeled topic data. The type of knowledge points investigated by the topic, that is, the label of the original labeled data is the type of knowledge point. For example, the type of knowledge point investigated in [attributive clause] is an attributive clause, and the type of knowledge point investigated in [adverbial clause] is an adverbial clause. In this embodiment, 1 million labeled data are used to train the model, so tens of millions or even more English questions can be automatically classified at the cost of only 1 million data.
更进一步地,在一些应用场景,例如英语在线教育的应用场景中,该原始标注数据是对英语选择题的题干数据、选项数据和答案数据。在该英语在线教育的应用场景中,需要对考察相关英语知识点的题目数据进行分类,从而把相同知识点的题目进行划分,对用户进行专项训练。由于题目数据的数量过于庞大,而且每年都会研发一些新题目;若依靠人工对每道题进行划分,工作量大、效率低、容易出错。因此,本公开实施例,通过获取原始标注数据,并对原始标注数据中的题干数据进行编码处理,得到题干表征向量,并根据预设的知识图谱对原始标注数据中的选项数据和答案数据进行编码处理,从而可以得到选项属性值和答案属性值,再将所述选项属性值和所述答案属性值进行分词和拼接处理,得到选项答案表征向量,再将题干表征向量和选项答案表征向量进行向量拼接,从而可以得到题目数据,最后根据题目数据对预设的预训练模型进行训练,得到知识分类模型,该知识分类模型可以用于对目标题目进行知识分类处理,以得到知识点类型,本公开实施例得到的知识分类模型可以提高对知识分类的准确性和效率。Furthermore, in some application scenarios, such as the application scenario of English online education, the original annotation data is the question stem data, option data and answer data of English multiple-choice questions. In the application scenario of English online education, it is necessary to classify the topic data for examining relevant English knowledge points, so as to divide the topics of the same knowledge point and conduct special training for users. Because the amount of question data is too large, and some new questions are developed every year; if you rely on manual division of each question, the workload is heavy, the efficiency is low, and it is easy to make mistakes. Therefore, in the embodiment of the present disclosure, by obtaining the original annotation data, and encoding the question stem data in the original annotation data, the question stem representation vector is obtained, and the option data and answer in the original annotation data are analyzed according to the preset knowledge graph. 
The data is encoded, so that the option attribute value and the answer attribute value can be obtained, and then the option attribute value and the answer attribute value are subjected to word segmentation and splicing processing to obtain the option answer representation vector, and then the question stem representation vector and the option answer The characterization vectors are spliced to obtain the topic data, and finally the preset pre-training model is trained according to the topic data to obtain a knowledge classification model, which can be used to perform knowledge classification processing on the target topic to obtain knowledge points type, the knowledge classification model obtained in the embodiments of the present disclosure can improve the accuracy and efficiency of knowledge classification.
Take an English single-choice question as an example. In a single-choice question testing attributive clauses, the question stem data gives a sentence containing a clause: My house, which I bought last year, has got a lovely garden. The stem asks for the clause type of "which I bought last year". The option data consists of four options A, B, C, and D: option A is adverbial clause, option B is subject clause, option C is attributive clause, and option D is predicative clause. There is only one answer, and the answer data is: attributive clause. That is, the answer to this single-choice question is option C.
Referring to FIG. 2, in step 102 of some embodiments, encoding the question stem data to obtain the question stem representation vector specifically includes:
Step 201: preprocess the question stem data to obtain a preliminary question stem sequence;
Step 202: tokenize the preliminary question stem sequence to obtain the question stem representation vector.
In a specific application scenario, step 201 includes:
converting the English content of the question stem data to lowercase to obtain the preliminary question stem sequence.
For example, if the English content of the question stem data is: I lOVE YOU, it is converted entirely to lowercase, and the resulting preliminary question stem sequence is: i love you.
Further, step 201 also includes:
expanding English contractions in the question stem data to their full forms to obtain the preliminary question stem sequence.
For example, if the question stem data contains the contraction I'm, lowercasing it and expanding it to its full form yields the preliminary question stem sequence: i am.
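As a non-limiting illustration, the two preprocessing operations of step 201 might be sketched as follows. The function name `preprocess_stem` and the contraction table are assumptions made for this sketch; the disclosure itself only gives the examples "I lOVE YOU" and "i'm".

```python
# Hypothetical contraction table; only "i'm" -> "i am" comes from the text.
CONTRACTIONS = {
    "i'm": "i am",
    "it's": "it is",
    "don't": "do not",
}


def preprocess_stem(stem: str) -> str:
    """Lowercase the stem, then expand known contractions word by word."""
    # Normalize the typographic apostrophe so that "I’m" and "I'm" match.
    text = stem.lower().replace("\u2019", "'")
    words = [CONTRACTIONS.get(word, word) for word in text.split()]
    return " ".join(words)


print(preprocess_stem("I lOVE YOU"))
print(preprocess_stem("I'm playing"))
```

A lookup table is the simplest choice here; a production system would likely need a fuller contraction list and punctuation handling.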
In a specific application scenario, in step 202, tokenizing the preliminary question stem sequence to obtain the question stem representation vector specifically includes:
tokenizing the preliminary question stem sequence to obtain the question stem representation vector. In some embodiments, the preliminary question stem sequence is:
i am playing
The question stem representation vector obtained by tokenizing i am playing is:
[i, am, play, ing]
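The disclosure does not name the tokenizer. One way to reproduce the split shown above is greedy longest-match subword tokenization, sketched below with a toy vocabulary. The vocabulary, the `[UNK]` fallback, and the function names are assumptions; note also that BERT's own WordPiece tokenizer marks continuation pieces with a `##` prefix, which the example in the text omits.

```python
# Toy vocabulary, assumed for illustration only.
VOCAB = {"i", "am", "play", "ing", "love", "you"}


def tokenize_word(word, vocab=VOCAB):
    """Split one word into the longest vocabulary pieces, left to right."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        # Shrink the candidate piece until it is found in the vocabulary.
        while end > start and word[start:end] not in vocab:
            end -= 1
        if end == start:
            return ["[UNK]"]  # no piece matches: treat the whole word as unknown
        pieces.append(word[start:end])
        start = end
    return pieces


def tokenize(sequence, vocab=VOCAB):
    """Tokenize a whitespace-separated, preprocessed question stem sequence."""
    tokens = []
    for word in sequence.split():
        tokens.extend(tokenize_word(word, vocab))
    return tokens


print(tokenize("i am playing"))
```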
Referring to FIG. 3, in some embodiments, before step 103, the training method of the knowledge classification model further includes constructing a knowledge graph, which may include, but is not limited to, steps 301 to 303:
Step 301: obtain preset knowledge points;
Step 302: construct first triples and second triples according to the preset knowledge points;
Step 303: construct the knowledge graph according to the first triples and the second triples, where a first triple consists of a first knowledge entity, a relation, and a second knowledge entity, and a second triple consists of the second knowledge entity, an attribute, and an attribute value.
In step 301 of some embodiments, the preset knowledge points and related data may be crawled by technical means such as a web crawler, or may be obtained from a preset database. In some application scenarios, the preset knowledge points are preset English knowledge points, such as the English test points in an English online education scenario.
In step 302 of some embodiments, the English knowledge graph is constructed as follows: a first triple and a second triple are constructed for each of the preset knowledge points, where the first triple consists of a first knowledge entity, a relation, and a second knowledge entity, and the second triple consists of the second knowledge entity, an attribute, and an attribute value. The first triple establishes the association between the first knowledge entity and the second knowledge entity; specifically, the two entities are connected by an undirected edge. Regarding the first triple: if a relationship exists between two knowledge nodes, the two nodes are connected by an undirected edge; each knowledge node is called an entity, and the undirected edge represents the relationship between the two nodes. In the embodiments of the present disclosure, the two knowledge nodes correspond to the first knowledge entity and the second knowledge entity. In the second triple, the second knowledge entity is the name of the corresponding English knowledge point, so the second triple expresses the name of the English knowledge point, an attribute that the knowledge point has, and the value of that attribute.
In a specific application scenario, the first triple may be expressed as (clause, contains, attributive clause) or as (clause, contains, adverbial clause). Here, [clause] is the English knowledge point, which contains the two knowledge points [attributive clause] and [adverbial clause], and the relation between them is "contains".
In a specific application scenario, the second triple may be expressed as (attributive clause, grade, grade 8) and (attributive clause, relative word, which). Here, [attributive clause] has the attribute [grade] whose value is [grade 8], meaning that the attributive clause is a grade 8 knowledge point. [Attributive clause] also has the attribute [relative word], whose value is which.
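The two kinds of triples described above can be held in very simple data structures. The following is a minimal sketch, assuming an in-memory adjacency map for the undirected relation edges and a per-entity dictionary for attributes; the function and variable names are illustrative and not taken from the disclosure.

```python
from collections import defaultdict

# First triples: (first knowledge entity, relation, second knowledge entity).
RELATION_TRIPLES = [
    ("clause", "contains", "attributive clause"),
    ("clause", "contains", "adverbial clause"),
]
# Second triples: (second knowledge entity, attribute, attribute value).
ATTRIBUTE_TRIPLES = [
    ("attributive clause", "grade", "grade 8"),
    ("attributive clause", "relative word", "which"),
]


def build_graph(relation_triples, attribute_triples):
    """Build undirected relation edges and an attribute table from triples."""
    edges = defaultdict(set)
    attributes = defaultdict(dict)
    for head, relation, tail in relation_triples:
        # The edge is undirected, so it is stored under both endpoints.
        edges[head].add((relation, tail))
        edges[tail].add((relation, head))
    for entity, attribute, value in attribute_triples:
        attributes[entity][attribute] = value
    return edges, attributes


edges, attributes = build_graph(RELATION_TRIPLES, ATTRIBUTE_TRIPLES)
print(sorted(edges["clause"]))
print(attributes["attributive clause"])
```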
In the embodiments of the present disclosure, constructing a knowledge graph from the English knowledge points makes the structure of the knowledge points, and what each one tests, clear. In addition, whether two knowledge points are similar can be judged by computing the sum of the edges between them; this may be done with reference to the related art and is not limited by the embodiments of the present disclosure.
Referring to FIG. 4, in step 103 of some embodiments, the preset knowledge graph includes a first triple and multiple second triples, and encoding the option data and answer data according to the preset knowledge graph to obtain the option attribute values and the answer attribute value may include, but is not limited to:
Step 401: encode the option data according to the first triple and the multiple second triples to obtain the option attribute values, where the option attribute values include the attribute values of the multiple second triples;
Step 402: encode the answer data according to the first triple and one of the second triples to obtain the answer attribute value, where the answer attribute value is one of the option attribute values.
Specifically, to improve the accuracy of the model, the embodiments of the present disclosure introduce knowledge from the knowledge graph at the stage of encoding the options and the answer: the knowledge entities for a question's options and answer are obtained from the first and second triples of the knowledge graph. Take again an English single-choice question testing attributive clauses, whose question stem data gives the sentence: My house, which I bought last year, has got a lovely garden, and asks for the clause type of "which I bought last year". The option data consists of four options A, B, C, and D: option A is adverbial clause, option B is subject clause, option C is attributive clause, and option D is predicative clause. There is only one answer, and the answer data is: attributive clause; that is, the answer is option C. The first triple of the knowledge graph is (clause, contains, attributive clause), and the second triple is (attributive clause, relative word, which). The word "which" in the clause "which I bought last year" is a relative word, which indicates that the clause is an attributive clause; this matches the second triple (attributive clause, relative word, which). The answer to the question is therefore that the clause is an attributive clause, which corresponds to the first triple (clause, contains, attributive clause). Encoding the option data according to the first triple and the multiple second triples yields the option attribute values: adverbial clause, subject clause, attributive clause, predicative clause. Encoding the answer data according to the first triple and one of the second triples yields the answer attribute value: attributive clause (that is, the attributive clause among the option attribute values). In this scenario, the English knowledge point being tested is identifying attributive clauses among clauses.
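The disclosure does not spell out the encoding procedure of steps 401 and 402. One possible reading, sketched below, is a lookup that maps each option and the keyed answer to a knowledge entity known to the graph. The entity set, the function names, and the error handling are all assumptions of this sketch.

```python
# Entities assumed to hang off "clause" via "contains" in the example graph.
KNOWN_ENTITIES = {
    "adverbial clause",
    "subject clause",
    "attributive clause",
    "predicative clause",
}


def encode_options(options, known_entities=KNOWN_ENTITIES):
    """Step 401 (sketch): return option attribute values in option order."""
    values = []
    for key in sorted(options):
        value = options[key]
        if value not in known_entities:
            raise ValueError(f"option {key!r} is not a known entity: {value!r}")
        values.append(value)
    return values


def encode_answer(options, answer_key, known_entities=KNOWN_ENTITIES):
    """Step 402 (sketch): return the attribute value of the keyed answer."""
    value = options[answer_key]
    if value not in known_entities:
        raise ValueError(f"answer {answer_key!r} is not a known entity: {value!r}")
    return value


OPTIONS = {
    "A": "adverbial clause",
    "B": "subject clause",
    "C": "attributive clause",
    "D": "predicative clause",
}
option_values = encode_options(OPTIONS)
answer_value = encode_answer(OPTIONS, "C")
print(option_values, answer_value)
```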
Referring to FIG. 5, in step 104 of some embodiments, tokenizing and splicing the option attribute values and the answer attribute value to obtain the option-answer representation vector may include, but is not limited to:
Step 501: convert the option attribute values and the answer attribute value into word vectors;
Step 502: splice the word-vectorized option attribute values and answer attribute value to obtain the option-answer representation vector.
Specifically, in some embodiments, the option attribute values and the answer attribute value are each converted into knowledge word vectors, giving one vector token for the option attribute values and one vector token for the answer attribute value; the two vector tokens are then spliced to obtain the option-answer representation vector.
It should be understood that in other embodiments the option attribute values and the answer attribute value may be spliced first to obtain an option-answer attribute value, which is then converted into a single vector token, namely the option-answer representation vector.
In a specific application scenario, the option attribute values serve as sentence A of the sequence and the answer attribute value serves as sentence B, and sentences A and B are spliced into the option-answer representation vector. Specifically, the option-answer representation vector may be a sequence of length 320. If it is shorter than 320, it is zero-padded. Because the option attribute values may be very long, they may need to be truncated: the tail of the longer sentence is cut off each time until the whole option-answer representation vector has length 320.
In the scenario of a single-choice question testing attributive clauses, the question stem data gives a clause and asks for its type. The options are A, B, C, and D: option A is adverbial clause, option B is subject clause, option C is attributive clause, and option D is predicative clause. The answer data is: attributive clause. The option attribute values are therefore adverbial clause, subject clause, attributive clause, and predicative clause, and the answer attribute value is attributive clause. The option-answer representation vector obtained by tokenizing and splicing them is expressed as [adverbial clause, subject clause, attributive clause, predicative clause, attributive clause].
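The splicing, truncation, and zero-padding described above might look like the following sketch. The length 320 and the "trim the tail of the longer sentence" rule come from the text; the function name and the use of the integer 0 as the padding value are assumptions.

```python
MAX_LEN = 320  # sequence length given in the text


def splice_and_fit(option_tokens, answer_tokens, max_len=MAX_LEN, pad=0):
    """Splice sentence A (options) and sentence B (answer) to a fixed length."""
    sentence_a = list(option_tokens)
    sentence_b = list(answer_tokens)
    # Trim one token at a time from the tail of the longer sentence.
    while len(sentence_a) + len(sentence_b) > max_len:
        if len(sentence_a) >= len(sentence_b):
            sentence_a.pop()
        else:
            sentence_b.pop()
    spliced = sentence_a + sentence_b
    # Zero-pad on the right if the spliced sequence is too short.
    return spliced + [pad] * (max_len - len(spliced))


options = ["adverbial clause", "subject clause",
           "attributive clause", "predicative clause"]
answer = ["attributive clause"]
vector = splice_and_fit(options, answer)
print(vector[:6], len(vector))
```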
In step 105 of some embodiments, splicing the question stem representation vector and the option-answer representation vector to obtain the question data may include, but is not limited to:
splicing the question stem representation vector and the option-answer representation vector using separators to obtain the question data.
In some embodiments, the separators may be a pair of placeholders: a first placeholder [CLS] and a second placeholder [SEP], where the first placeholder [CLS] marks the beginning of the sequence and the second placeholder [SEP] marks the end of the sequence. CLS (classifier token), also called the classifier identifier, is a special token whose word embedding is typically used for classification tasks; SEP (sentence separator), also called the sentence separator identifier, is likewise a special token and can be used to separate two sentences.
Splicing the question stem representation vector and the option-answer representation vector using the separators specifically includes:
placing the question stem representation vector between the first placeholder and the second placeholder, with the second placeholder located between the question stem representation vector and the option-answer representation vector, and splicing the two vectors to obtain the question data. Specifically, the question data has the form: [<CLS>, question stem representation vector, <SEP>, option-answer representation vector]
A specific application scenario is described below:
For example, the question stem representation vector is: [i, am, play, ing]
The option-answer representation vector is: [adverbial clause, subject clause, attributive clause, predicative clause, attributive clause]
The question data obtained by splicing the question stem representation vector and the option-answer representation vector using the separators is then:
[<CLS>, i, am, play, ing, <SEP>, adverbial clause, subject clause, attributive clause, predicative clause, attributive clause]
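The layout [<CLS>, question stem representation vector, <SEP>, option-answer representation vector] can be reproduced with a one-line helper. This is only a sketch over string tokens; the function name is an assumption.

```python
def build_question_data(stem_tokens, option_answer_tokens):
    """Splice stem and option-answer tokens with <CLS> and <SEP> placeholders."""
    return ["<CLS>", *stem_tokens, "<SEP>", *option_answer_tokens]


stem = ["i", "am", "play", "ing"]
option_answer = ["adverbial clause", "subject clause", "attributive clause",
                 "predicative clause", "attributive clause"]
question_data = build_question_data(stem, option_answer)
print(question_data)
```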
In step 106 of some embodiments, the preset pre-trained model may be a BERT model. Specifically, the question data obtained in step 105 is used as the input of the BERT model to train it, yielding the knowledge classification model, whose underlying framework is the BERT model. The knowledge classification model is used to predict the knowledge type of a target question. Specifically, the knowledge classification model includes a softmax classifier: the model obtains the feature vector information corresponding to <CLS> from the input question data, and passing it through the softmax classifier predicts the knowledge type of the target question. The target question is the question input to the knowledge classification model; it may be, for example, a multiple-choice question, and more specifically, in an English single-choice scenario, a single-choice question testing attributive clauses.
It should be understood that each token has three embeddings: a token embedding, a position embedding, and a segment embedding. The token embedding is the vector representation of the token learned by pre-training the model on a corpus. The position embedding encodes the index of the token's position in the sequence. The segment embedding marks whether the token belongs to sentence A or sentence B: tokens in sentence A have segment 0 and tokens in sentence B have segment 1. Splicing the three embeddings together forms the word embedding of each token. The embeddings of the whole sequence are input to a multi-layer bidirectional Transformer encoder, and the vector of the first token of the last hidden layer (that is, [CLS]) is taken as the aggregate representation of the whole sentence, that is, of the whole option sequence. In this embodiment, passing the sequence represented by the question data through the softmax classifier predicts the knowledge type of the question.
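The position and segment indices that feed the position and segment embeddings can be derived directly from the token sequence. The sketch below assumes that everything up to and including the first <SEP> belongs to sentence A (segment 0) and everything after it to sentence B (segment 1); the text does not state where the boundary falls for the question data, so this boundary is an assumption.

```python
def segment_and_position_ids(tokens, sep="<SEP>"):
    """Return (segment_ids, position_ids) for a spliced token sequence."""
    segment_ids = []
    current_segment = 0
    for token in tokens:
        segment_ids.append(current_segment)
        # Tokens after the first separator are assigned to sentence B.
        if token == sep and current_segment == 0:
            current_segment = 1
    position_ids = list(range(len(tokens)))
    return segment_ids, position_ids


tokens = ["<CLS>", "i", "am", "play", "ing", "<SEP>",
          "adverbial clause", "attributive clause"]
segment_ids, position_ids = segment_and_position_ids(tokens)
print(segment_ids)
print(position_ids)
```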
In the embodiments of the present disclosure, original annotation data is obtained; the question stem data in it is encoded to obtain a question stem representation vector; the option data and answer data are encoded according to a preset knowledge graph to obtain option attribute values and an answer attribute value; the option attribute values and the answer attribute value are tokenized and spliced to obtain an option-answer representation vector; the question stem representation vector and the option-answer representation vector are spliced to obtain question data; and finally a preset pre-trained model is trained on the question data to obtain a knowledge classification model. The knowledge classification model can be used to classify a target question by knowledge and thereby obtain its knowledge point type, and it can improve the accuracy and efficiency of knowledge classification.
The embodiments of the present disclosure classify English single-choice questions on the basis of a knowledge graph and deep learning, so that the model automatically identifies the knowledge point each question tests. Compared with conventional classification methods, the technical solution of the embodiments can improve the accuracy and efficiency of knowledge classification: by introducing the knowledge graph encoding information (triple information) of the options and the answer, the knowledge type of a question can be predicted more accurately. For a fixed sample annotation cost, new questions can be classified more efficiently.
Referring to FIG. 6, the embodiments of the present disclosure also provide a knowledge classification method for multiple-choice questions, which relates to the technical field of machine learning. The method may be applied in a terminal or on a server side, or may be software running in a terminal or on a server side. In some embodiments, the terminal may be a smartphone, tablet computer, notebook computer, desktop computer, smart watch, or the like; the server side may be configured as an independent physical server, as a server cluster or distributed system composed of multiple physical servers, or as a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, CDN, and big data and artificial intelligence platforms; the software may be an application that implements the knowledge classification method for multiple-choice questions, but is not limited to the above forms.
FIG. 6 is an optional flowchart of the knowledge classification method for multiple-choice questions provided by the embodiments of the present disclosure. The method in FIG. 6 may include, but is not limited to, steps 601 to 604:
Step 601: obtain multiple-choice question data to be classified, where the multiple-choice question data includes question stem data, option data, and answer data;
Step 602: input the multiple-choice question data into a knowledge classification model, where the knowledge classification model is trained according to the method of the first aspect above;
Step 603: perform feature extraction on the multiple-choice question data through the knowledge classification model to obtain feature vector information;
Step 604: perform knowledge classification according to the feature vector information to obtain the knowledge point type.
Specifically, in step 601, the multiple-choice question data to be classified includes question stem data, option data, and answer data. It differs from the original annotation data in that the original annotation data includes the knowledge point type, whereas the multiple-choice question data does not.
It should be understood that the target question mentioned above includes the multiple-choice question data to be classified.
In some embodiments, the knowledge classification model includes a softmax classifier.
In this knowledge classification method for multiple-choice questions, the knowledge classification model performs feature extraction on the multiple-choice question data to obtain the feature vector information corresponding to <CLS>; the feature vector information includes the question stem representation vector and the option-answer representation vector. The question stem representation vector is the same as in the training method of the knowledge classification model described above: it is placed between the first placeholder <CLS> and the second placeholder <SEP>, so the question stem representation vector can be said to include the first placeholder <CLS>. Also as in the training method, the second placeholder <SEP> is placed between the question stem representation vector and the option-answer representation vector, so the option-answer representation vector can be said to include the second placeholder <SEP>.
In step 604 of some embodiments, the feature vector information corresponding to <CLS> obtained in step 603 is passed through a softmax classifier, which performs knowledge classification on it and thereby predicts the knowledge type of the question.
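A softmax classifier over the <CLS> feature vector amounts to a linear layer followed by a softmax. The sketch below uses plain Python and made-up weights purely to show the computation; the function names, the weight values, and the label set are assumptions, and in the disclosed method the weights would be learned during training.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of scores."""
    peak = max(logits)
    exps = [math.exp(score - peak) for score in logits]
    total = sum(exps)
    return [value / total for value in exps]


def classify(cls_vector, weights, biases, labels):
    """Linear layer plus softmax on the <CLS> vector; returns (label, probs)."""
    logits = [
        sum(w * x for w, x in zip(row, cls_vector)) + bias
        for row, bias in zip(weights, biases)
    ]
    probs = softmax(logits)
    best = max(range(len(labels)), key=probs.__getitem__)
    return labels[best], probs


# Made-up 2-dimensional <CLS> vector and weights, for illustration only.
LABELS = ["attributive clause", "adverbial clause"]
label, probs = classify([1.0, 0.0], [[2.0, 0.0], [0.0, 1.0]], [0.0, 0.0], LABELS)
print(label, probs)
```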
In some application scenarios, such as English online education, questions that test particular English knowledge points need to be classified so that questions on the same knowledge point can be grouped together for targeted user practice. Because the number of questions is very large and new questions are developed every year, classifying each question manually is labor-intensive, inefficient, and error-prone. The embodiments of the present disclosure construct a relevant English knowledge graph and apply deep learning to classify English multiple-choice questions, so that the model automatically identifies the knowledge point each question tests.
The embodiments of the present disclosure classify English multiple-choice questions based on a knowledge graph and deep learning, so that a model can automatically identify the knowledge point each question tests. Compared with conventional classification methods, the technical solutions of the embodiments of the present disclosure can improve the accuracy and efficiency of knowledge classification: by introducing the knowledge graph encoding information (triple information) of the options and the answer, the knowledge type of a question can be predicted more accurately. For a fixed sample annotation cost, new questions can be classified more efficiently.
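As an illustration of the triple information mentioned above, the following sketch models a first triple (first knowledge entity, relation, second knowledge entity) and several second triples (second knowledge entity, attribute, attribute value), and looks up attribute values for the options and the answer. The graph contents, the `Triple` type, the function names, and the idea of keying second triples by option text are all assumptions made for the example; the disclosure does not specify how option text is matched to a triple.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    head: str  # an entity
    link: str  # a relation (first triple) or an attribute (second triple)
    tail: str  # an entity (first triple) or an attribute value (second triple)


# Hypothetical first triple: (first knowledge entity, relation, second knowledge entity).
FIRST_TRIPLE = Triple("verb", "has_property", "tense")

# Hypothetical second triples: (second knowledge entity, attribute, attribute value),
# keyed by option text purely for this example.
SECOND_TRIPLES = {
    "goes": Triple("tense", "form", "simple present"),
    "went": Triple("tense", "form", "simple past"),
    "is going": Triple("tense", "form", "present continuous"),
    "will go": Triple("tense", "form", "simple future"),
}


def attribute_value(text: str) -> str:
    """Return the attribute value of the second triple matched by an option or answer text."""
    triple = SECOND_TRIPLES[text.strip().lower()]
    # A second triple must start from the second knowledge entity of the first triple.
    if triple.head != FIRST_TRIPLE.tail:
        raise ValueError(f"triple for {text!r} is not attached to {FIRST_TRIPLE.tail!r}")
    return triple.tail


def encode_options_and_answer(options: list[str], answer: str) -> tuple[list[str], str]:
    """Options use several second triples; the answer uses exactly one of them."""
    return [attribute_value(o) for o in options], attribute_value(answer)


option_values, answer_value = encode_options_and_answer(
    ["goes", "went", "is going", "will go"], "goes"
)
print(option_values, answer_value)
```

Because the answer is one of the options, its attribute value is always one of the option attribute values, which matches the relationship stated in the claims.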
Referring to FIG. 7, an embodiment of the present disclosure further provides a training apparatus for a knowledge classification model, which can implement the training method for the knowledge classification model described above. The training apparatus includes: an original data acquisition module, configured to obtain original annotation data, wherein the original annotation data includes question stem data, option data and answer data; a question stem encoding module, configured to encode the question stem data to obtain a question stem representation vector; an option answer encoding module, configured to encode the option data and the answer data according to a preset knowledge graph to obtain an option attribute value and an answer attribute value; a tokenization and concatenation module, configured to tokenize and concatenate the option attribute value and the answer attribute value to obtain an option answer representation vector; a vector concatenation module, configured to concatenate the question stem representation vector and the option answer representation vector to obtain question data; and a classification model training module, configured to train a preset pre-trained model according to the question data to obtain a knowledge classification model, wherein the knowledge classification model is used to perform knowledge classification on a target question to obtain a knowledge point type.
The training apparatus for the knowledge classification model in this embodiment is used to execute the training method for the knowledge classification model in the above embodiments. Its specific processing is the same as that of the training method in the above embodiments and is not repeated here.
Referring to FIG. 8, an embodiment of the present disclosure further provides a knowledge classification apparatus for multiple-choice questions, which can implement the knowledge classification method for multiple-choice questions described above. The knowledge classification apparatus includes: a multiple-choice question data acquisition module, configured to obtain multiple-choice question data to be classified, wherein the multiple-choice question data includes question stem data, option data and answer data; a data input module, configured to input the multiple-choice question data into a knowledge classification model, wherein the knowledge classification model is trained according to the training method for the knowledge classification model of the first aspect; a feature extraction module, configured to perform feature extraction on the multiple-choice question data through the knowledge classification model to obtain feature vector information; and a knowledge classification module, configured to perform knowledge classification according to the feature vector information to obtain a knowledge point type.
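The final classification step can be sketched as follows, under assumptions that are not stated in this passage: the feature vector information is taken to be a single fixed-length vector, and the knowledge classification module is taken to be a linear layer followed by softmax. The weights, labels, and function names are hypothetical.

```python
import math


def softmax(scores: list[float]) -> list[float]:
    """Convert raw scores to probabilities (shifted by the maximum for numerical stability)."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def classify(feature: list[float], weights: list[list[float]], labels: list[str]) -> tuple[str, list[float]]:
    """Score each knowledge point type with one weight row and return the most probable type."""
    scores = [sum(w * x for w, x in zip(row, feature)) for row in weights]
    probs = softmax(scores)
    best = max(range(len(labels)), key=probs.__getitem__)
    return labels[best], probs


# Hypothetical feature vector, weight matrix, and knowledge point types.
feature = [1.0, 0.0]
weights = [[2.0, 0.0], [0.0, 2.0], [-1.0, 0.0]]
labels = ["tense", "preposition", "article"]

knowledge_point, probabilities = classify(feature, weights, labels)
print(knowledge_point, probabilities)
```

In practice the feature vector would come from the trained model's encoder and the weights would be learned during training; the sketch only shows how a feature vector is turned into a knowledge point type.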
The knowledge classification apparatus for multiple-choice questions in this embodiment is used to execute the knowledge classification method for multiple-choice questions in the above embodiments. Its specific processing is the same as that of the knowledge classification method in the above embodiments and is not repeated here.
An embodiment of the present disclosure further provides a computer device, including:
at least one memory;
at least one processor;
at least one program;
The program is stored in the memory, and the processor executes the at least one program to implement the training method for the knowledge classification model or the knowledge classification method for multiple-choice questions described above. The computer device may be any intelligent terminal, including a mobile phone, a tablet computer, a personal digital assistant (PDA), an in-vehicle computer, and the like.
Referring to FIG. 9, which illustrates the hardware structure of a computer device according to another embodiment, the computer device includes:
a processor 701, which may be implemented as a general-purpose central processing unit (CPU), a microprocessor, an application-specific integrated circuit (ASIC), or one or more integrated circuits, and which executes the relevant programs to implement the technical solutions provided by the embodiments of the present disclosure;
a memory 702, which may be implemented as a read-only memory (ROM), a static storage device, a dynamic storage device, a random access memory (RAM), or the like. The memory 702 can store an operating system and other application programs. When the technical solutions provided by the embodiments of this specification are implemented through software or firmware, the relevant program code is stored in the memory 702 and is called by the processor 701 to execute the training method for the knowledge classification model or the knowledge classification method for multiple-choice questions of the embodiments of the present disclosure. The training method for the knowledge classification model includes: obtaining original annotation data, wherein the original annotation data includes question stem data, option data and answer data; encoding the question stem data to obtain a question stem representation vector; encoding the option data and the answer data according to a preset knowledge graph to obtain an option attribute value and an answer attribute value; tokenizing and concatenating the option attribute value and the answer attribute value to obtain an option answer representation vector; concatenating the question stem representation vector and the option answer representation vector to obtain question data; and training a preset pre-trained model according to the question data to obtain a knowledge classification model, wherein the knowledge classification model is used to perform knowledge classification on a target question to obtain a knowledge point type. The knowledge classification method for multiple-choice questions includes: obtaining multiple-choice question data to be classified, wherein the multiple-choice question data includes question stem data; encoding the question stem data to obtain a question stem representation vector; inputting the question stem representation vector into the knowledge classification model, wherein the knowledge classification model is trained according to the training method described above; performing feature extraction on the question stem data through the knowledge classification model to obtain feature vector information; and performing knowledge classification according to the feature vector information to obtain a knowledge point type;
an input/output interface 703, used for information input and output;
a communication interface 704, used for communication between this device and other devices, either in a wired manner (for example, USB or a network cable) or in a wireless manner (for example, a mobile network, Wi-Fi, or Bluetooth); and
a bus 705, which transmits information between the components of the device (for example, the processor 701, the memory 702, the input/output interface 703, and the communication interface 704);
wherein the processor 701, the memory 702, the input/output interface 703, and the communication interface 704 are communicatively connected to one another inside the device through the bus 705.
An embodiment of the present disclosure further provides a storage medium. The storage medium is a computer-readable storage medium, which may be non-volatile or volatile. The computer-readable storage medium stores computer-executable instructions that cause a computer to execute the training method for the knowledge classification model or the knowledge classification method for multiple-choice questions described above. The training method for the knowledge classification model includes: obtaining original annotation data, wherein the original annotation data includes question stem data, option data and answer data; encoding the question stem data to obtain a question stem representation vector; encoding the option data and the answer data according to a preset knowledge graph to obtain an option attribute value and an answer attribute value; tokenizing and concatenating the option attribute value and the answer attribute value to obtain an option answer representation vector; concatenating the question stem representation vector and the option answer representation vector to obtain question data; and training a preset pre-trained model according to the question data to obtain a knowledge classification model, wherein the knowledge classification model is used to perform knowledge classification on a target question to obtain a knowledge point type. The knowledge classification method for multiple-choice questions includes: obtaining multiple-choice question data to be classified, wherein the multiple-choice question data includes question stem data; encoding the question stem data to obtain a question stem representation vector; inputting the question stem representation vector into the knowledge classification model, wherein the knowledge classification model is trained according to the training method described above; performing feature extraction on the question stem data through the knowledge classification model to obtain feature vector information; and performing knowledge classification according to the feature vector information to obtain a knowledge point type.
With the training method for the knowledge classification model, the knowledge classification method for multiple-choice questions, the training apparatus for the knowledge classification model, the knowledge classification apparatus for multiple-choice questions, the computer device, and the storage medium proposed by the embodiments of the present disclosure, original annotation data is obtained, and the question stem data in the original annotation data is encoded to obtain a question stem representation vector. The option data and answer data in the original annotation data are encoded according to a preset knowledge graph to obtain an option attribute value and an answer attribute value, which are then tokenized and concatenated to obtain an option answer representation vector. The question stem representation vector and the option answer representation vector are concatenated to obtain question data, and a preset pre-trained model is trained on the question data to obtain a knowledge classification model. The knowledge classification model can be used to classify a target question and obtain its knowledge point type, and can improve the accuracy and efficiency of knowledge classification.
The embodiments of the present disclosure classify English multiple-choice questions based on a knowledge graph and deep learning, so that a model can automatically identify the knowledge point each question tests. Compared with conventional classification methods, the technical solutions of the embodiments of the present disclosure can improve the accuracy and efficiency of knowledge classification: by introducing the knowledge graph encoding information (triple information) of the options and the answer, the knowledge type of a question can be predicted more accurately. For a fixed sample annotation cost, new questions can be classified more efficiently.
As a non-transitory computer-readable storage medium, the memory can be used to store non-transitory software programs and non-transitory computer-executable programs. In addition, the memory may include high-speed random access memory and may also include non-transitory memory, such as at least one magnetic disk storage device, flash memory device, or other non-transitory solid-state storage device. In some implementations, the memory optionally includes memory located remotely from the processor, and such remote memory may be connected to the processor through a network. Examples of such networks include, but are not limited to, the Internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The embodiments described herein are intended to explain the technical solutions of the present disclosure more clearly and do not limit them. Those skilled in the art will appreciate that, as technology evolves and new application scenarios emerge, the technical solutions provided by the embodiments of the present disclosure are equally applicable to similar technical problems.
Those skilled in the art will understand that the technical solutions shown in FIGS. 1-6 do not limit the embodiments of the present disclosure, which may include more or fewer steps than illustrated, combine certain steps, or use different steps.
The apparatus embodiments described above are merely illustrative. The units described as separate components may or may not be physically separate; that is, they may be located in one place or distributed across multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
Those of ordinary skill in the art will understand that all or some of the steps of the methods disclosed above, and the functional modules or units of the systems and devices, may be implemented as software, firmware, hardware, or an appropriate combination thereof.
The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, the functional units in the embodiments of the present application may be integrated into one processing unit, may each exist physically on their own, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware or in the form of a software functional unit.
The preferred embodiments of the present disclosure have been described above with reference to the accompanying drawings, which does not limit the scope of rights of the embodiments of the present disclosure. Any modification, equivalent replacement, or improvement made by those skilled in the art without departing from the scope and essence of the embodiments of the present disclosure shall fall within the scope of rights of the embodiments of the present disclosure.

Claims (20)

  1. A training method for a knowledge classification model, comprising:
    obtaining original annotation data, wherein the original annotation data comprises question stem data, option data and answer data;
    encoding the question stem data to obtain a question stem representation vector;
    encoding the option data and the answer data according to a preset knowledge graph to obtain an option attribute value and an answer attribute value;
    tokenizing and concatenating the option attribute value and the answer attribute value to obtain an option answer representation vector;
    concatenating the question stem representation vector and the option answer representation vector to obtain question data;
    training a preset pre-trained model according to the question data to obtain a knowledge classification model, wherein the knowledge classification model is used to perform knowledge classification on a target question to obtain a knowledge point type.
  2. The method according to claim 1, wherein the encoding the question stem data to obtain a question stem representation vector comprises:
    preprocessing the question stem data by converting its English content to lowercase to obtain a preliminary question stem sequence;
    tokenizing the preliminary question stem sequence to obtain the question stem representation vector.
  3. The method according to claim 1, wherein, before the encoding the option data and the answer data according to a preset knowledge graph to obtain an option attribute value and an answer attribute value, the method further comprises constructing the knowledge graph, which specifically comprises:
    obtaining preset knowledge points;
    constructing a first triple and a second triple according to the preset knowledge points;
    constructing the knowledge graph according to the first triple and the second triple, wherein the first triple comprises a first knowledge entity, a relation, and a second knowledge entity, and the second triple comprises the second knowledge entity, an attribute, and an attribute value.
  4. The method according to claim 3, wherein the knowledge graph comprises a first triple and a plurality of second triples, and the encoding the option data and the answer data according to a preset knowledge graph to obtain an option attribute value and an answer attribute value comprises:
    encoding the option data according to the first triple and the plurality of second triples to obtain the option attribute value, wherein the option attribute value comprises the attribute values of the plurality of second triples;
    encoding the answer data according to the first triple and one of the second triples to obtain the answer attribute value, wherein the answer attribute value is one of the plurality of attribute values in the option attribute value.
  5. The method according to any one of claims 1 to 4, wherein the tokenizing and concatenating the option attribute value and the answer attribute value to obtain an option answer representation vector comprises:
    converting the option attribute value and the answer attribute value into word vectors to obtain a word-vectorized option attribute value and a word-vectorized answer attribute value;
    concatenating the word-vectorized option attribute value and the word-vectorized answer attribute value to obtain the option answer representation vector.
  6. The method according to any one of claims 1 to 4, wherein the concatenating the question stem representation vector and the option answer representation vector to obtain question data comprises:
    concatenating the question stem representation vector and the option answer representation vector by means of separators to obtain the question data, wherein the separators comprise a first placeholder and a second placeholder, and the concatenating by means of separators specifically comprises:
    placing the question stem representation vector between the first placeholder and the second placeholder, placing the second placeholder between the question stem representation vector and the option answer representation vector, and concatenating the question stem representation vector and the option answer representation vector to obtain the question data.
  7. A knowledge classification method for multiple-choice questions, comprising:
    obtaining multiple-choice question data to be classified, wherein the multiple-choice question data comprises question stem data, option data and answer data;
    inputting the multiple-choice question data into a knowledge classification model, wherein the knowledge classification model is trained according to the method of any one of claims 1 to 6;
    performing feature extraction on the multiple-choice question data through the knowledge classification model to obtain feature vector information;
    performing knowledge classification according to the feature vector information to obtain a knowledge point type.
  8. A training apparatus for a knowledge classification model, comprising:
    an original data acquisition module, configured to obtain original annotation data, wherein the original annotation data comprises question stem data, option data and answer data;
    a question stem encoding module, configured to encode the question stem data to obtain a question stem representation vector;
    an option answer encoding module, configured to encode the option data and the answer data according to a preset knowledge graph to obtain an option attribute value and an answer attribute value;
    a tokenization and concatenation module, configured to tokenize and concatenate the option attribute value and the answer attribute value to obtain an option answer representation vector;
    a vector concatenation module, configured to concatenate the question stem representation vector and the option answer representation vector to obtain question data;
    a classification model training module, configured to train a preset pre-trained model according to the question data to obtain a knowledge classification model, wherein the knowledge classification model is used to perform knowledge classification on a target question to obtain a knowledge point type.
  9. A computer device, comprising:
    at least one memory;
    at least one processor;
    at least one program;
    wherein the program is stored in the memory, and the processor executes the at least one program to implement a training method for a knowledge classification model, the training method comprising:
    obtaining original annotation data, wherein the original annotation data comprises question stem data, option data and answer data;
    encoding the question stem data to obtain a question stem representation vector;
    encoding the option data and the answer data according to a preset knowledge graph to obtain an option attribute value and an answer attribute value;
    tokenizing and concatenating the option attribute value and the answer attribute value to obtain an option answer representation vector;
    concatenating the question stem representation vector and the option answer representation vector to obtain question data;
    training a preset pre-trained model according to the question data to obtain a knowledge classification model, wherein the knowledge classification model is used to perform knowledge classification on a target question to obtain a knowledge point type.
  10. The computer device according to claim 9, wherein the encoding the question stem data to obtain a question stem representation vector comprises:
    preprocessing the question stem data by converting its English content to lowercase to obtain a preliminary question stem sequence;
    tokenizing the preliminary question stem sequence to obtain the question stem representation vector.
  11. The computer device according to claim 9, wherein, before the encoding the option data and the answer data according to a preset knowledge graph to obtain an option attribute value and an answer attribute value, the training method for the knowledge classification model further comprises constructing the knowledge graph, which specifically comprises:
    obtaining preset knowledge points;
    constructing a first triple and a second triple according to the preset knowledge points;
    constructing the knowledge graph according to the first triple and the second triple, wherein the first triple comprises a first knowledge entity, a relation, and a second knowledge entity, and the second triple comprises the second knowledge entity, an attribute, and an attribute value.
  12. The computer device according to claim 11, wherein the knowledge graph comprises a first triple and a plurality of second triples, and encoding the option data and the answer data according to the preset knowledge graph to obtain the option attribute value and the answer attribute value comprises:
    encoding the option data according to the first triple and the plurality of second triples to obtain the option attribute value, wherein the option attribute value comprises the attribute values of the plurality of second triples; and
    encoding the answer data according to the first triple and one of the second triples to obtain the answer attribute value, wherein the answer attribute value is one of the plurality of attribute values in the option attribute value.
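One way to read claim 12 is that every option is replaced by the attribute value of the second triple it corresponds to, and the answer is replaced by exactly one of those values. The sketch below assumes a hypothetical lookup table `option_to_triple` that links option text to a second triple; the claim does not specify how that linking is performed, and the first triple is omitted for brevity.

```python
# Hypothetical second triples: (second knowledge entity, attribute, attribute value).
second_triples = [
    ("verb tense", "form", "simple present"),
    ("verb tense", "form", "simple past"),
    ("verb tense", "form", "present continuous"),
]

# Assumed linking from each option's text to the index of its second triple.
option_to_triple = {"goes": 0, "went": 1, "is going": 2}

def encode_with_graph(texts, triples, linking):
    # Replace each text with the attribute value of its linked second triple.
    return [triples[linking[text]][2] for text in texts]

options = ["goes", "went", "is going"]
answer = "goes"

option_attribute_values = encode_with_graph(options, second_triples, option_to_triple)
answer_attribute_value = encode_with_graph([answer], second_triples, option_to_triple)[0]
print(option_attribute_values, answer_attribute_value)
```

Because the answer is itself one of the options, its attribute value is necessarily a member of the option attribute values, which matches the final limitation of the claim.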
  13. The computer device according to any one of claims 10 to 12, wherein performing word segmentation and splicing on the option attribute value and the answer attribute value to obtain the option answer representation vector comprises:
    performing word vectorization on the option attribute value and the answer attribute value to obtain a word-vectorized option attribute value and a word-vectorized answer attribute value; and
    splicing the word-vectorized option attribute value and the word-vectorized answer attribute value to obtain the option answer representation vector.
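The vectorize-then-splice step of claim 13 might look like the following. Word vectorization is reduced here to a vocabulary-index lookup, which is an assumption made for brevity; the helper names and the toy vocabulary are hypothetical, and a real system could use learned embeddings instead.

```python
def build_vocab(texts):
    # Assign each distinct word an integer id in order of first appearance.
    vocab = {}
    for text in texts:
        for word in text.split():
            vocab.setdefault(word, len(vocab))
    return vocab

def vectorize(text, vocab):
    # Map each word of an attribute value to its id.
    return [vocab[word] for word in text.split()]

option_attribute_values = ["simple present", "simple past"]
answer_attribute_value = "simple present"

vocab = build_vocab(option_attribute_values + [answer_attribute_value])
option_vectors = [vectorize(value, vocab) for value in option_attribute_values]
answer_vector = vectorize(answer_attribute_value, vocab)

# Splice: all option ids in order, followed by the answer ids.
option_answer_vector = [i for vec in option_vectors for i in vec] + answer_vector
print(option_answer_vector)
```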
  14. The computer device according to any one of claims 10 to 12, wherein performing vector splicing on the question stem representation vector and the option answer representation vector to obtain the question data comprises:
    splicing the question stem representation vector and the option answer representation vector by means of separators to obtain the question data, wherein the separators comprise a first placeholder and a second placeholder, and the splicing by means of the separators specifically comprises:
    placing the question stem representation vector between the first placeholder and the second placeholder, placing the second placeholder between the question stem representation vector and the option answer representation vector, and splicing the question stem representation vector and the option answer representation vector to obtain the question data.
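The layout described in claim 14 is: first placeholder, question stem, second placeholder, option answer sequence. A minimal sketch follows, assuming BERT-style `[CLS]` and `[SEP]` tokens as the two placeholders; the claim itself does not name the tokens, so those strings and the function name `splice` are assumptions.

```python
FIRST_PLACEHOLDER = "[CLS]"   # assumed token for the first placeholder
SECOND_PLACEHOLDER = "[SEP]"  # assumed token for the second placeholder

def splice(stem_tokens, option_answer_tokens):
    # The stem sits between the two placeholders; the second placeholder
    # separates the stem from the option answer sequence.
    return [FIRST_PLACEHOLDER, *stem_tokens, SECOND_PLACEHOLDER, *option_answer_tokens]

question_data = splice(
    ["she", "____", "to", "school"],
    ["simple", "present", "simple", "past", "simple", "present"],
)
print(question_data)
```

Keeping the stem and the option answer sequence in separate segments lets a pre-trained encoder distinguish the two parts of the input.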
  15. A computer device, comprising:
    at least one memory;
    at least one processor; and
    at least one program;
    wherein the program is stored in the memory, and the processor executes the at least one program to implement a knowledge classification method for multiple-choice questions, the knowledge classification method for multiple-choice questions comprising:
    obtaining multiple-choice question data to be classified, wherein the multiple-choice question data comprises question stem data, option data, and answer data;
    inputting the multiple-choice question data into a knowledge classification model, wherein the knowledge classification model is trained according to a training method of a knowledge classification model;
    performing feature extraction on the multiple-choice question data by means of the knowledge classification model to obtain feature vector information; and
    performing knowledge classification according to the feature vector information to obtain a knowledge point type;
    wherein the training method of the knowledge classification model comprises:
    obtaining original annotated data, wherein the original annotated data comprises question stem data, option data, and answer data;
    encoding the question stem data to obtain a question stem representation vector;
    encoding the option data and the answer data according to a preset knowledge graph to obtain an option attribute value and an answer attribute value;
    performing word segmentation and splicing on the option attribute value and the answer attribute value to obtain an option answer representation vector;
    performing vector splicing on the question stem representation vector and the option answer representation vector to obtain question data; and
    training a preset pre-trained model according to the question data to obtain the knowledge classification model, wherein the knowledge classification model is used to perform knowledge classification on a target question to obtain a knowledge point type.
  16. A storage medium, the storage medium being a computer-readable storage medium, wherein the computer-readable storage medium stores computer-executable instructions, and the computer-executable instructions are used to cause a computer to execute a training method of a knowledge classification model, the training method of the knowledge classification model comprising:
    obtaining original annotated data, wherein the original annotated data comprises question stem data, option data, and answer data;
    encoding the question stem data to obtain a question stem representation vector;
    encoding the option data and the answer data according to a preset knowledge graph to obtain an option attribute value and an answer attribute value;
    performing word segmentation and splicing on the option attribute value and the answer attribute value to obtain an option answer representation vector;
    performing vector splicing on the question stem representation vector and the option answer representation vector to obtain question data; and
    training a preset pre-trained model according to the question data to obtain the knowledge classification model, wherein the knowledge classification model is used to perform knowledge classification on a target question to obtain a knowledge point type.
  17. The storage medium according to claim 16, wherein encoding the question stem data to obtain the question stem representation vector comprises:
    preprocessing the question stem data by converting its English content to lowercase, to obtain a preliminary question stem sequence; and
    performing word segmentation on the preliminary question stem sequence to obtain the question stem representation vector.
  18. The storage medium according to claim 16, wherein, before encoding the option data and the answer data according to the preset knowledge graph to obtain the option attribute value and the answer attribute value, the training method of the knowledge classification model further comprises constructing the knowledge graph, which specifically comprises:
    obtaining preset knowledge points;
    constructing a first triple and a second triple according to the preset knowledge points; and
    constructing the knowledge graph according to the first triple and the second triple, wherein the first triple comprises a first knowledge entity, a relation, and a second knowledge entity, and the second triple comprises the second knowledge entity, an attribute, and an attribute value.
  19. The computer device according to any one of claims 16 to 18, wherein performing vector splicing on the question stem representation vector and the option answer representation vector to obtain the question data comprises:
    splicing the question stem representation vector and the option answer representation vector by means of separators to obtain the question data, wherein the separators comprise a first placeholder and a second placeholder, and the splicing by means of the separators specifically comprises:
    placing the question stem representation vector between the first placeholder and the second placeholder, placing the second placeholder between the question stem representation vector and the option answer representation vector, and splicing the question stem representation vector and the option answer representation vector to obtain the question data.
  20. A storage medium, the storage medium being a computer-readable storage medium, wherein the computer-readable storage medium stores computer-executable instructions, and the computer-executable instructions are used to cause a computer to execute a knowledge classification method for multiple-choice questions, the knowledge classification method for multiple-choice questions comprising:
    obtaining multiple-choice question data to be classified, wherein the multiple-choice question data comprises question stem data, option data, and answer data;
    inputting the multiple-choice question data into a knowledge classification model, wherein the knowledge classification model is trained according to a training method of a knowledge classification model;
    performing feature extraction on the multiple-choice question data by means of the knowledge classification model to obtain feature vector information; and
    performing knowledge classification according to the feature vector information to obtain a knowledge point type;
    wherein the training method of the knowledge classification model comprises:
    obtaining original annotated data, wherein the original annotated data comprises question stem data, option data, and answer data;
    encoding the question stem data to obtain a question stem representation vector;
    encoding the option data and the answer data according to a preset knowledge graph to obtain an option attribute value and an answer attribute value;
    performing word segmentation and splicing on the option attribute value and the answer attribute value to obtain an option answer representation vector;
    performing vector splicing on the question stem representation vector and the option answer representation vector to obtain question data; and
    training a preset pre-trained model according to the question data to obtain the knowledge classification model, wherein the knowledge classification model is used to perform knowledge classification on a target question to obtain a knowledge point type.
PCT/CN2022/090718 2021-12-15 2022-04-29 Model training method and apparatus, knowledge classification method and apparatus, and device and medium WO2023108991A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202111536048.0 2021-12-15
CN202111536048.0A CN114238571A (en) 2021-12-15 2021-12-15 Model training method, knowledge classification method, device, equipment and medium

Publications (1)

Publication Number Publication Date
WO2023108991A1 (en) 2023-06-22

Family

ID=80756448

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/090718 WO2023108991A1 (en) 2021-12-15 2022-04-29 Model training method and apparatus, knowledge classification method and apparatus, and device and medium

Country Status (2)

Country Link
CN (1) CN114238571A (en)
WO (1) WO2023108991A1 (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116955589A (en) * 2023-09-19 2023-10-27 山东山大鸥玛软件股份有限公司 Intelligent proposition method, system, proposition terminal and storage medium based on teaching material knowledge graph
CN117171654A (en) * 2023-11-03 2023-12-05 酷渲(北京)科技有限公司 Knowledge extraction method, device, equipment and readable storage medium

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114201603A (en) * 2021-11-04 2022-03-18 阿里巴巴(中国)有限公司 Entity classification method, device, storage medium, processor and electronic device
CN114238571A (en) * 2021-12-15 2022-03-25 平安科技(深圳)有限公司 Model training method, knowledge classification method, device, equipment and medium
CN115186780B (en) * 2022-09-14 2022-12-06 江西风向标智能科技有限公司 Discipline knowledge point classification model training method, system, storage medium and equipment

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111563166A (en) * 2020-05-28 2020-08-21 浙江学海教育科技有限公司 Pre-training model method for mathematical problem classification
CN112395858A (en) * 2020-11-17 2021-02-23 华中师范大学 Multi-knowledge point marking method and system fusing test question data and answer data
CN112818120A (en) * 2021-01-26 2021-05-18 北京智通东方软件科技有限公司 Exercise marking method and device, storage medium and electronic equipment
US20210150152A1 (en) * 2019-11-20 2021-05-20 Oracle International Corporation Employing abstract meaning representation to lay the last mile towards reading comprehension
CN113743083A (en) * 2021-09-06 2021-12-03 东北师范大学 Test question difficulty prediction method and system based on deep semantic representation
CN114238571A (en) * 2021-12-15 2022-03-25 平安科技(深圳)有限公司 Model training method, knowledge classification method, device, equipment and medium

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20210150152A1 (en) * 2019-11-20 2021-05-20 Oracle International Corporation Employing abstract meaning representation to lay the last mile towards reading comprehension
CN111563166A (en) * 2020-05-28 2020-08-21 浙江学海教育科技有限公司 Pre-training model method for mathematical problem classification
CN112395858A (en) * 2020-11-17 2021-02-23 华中师范大学 Multi-knowledge point marking method and system fusing test question data and answer data
CN112818120A (en) * 2021-01-26 2021-05-18 北京智通东方软件科技有限公司 Exercise marking method and device, storage medium and electronic equipment
CN113743083A (en) * 2021-09-06 2021-12-03 东北师范大学 Test question difficulty prediction method and system based on deep semantic representation
CN114238571A (en) * 2021-12-15 2022-03-25 平安科技(深圳)有限公司 Model training method, knowledge classification method, device, equipment and medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
"Master's Thesis", 25 July 2020, BEIJING FORESTRY UNIVERSITY, CN, article LIAO, ZIHUI: "Intelligent English Grammar Exercises System Based on Knowledge Graph", pages: 1 - 79, XP009546182, DOI: 10.26949/d.cnki.gblyu.2020.000421 *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116955589A (en) * 2023-09-19 2023-10-27 山东山大鸥玛软件股份有限公司 Intelligent proposition method, system, proposition terminal and storage medium based on teaching material knowledge graph
CN116955589B (en) * 2023-09-19 2024-01-30 山东山大鸥玛软件股份有限公司 Intelligent proposition method, system, proposition terminal and storage medium based on teaching material knowledge graph
CN117171654A (en) * 2023-11-03 2023-12-05 酷渲(北京)科技有限公司 Knowledge extraction method, device, equipment and readable storage medium
CN117171654B (en) * 2023-11-03 2024-02-09 酷渲(北京)科技有限公司 Knowledge extraction method, device, equipment and readable storage medium

Also Published As

Publication number Publication date
CN114238571A (en) 2022-03-25

Similar Documents

Publication Publication Date Title
WO2023108991A1 (en) Model training method and apparatus, knowledge classification method and apparatus, and device and medium
CN110019839B (en) Medical knowledge graph construction method and system based on neural network and remote supervision
CN108595708A (en) An exception information file classification method based on a knowledge graph
CN110110054A (en) A deep-learning-based method for obtaining question-answer pairs from unstructured text
CN106886580B (en) Image emotion polarity analysis method based on deep learning
CN111931517B (en) Text translation method, device, electronic equipment and storage medium
CN104809176A (en) Entity relationship extracting method of Zang language
CN115858758A (en) Intelligent customer service knowledge graph system with multiple unstructured data identification
CN114722069A (en) Language conversion method and device, electronic equipment and storage medium
CN116561538A (en) Question-answer scoring method, question-answer scoring device, electronic equipment and storage medium
WO2023159767A1 (en) Target word detection method and apparatus, electronic device and storage medium
CN113987147A (en) Sample processing method and device
CN112749556B (en) Multi-language model training method and device, storage medium and electronic equipment
CN114841146B (en) Text abstract generation method and device, electronic equipment and storage medium
CN118113855B (en) Ship test training scene question answering method, system, equipment and medium
CN114722774B (en) Data compression method, device, electronic equipment and storage medium
CN115270746A (en) Question sample generation method and device, electronic equipment and storage medium
CN111178080A (en) Named entity identification method and system based on structured information
CN114611529B (en) Intention recognition method and device, electronic equipment and storage medium
KR102455747B1 (en) System and method for providing fake news detection model using deep learning algorithm
CN115204300A (en) Data processing method, device and storage medium for text and table semantic interaction
CN115730051A (en) Text processing method and device, electronic equipment and storage medium
CN114662496A (en) Information identification method, device, equipment, storage medium and product
CN112015891A (en) Method and system for classifying messages of network inquiry platform based on deep neural network
Rojan et al. Natural Language Processing based Text Imputation for Malayalam Corpora

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22905759

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE