CN117556789A - Student comment generation method based on multi-level semantic mining - Google Patents

Student comment generation method based on multi-level semantic mining

Info

Publication number
CN117556789A
CN117556789A (application number CN202311550403.9A)
Authority
CN
China
Prior art keywords
text
comment
representing
student
sentence
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311550403.9A
Other languages
Chinese (zh)
Inventor
熊余
何承阳
蔡婷
黄容
储雯
王盈
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chongqing University of Post and Telecommunications
Original Assignee
Chongqing University of Post and Telecommunications
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chongqing University of Post and Telecommunications filed Critical Chongqing University of Post and Telecommunications
Priority to CN202311550403.9A priority Critical patent/CN117556789A/en
Publication of CN117556789A publication Critical patent/CN117556789A/en
Pending legal-status Critical Current

Classifications

    • G06F 40/166 — Handling natural language data; text processing; editing, e.g. inserting or deleting
    • G06F 18/22 — Pattern recognition; analysing; matching criteria, e.g. proximity measures
    • G06F 18/23213 — Pattern recognition; non-hierarchical clustering using statistics or function optimisation, with a fixed number of clusters, e.g. K-means clustering
    • G06F 40/18 — Handling natural language data; editing of tables, e.g. spreadsheets, using ruled lines
    • G06F 40/20 — Handling natural language data; natural language analysis
    • G06F 40/30 — Handling natural language data; semantic analysis
    • G06N 3/0455 — Neural networks; combinations of networks; auto-encoder and encoder-decoder networks
    • G06N 3/09 — Neural network learning methods; supervised learning
    • G06Q 50/205 — ICT specially adapted for education; education administration or guidance
    • Y02D 10/00 — Climate change mitigation in ICT; energy efficient computing, e.g. low power processors, power management or thermal management


Abstract

The invention relates to a student comment generation method based on multi-level semantic mining, belonging to the technical field of semantic mining, which specifically comprises the following steps: S1: obtain student data, generate table-descriptive text through a text generation method, and compute the conditional probability of the predicted text; S2: align the structured data with the reference natural-language text by considering sequence-level semantic similarity; S3: extract the table information contained in the model's predicted text, match and compare it against the input table data, and check the accuracy and reliability of the predicted text; S4: obtain fluent, coherent, and accurate comment text through semantic-similarity prediction and sentence-order prediction.

Description

Student comment generation method based on multi-level semantic mining
Technical Field
The invention belongs to the technical field of semantic mining, and relates to a student comment generation method based on multi-level semantic mining.
Background
With the development of education informatization, researchers increasingly value innovative informatized evaluation tools, promote the use of digital student portfolios in assessment, and seek to move away from evaluation modes that treat examination results as the sole standard. However, when applied to concrete educational evaluation scenarios, student comment generation models are usually devoted to precisely mining student data; although this ensures the authenticity and reliability of the generated comments, it also yields evaluation text that is incoherent and grammatically weak. Such poorly written evaluations make it difficult for readers to understand a student's overall quality and development, defeating the original purpose of establishing a student evaluation scheme. How to resolve these grammar and fluency problems in student evaluation writing is therefore an urgent open problem.
Today, large-scale pre-trained language models are widely used, but when deployed in different scenarios they must be adjusted and improved according to the characteristics of each scenario. In the table-to-text generation task, because the input data and output form differ from other tasks, a model that aims at precisely mining the input information preserves the fidelity of the predicted text to the input data but can produce output text that is incoherent and grammatically weak. Over the past few years, the rapid development of deep learning has driven progress in natural language generation, and many neural-network-based generation models have achieved good results in practice. Although existing text generation models can produce text with good intra-sentence continuity, they struggle to plan coherent sentences across a whole document, and the generated text may be repetitive, incoherent, and grammatically poor. These phenomena arise because the model cannot adequately capture the input semantic information and convert it into a sentence-level representation. Many techniques aim to help models generate fluent, coherent text, such as convolutional neural networks, encoder-decoder models, pre-trained language models, and data augmentation, all of which target the model's sentence-level representation capability. Sentence-level representation is an important research direction in natural language generation: it converts the semantic and grammatical information of natural-language sentences into vector representations that a computer can understand and manipulate, enabling a generation model to produce more natural, fluent, and accurate text. Sentence-level representation therefore helps improve the performance and reliability of natural language generation in practical applications and can contribute substantially to the quality of generated student comments.
The latest deep-learning-based natural language generation methods began around 2015 and use the highly successful encoder-decoder architecture for end-to-end training, unlike traditional pipeline models. This type of model requires no explicit modeling of any content-planning task and involves no manually partitioned sub-problems; the intermediate operations are carried out inside the neural network by the encoder-decoder architecture. Specifically, raw data is fed in at the input end, a prediction is produced at the output end after processing, the prediction is compared with the reference content, the resulting error is back-propagated, and the parameters of each layer are adjusted until the model converges. For example, the model proposed in [LUONG M T, PHAM H, MANNING C D. Effective approaches to attention-based neural machine translation [C]. Empirical Methods in Natural Language Processing, Lisbon, Portugal, 2015:1412-1421] uses an encoder to convert an input sequence into a fixed-length representation vector, which a decoder then converts into the target output sequence. Such models eliminate costly and error-prone data annotation work; by reducing manual preprocessing and increasing the overall fit of the model, they improve the efficiency with which systems solve the text generation problem, offer strong learning ability and extensibility, have become the mainstream models for data-to-text generation tasks, and show good application prospects in fields such as weather forecasting, biography writing, and case-report generation.
With the support of large-scale datasets and end-to-end models, natural language generation models can produce informative and fluent text. However, collecting large-scale labeled datasets is not always feasible in every field, and limited training data is one of the important factors behind poor model performance. With the development of deep learning, a large number of pre-trained models have entered public view; through pre-training on large-scale datasets they learn a wealth of linguistic knowledge, providing rich prior knowledge for downstream tasks and making them applicable to target domains with little labeled data. Consequently, much research in natural language generation performs transfer learning via pre-trained models, which reduces the cost of training downstream tasks and, through fine-tuning, speeds up convergence of downstream optimization. The Knowledge-grounded Pre-Training model (KGPT) proposed in [CHEN W, SU Y, YAN X, et al. KGPT: Knowledge-grounded Pre-Training for data-to-text generation [C]. Empirical Methods in Natural Language Processing, Punta Cana, Dominican Republic, 2020:8635-8648] annotates input data by remote supervision, matching with only a few samples (few-shot) the performance of baseline models trained on large annotated datasets. The model proposed in [CHANG E, SHEN X, ZHU D, et al. Neural data-to-text generation with LM-based text augmentation [C]. European Chapter of the Association for Computational Linguistics, 2021:758-768] expands the size of the text samples through slot-value substitution and text-sample augmentation, then matches forms with structured text, effectively alleviating the scarcity of form-text data pairs. The model presented in [CHEN Z, EAVANI H, CHEN W, et al. Few-shot NLG with pre-trained language model [C]. Association for Computational Linguistics, Seattle, Washington, USA, 2020:183-190] combines the candidate content in the input form with the generated vocabulary of the language model via a pointer network and GPT-2 to generate accurate text; the authors also construct a multi-domain table dataset to demonstrate the model's generality in few-shot scenarios. The structure-aware long-text generation model proposed in [XING X, WAN X. Structure-aware pre-training for table-to-text generation [C]. Association for Computational Linguistics, Bangkok, Thailand, 2021:2273-2278] produces long, coherent text by modeling the sentence and discourse levels, further improving the pre-trained model's performance on the table-to-text generation task by stacking three self-supervised tasks: table language modeling, adjacent-region prediction, and context reconstruction.
However, current research still has the following drawbacks. (1) Even encoder-decoder models that perform well leave room for improvement: the encoder may lose important information when converting the input text into a fixed-length vector, and performance degrades when processing complex contextual information because of the complexity, variability, and long-range dependencies of the input text. (2) When generating student-portfolio comments, repeated text and meaningless words may appear, and a large amount of training data is required to avoid problems such as overfitting. This is because the model may tend to reuse certain common words or sentence patterns during learning, and such repetition leads to monotonous comments lacking diversity. In addition, generated comments may contain meaningless words or sentences because the model struggles to understand the context and semantics of the whole comment during learning.
Disclosure of Invention
Therefore, the purpose of the invention is to provide a student comment generation method based on multi-level semantic mining that remedies the monotony and weak grammar of student comments generated in educational scenarios and improves the sentence fluency of those comments.
In order to achieve the above purpose, the present invention provides the following technical solutions:
a student comment generation method based on multi-level semantic mining specifically comprises the following steps:
s1: student data is obtained, a table descriptive text is generated through a text generation method, and conditional probability of a predicted text is obtained;
s2: considering the semantic similarity of the sequence level, aligning the structured data with the reference natural language text;
s3: extracting information contained in a form in the model predictive text, matching and comparing the information with input form data, and checking the accuracy and reliability of the predictive text;
s4: and obtaining fluent and coherent accuracy comment text through semantic similarity prediction and sentence sequence prediction.
Further, in step S1, each student record in the input data is represented as K attribute-value pairs, and the conditional probability of mapping the original table to the predicted text in an autoregressive manner is:

P(Y \mid S) = \prod_{t=1}^{L} P(y_t \mid y_{<t}, S)

where P(Y|S) is the probability of the generated comment obtained with the structured student data as input; y_t is the target text at the t-th time step and y_{<t} the target text before step t; each piece of student information in the input data is represented as K attribute-value pairs, so the structured student data is written S = {(a_i, v_i)}_{i=1}^{K}, where i indexes the attribute-value pairs; and the comment text of corresponding length L is written Y = y_1, y_2, ..., y_L. The training objective of the model's output part is to maximize the likelihood of the reference text, i.e., to minimize the negative log-likelihood; the comment-generation loss L_LM is:

L_{LM} = -\sum_{t=1}^{L} \log P(y_t \mid y_{<t}, S)
further, in step S2, the aligning the structured data with the reference natural language text by considering the semantic similarity at the sequence level specifically includes:
encoding an input form into a sequence, splicing the sequence and a teacher comment text into a new text X, and matching the X with a generated text Y; generating text y= (Y) 1 ,y 2 ,y 3 ,...,y n ) And Y '= (Y' 1 ,y′ 2 ,y′ 3 ,...,y′ m ) In a discrete distribution, the loss function of text alignment is:
d(y i ,y′ j )=||y i -y′ j || 2
wherein,and->Discrete intervals of distribution, u, y and y', respectively i Representing the desire to generate a text probability distribution, delta yi Representing standard deviation, sigma, of probability distribution of generated text j Representing a desire to represent a joint text probability distribution, +.>Representing standard deviation of joint text probability distribution; u is the joint distribution of μ and σ, d (y i ,y′ j ) Is from y to y'Cost of y i Generated text representing the ith time step, y' j The joint text representing the jth time step, m representing the length of the joint text, n representing the length of the generated text.
Further, step S3 extracts the table information contained in the model's predicted text, matches and compares it against the input table data, and checks the accuracy and reliability of the predicted text; it specifically comprises the following steps:

extract the table information contained in the text and match it against the input table data: from the model-generated text y_{1:T}, extract all candidate fields f_i and corresponding values v_i, represented as records \hat{R} = {(f_i, v_i)}; the original input form records R serve as structured pseudo labels for the information-extraction module to learn from, with loss

L_{SR} = -\sum_{i=1}^{N} \sum_{j=1}^{T_i} \sum_{k=1}^{K} r_{i,j,k} \log \hat{r}_{i,j,k}

where N is the number of sentences, T_i the length of each sentence, K the number of tags, r_{i,j,k} the original input-form record label, and \hat{r}_{i,j,k} the predicted record.
Further, in step S4, fluent, coherent, and accurate comment text is obtained through semantic-similarity prediction and sentence-order prediction, which specifically comprises the following steps:

cosine similarity is used to predict the semantic similarity between every pair of sentences; the closer the result is to 1, the closer the semantics of the two sentences:

Sim(y_i, y_j) = \frac{\sum_{k=1}^{N} h_k^{y_i} h_k^{y_j}}{\sqrt{\sum_{k=1}^{N} (h_k^{y_i})^2} \sqrt{\sum_{k=1}^{N} (h_k^{y_j})^2}}

where h_k^{y_i} and h_k^{y_j} are the representations of the k-th words of the two sentences (y_i, y_j), N is the maximum total number of words in a sentence, and L_Sim is the semantic-similarity prediction loss;
for sentence sequential prediction, a special character, "< sen >" is first inserted at the end of each sentence, the corresponding token of the special characterThe method is used for deducing the speaking relation with other sentences, firstly, a known ordered text is used for guiding model training, an existing teacher writes comments as a reference text of the task, and the disturbed teacher writes comments as input, and the specific calculation process is as follows:
wherein L is D Loss for sequential prediction, q ij To judge the predictive score in order, g ij Is a label, and represents y when the value is 1 i In y j Previously, the other cases were all 0, W o Is a parameter matrix;
the K-means clustering algorithm is used for judging the sentence sequence of the predicted text,firstly, selecting k sentences as the centers of the clusters, calculating the distances from all samples to the centers of the clusters, dividing the clusters where the samples are located according to the distances, updating the centers of the clusters by calculating the distance average value, and continuously repeating the process until the centers of the clusters are not changed; all samples y and cluster center C j Distance d (y, C) j ) Also calculated by cosine distance formula, then cluster center C is updated by argmin function j The loss function of the algorithm is shown as follows:
the loss function of the sentence iterative-interaction module is:

L_{SEN} = L_{Sim} + L_{Dis}
and the updated total loss of the model is:

L = L_{LM} + \lambda_1 L_{SR} + \lambda_2 L_{TM} + \lambda_3 L_{SEN}

where \lambda_1, \lambda_2, and \lambda_3 are the scale-factor hyperparameters of the table-reconstruction loss L_SR, the table-text content matching loss L_TM, and the sentence iterative-interaction loss L_SEN, used to control their relative importance.
The invention has the beneficial effects that:
(1) In the student comment generation task, the quality of the final comment content is improved in terms of coherence, accuracy, and readability by working at multiple levels of the model's predicted comment text: word-level, sentence-level, and comment-level representation.
(2) Considering that comments should fit each student as closely as possible rather than reusing one-size-fits-all template content, the method encodes the structured input form into a sequence, splices it with the teacher comment text to obtain a new text, and uses an optimal-transport method to match the new text against the generated text.
(3) The invention distinguishes sentences by inserting a special character at the end of each sentence and trains the model with known ordered text, thereby improving the decoder's ability to simultaneously capture high-level representations of sentence order.
(4) The invention uses a K-means clustering algorithm to judge the sentence order of the predicted text; in this way the sentences of the predicted text can be judged and rewritten without supervision, improving the accuracy and coherence of the model output.
Additional advantages, objects, and features of the invention will be set forth in part in the description which follows and in part will become apparent to those having ordinary skill in the art upon examination of the following or may be learned from practice of the invention. The objects and other advantages of the invention may be realized and obtained by means of the instrumentalities and combinations particularly pointed out in the specification.
Drawings
For the purpose of making the objects, technical solutions, and advantages of the present invention more apparent, the present invention is described in detail below with reference to the accompanying drawings, in which:
FIG. 1 is a schematic diagram of a student comment generation method for multi-level semantic mining;
FIG. 2 is a table-text matching constraint schematic;
FIG. 3 is a schematic diagram of semantic similarity prediction;
fig. 4 is a schematic diagram of sentence-order prediction.
Detailed Description
Other advantages and effects of the present invention will become apparent to those skilled in the art from the following disclosure, which describes embodiments of the invention with reference to specific examples. The invention may also be practiced or applied through other, different embodiments, and the details in this specification may be modified or varied in various ways without departing from the spirit and scope of the invention. It should be noted that the illustrations provided in the following embodiments merely illustrate the basic idea of the invention schematically, and the following embodiments and the features in them may be combined with one another in the absence of conflict.
Wherein the drawings are for illustrative purposes only and are shown in schematic, non-physical, and not intended to limit the invention; for the purpose of better illustrating embodiments of the invention, certain elements of the drawings may be omitted, enlarged or reduced and do not represent the size of the actual product; it will be appreciated by those skilled in the art that certain well-known structures in the drawings and descriptions thereof may be omitted.
The same or similar reference numbers in the drawings of the embodiments of the invention correspond to the same or similar components. In the description of the invention, it should be understood that terms such as "upper", "lower", "left", "right", "front", and "rear" indicate orientations or positional relationships based on those shown in the drawings; they are used only for convenience of description and simplification, do not indicate or imply that the referenced device or element must have a specific orientation or be constructed and operated in a specific orientation, and are therefore merely illustrative and should not be construed as limiting the invention. The specific meaning of these terms can be understood by those of ordinary skill in the art according to the specific circumstances.
Referring to fig. 1, a student comment generation method based on multi-level semantic mining is provided. The method mainly targets student evaluation generation tasks in educational scenarios. The technical problem to be solved is that generated student evaluations suffer from weak grammar and poor fluency.
Step one: obtain the conditional probability of the predicted text. In the task of generating comments from structured student data, each row of the input data represents a student entity, each column is an attribute of the student, and each student has several attributes. Each piece of student information in the input data is represented as K attribute-value pairs, written S = {(a_i, v_i)}_{i=1}^{K}, and the corresponding comment text of length L is written Y = y_1, y_2, ..., y_L. First, table-descriptive text is produced by a generation model; the conditional probability of mapping the original table input to the predicted text output in an autoregressive manner is:

P(Y \mid S) = \prod_{t=1}^{L} P(y_t \mid y_{<t}, S)

where y_t is the target text at the t-th time step and y_{<t} the target text before time t. The training objective of the model's output part is to maximize the likelihood of the reference text, i.e., to minimize the negative log-likelihood, so the comment-generation loss L_LM is defined as:

L_{LM} = -\sum_{t=1}^{L} \log P(y_t \mid y_{<t}, S)
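As a minimal illustration of this training objective (not part of the patent's disclosure), the following sketch computes L_LM under teacher forcing; the use of PyTorch and the tensor shapes are assumptions for illustration only.

```python
# A minimal sketch of the Step-1 objective: the decoder factorizes P(Y|S)
# autoregressively and training minimizes the negative log-likelihood of the
# reference comment. PyTorch and the tensor shapes are illustrative assumptions.
import torch
import torch.nn.functional as F

def comment_nll(logits: torch.Tensor, target_ids: torch.Tensor) -> torch.Tensor:
    """L_LM = -sum_t log P(y_t | y_<t, S).

    logits:     (L, V) decoder outputs under teacher forcing, one row per step
    target_ids: (L,)   token ids of the reference comment Y
    """
    log_probs = F.log_softmax(logits, dim=-1)                 # log P(. | y_<t, S)
    token_ll = log_probs.gather(1, target_ids.unsqueeze(1)).squeeze(1)
    return -token_ll.sum()                                    # negative log-likelihood

L, V = 12, 30000                                              # toy sizes
loss = comment_nll(torch.randn(L, V), torch.randint(0, V, (L,)))
```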
for structural chemogenetic dataThe predicted comment text Y with the length L needs to achieve a coherent comment content generation target on the basis of ensuring accurate understanding of input data, and a text obtained by converting a form of the predicted text is matched with a teacher comment text in content to obtain a new coherent text; then extracting information, matching and comparing the information with the student input form data to obtain student comment text with higher accuracy and usability; and then, carrying out sentence iterative interaction through predicting the similarity among sentences and sequence, and finally obtaining accurate, smooth and coherent predicted comment text, wherein the whole flow is shown in figure 1.
Step two: improve the coherence of the text without sacrificing the accuracy of the predicted comment. The goal of the traditional content-matching task is to align structured data with reference natural-language text to generate descriptive text that conforms to context and grammar. Some form-to-text studies focus only on matching the generated text to the reference text, or use a pointer network to copy form content or generate from a corpus. The former can cost fidelity, while the latter can destroy the grammatical and semantic features in the prior knowledge of the pre-trained model, thereby degrading the coherence of the generated text.
Considering that student comments should fit each student's actual situation as closely as possible rather than being one-size-fits-all template content, the generated comment should contain the more important information in the input form; the output is therefore expected to stay close to both the form content and the existing teacher comment text, and a form-text content matching module is designed to this end. The structured input form is first encoded into a sequence; for example, "grade:7,homework time:5hours" is serialized into the sentence "grade is 7, homework time is 5 hours". This sequence is then spliced with the teacher comment text through a separator token to form the new text X, and X is matched against the generated text Y.
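The serialization and splicing above can be sketched as follows; the "<s>" separator and the "attribute is value" template are assumptions, since the patent shows only the example serialization and leaves the separator unspecified.

```python
# Sketch of the table serialization and splicing described above. The "<s>"
# separator token and the "attribute is value" template are assumptions.
def serialize_table(pairs: dict[str, str]) -> str:
    return ", ".join(f"{attr} is {value}" for attr, value in pairs.items())

def build_joint_text(pairs: dict[str, str], teacher_comment: str,
                     sep: str = "<s>") -> str:
    return f"{serialize_table(pairs)} {sep} {teacher_comment}"

print(build_joint_text({"grade": "7", "homework time": "5 hours"},
                       "The student completes homework on time."))
# -> grade is 7, homework time is 5 hours <s> The student completes homework on time.
```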
Seq2Seq models are mostly trained by maximum likelihood estimation, i.e., predicting the next word given the preceding partial words, but this creates a mismatch between the training and inference stages whose errors accumulate along the generation trajectory and can, in practice, yield unstable results. Inspired by Optimal Transport (OT), this task considers semantic similarity at the sequence level rather than the word level and minimizes the transport distance from the generated text Y to the joint text Y', as shown in fig. 2.
Regard the generated text Y = (y_1, y_2, ..., y_n) and the joint text Y' = (y'_1, y'_2, ..., y'_m) as discrete distributions supported on \mu and \sigma respectively; the OT distance is the minimum cost of transporting y from the distribution space \mu to \sigma. The loss of this module is computed as:

\mu = \sum_{i=1}^{n} u_{y_i} \delta_{y_i}, \qquad \sigma = \sum_{j=1}^{m} u_{y'_j} \delta_{y'_j}

L_{TM} = \min_{T \in U(\mu, \sigma)} \sum_{i=1}^{n} \sum_{j=1}^{m} T_{ij} \, d(y_i, y'_j)

d(y_i, y'_j) = \|y_i - y'_j\|^2

where \delta_{y_i} and \delta_{y'_j} are the point masses of the discrete distributions of y and y', u_{y_i} and u_{y'_j} their probability weights, U the set of joint distributions of \mu and \sigma, and d(y_i, y'_j) the cost of transporting y to y'.
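A common way to compute such an OT loss in practice is entropic regularization with Sinkhorn iterations; the sketch below follows that route. The patent fixes only the cost d(y_i, y'_j) = ||y_i - y'_j||^2, so the uniform marginals and the Sinkhorn solver here are assumptions.

```python
# A Sinkhorn-style sketch of the sequence-level OT matching loss L_TM over
# token embeddings. Uniform marginals and the entropic solver are assumptions.
import torch

def ot_matching_loss(Y: torch.Tensor, Yp: torch.Tensor,
                     eps: float = 0.1, iters: int = 50) -> torch.Tensor:
    """Y: (n, d) embeddings of the generated text; Yp: (m, d) of the joint text."""
    n, m = Y.size(0), Yp.size(0)
    cost = torch.cdist(Y, Yp, p=2) ** 2        # d(y_i, y'_j) = ||y_i - y'_j||^2
    mu = torch.full((n,), 1.0 / n)             # weights u_{y_i} (uniform)
    sigma = torch.full((m,), 1.0 / m)          # weights u_{y'_j} (uniform)
    K = torch.exp(-cost / eps)                 # Gibbs kernel
    a = torch.ones(n)
    for _ in range(iters):                     # Sinkhorn fixed-point iterations
        b = sigma / (K.t() @ a)
        a = mu / (K @ b)
    T = a.unsqueeze(1) * K * b.unsqueeze(0)    # transport plan T in U(mu, sigma)
    return (T * cost).sum()                    # approximate L_TM

loss = ot_matching_loss(torch.randn(8, 64), torch.randn(10, 64))
```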
Step three: extract the attribute and attribute-value information in the new predicted text for accuracy verification, and train with the original input form records as pseudo labels so that the output text conforms more closely to the form content. Information extraction derives structured data from natural-language text; common methods include named entity recognition (identifying entities such as person and place names), relation extraction (identifying relations between entities), dependency parsing (identifying sentence structure, grammatical relations, and dependencies between words), and event extraction (identifying specific events and extracting their attributes and arguments).
The information-extraction module further extracts the table information contained in the model's predicted text, matches and compares it against the input table data, and checks the accuracy and reliability of the predicted text to increase the usability of the model output; named entity recognition is therefore needed to extract the entity information in the model's output text. NLTK, a natural-language-processing toolkit, provides many text-processing functions such as tokenization, part-of-speech tagging, and named entity recognition, and is applied here to modeling and analysis of the NLG task. A sequence classifier is trained on the entity classes annotated in the training set and then used to predict new text and identify the named entities in it. During training, NLTK splits the text into a series of tokens, such as words and punctuation, and takes them as input features of the classifier; it also uses feature extractors, such as part-of-speech tags and context information, to improve the classifier's accuracy. Here, the NLTK tools extract from the model-generated text y_{1:T} all candidate fields f_i and corresponding values v_i, represented as records \hat{R} = {(f_i, v_i)}; the original input form records R serve as structured pseudo labels for the information-extraction module to learn from. The loss is computed as:

L_{SR} = -\sum_{i=1}^{N} \sum_{j=1}^{T_i} \sum_{k=1}^{K} r_{i,j,k} \log \hat{r}_{i,j,k}

where N is the number of sentences, T_i the length of each sentence, K the number of tags, r_{i,j,k} the original input-form record label, and \hat{r}_{i,j,k} the predicted record.
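A toy sketch of the record-extraction idea with NLTK follows. The actual module trains a sequence classifier; the rule-based pattern over POS tags and the field vocabulary below are simplifying assumptions made only to show the (field, value) record format.

```python
# Toy sketch: extract (field, value) records from a generated comment so they
# can be compared against the input table. A POS-tag rule stands in for the
# trained sequence classifier described in the text.
import nltk

for pkg in ("punkt", "averaged_perceptron_tagger"):
    nltk.download(pkg, quiet=True)   # resource names may vary across NLTK versions

FIELDS = {"grade", "homework"}       # hypothetical field vocabulary from the input table

def extract_records(text: str) -> list[tuple[str, str]]:
    tagged = nltk.pos_tag(nltk.word_tokenize(text))
    records = []
    for i, (word, _tag) in enumerate(tagged):
        if word.lower() in FIELDS:
            # naive pattern: a known field followed closely by a numeric value
            for nxt, nxt_tag in tagged[i + 1 : i + 4]:
                if nxt_tag == "CD":  # cardinal number = candidate value
                    records.append((word.lower(), nxt))
                    break
    return records

print(extract_records("The student is in grade 7 and spends 5 hours on homework."))
# -> [('grade', '7')]
```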
Step four: through semantic-similarity prediction and sentence-order prediction, measure the semantic similarity between sentences with cosine similarity on the basis of the semantic representation of each word, and minimize the distance between similar sentences in the text sequence. When writing, humans always understand the semantics of the existing content and organize the correct discourse relations; student comments likewise require strict sentence order and clear logic, so capturing high-level representations of the input-sequence context is essential for predicting subsequent generation. A typical generation model predicts the next word by attending to all prefix words, which trains a unidirectional decoder that ignores co-occurrence information beyond the word level. To let the model capture high-level representations and improve the decoder so that it can represent sentence-level prefix information, a sentence iterative-interaction module based on semantic-similarity prediction and sentence-order prediction is proposed; the semantic-similarity prediction model is diagrammed in fig. 3.
For semantic-similarity prediction, the representation of the last word of each sentence can aggregate the semantic information of the current sentence. Since semantically similar sentences have nearby representations in vector space, the similarity of any two sentences (y_i, y_j) can be predicted from their corresponding representations. Cosine similarity is used to predict the semantic similarity between every pair of sentences; the closer the result is to 1, the closer the semantics of the two sentences:

Sim(y_i, y_j) = \frac{\sum_{k=1}^{N} h_k^{y_i} h_k^{y_j}}{\sqrt{\sum_{k=1}^{N} (h_k^{y_i})^2} \sqrt{\sum_{k=1}^{N} (h_k^{y_j})^2}}

where h_k^{y_i} and h_k^{y_j} are the representations of the k-th words of the two sentences (y_i, y_j), N is the maximum total number of words in a sentence, and L_Sim is the semantic-similarity prediction loss. This method improves the encoder's ability to understand the meaning of prefix sentences.
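A minimal sketch of this pairwise similarity computation (not part of the patent), assuming each sentence is summarized by one fixed-size vector such as its last-token representation:

```python
# Minimal sketch of pairwise semantic-similarity prediction: two sentence
# vectors are compared with cosine similarity; values near 1 mean near-identical
# semantics. The 128-dim toy vectors are illustrative.
import torch
import torch.nn.functional as F

def sentence_similarity(h_i: torch.Tensor, h_j: torch.Tensor) -> torch.Tensor:
    """h_i, h_j: (d,) sentence representations; returns a value in [-1, 1]."""
    return F.cosine_similarity(h_i.unsqueeze(0), h_j.unsqueeze(0)).squeeze(0)

h_a, h_b = torch.randn(128), torch.randn(128)
print(float(sentence_similarity(h_a, h_b)))   # closer to 1 => closer semantics
```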
The sentence-order prediction model is diagrammed in fig. 4. For sentence-order prediction, a special character "<sen>" is inserted at the end of each sentence to distinguish the sentences; its representation h_i^{<sen>} can be used to infer the discourse relations with other sentences and, compared with the last-word representation, attends more to the relations with other sentences. Sentence order is determined by learning sentence representations pairwise to judge whether the relative order of two sentences is correct. Without reference text, an unsupervised approach such as clustering, an autoencoder, or a variational autoencoder could be used directly to predict sentence order, but the predicted order would not be unique and the accuracy would be relatively low. The task therefore first uses known ordered text to guide model training, taking existing teacher-written comments as the reference text and shuffled teacher-written comments as input:

q_{ij} = \sigma\!\left(W_o \left[ h_i^{<sen>}; h_j^{<sen>} \right]\right)

L_D = -\sum_{i \neq j} \left( g_{ij} \log q_{ij} + (1 - g_{ij}) \log (1 - q_{ij}) \right)

where L_D is the order-prediction loss, q_ij the predicted score for judging the order, g_ij a label that is 1 when y_i precedes y_j and 0 otherwise, and W_o a parameter matrix.
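A sketch of the pairwise order scorer consistent with the definitions above; the sigmoid head over concatenated <sen> representations and the binary cross-entropy pairing of q_ij with g_ij follow the reconstruction given here and are assumptions, not a verbatim implementation.

```python
# Sketch of the pairwise order scorer: concatenated <sen> representations are
# scored by a linear map playing the role of W_o; BCE against g_ij is assumed.
import torch
import torch.nn as nn
import torch.nn.functional as F

class OrderScorer(nn.Module):
    def __init__(self, d: int):
        super().__init__()
        self.w_o = nn.Linear(2 * d, 1)            # plays the role of the matrix W_o

    def forward(self, h_i: torch.Tensor, h_j: torch.Tensor) -> torch.Tensor:
        pair = torch.cat([h_i, h_j], dim=-1)
        return torch.sigmoid(self.w_o(pair))      # q_ij in (0, 1)

scorer = OrderScorer(d=128)
q_ij = scorer(torch.randn(128), torch.randn(128))
g_ij = torch.ones(1)                              # label: y_i precedes y_j
loss = F.binary_cross_entropy(q_ij, g_ij)         # one pair's contribution to L_D
```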
This supervised learning improves the decoder's ability to simultaneously capture high-level representations of sentence order; unsupervised order prediction is then applied to the student comments predicted by the model. A K-means clustering algorithm judges the sentence order of the predicted text: first select k sentences as cluster centers, compute the distances from all samples to the cluster centers, assign each sample to a cluster by distance, update the cluster centers by averaging, and repeat until the centers no longer change. The distance d(y, C_j) between each sample y and cluster center C_j is also computed with the cosine-distance formula, and the cluster centers C_j are then updated through an argmin function; the loss of the algorithm is:

C_j = \arg\min_{C} \sum_{y \in C_j} d(y, C), \qquad L_{Dis} = \sum_{j=1}^{k} \sum_{y \in C_j} d(y, C_j)
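A NumPy sketch of this cosine-distance K-means follows; initialization, ties, and convergence handling are kept deliberately simple and are assumptions beyond the description above.

```python
# NumPy sketch of cosine-distance K-means for the unsupervised ordering check.
import numpy as np

def cosine_dist(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    a_n = a / np.linalg.norm(a, axis=-1, keepdims=True)
    b_n = b / np.linalg.norm(b, axis=-1, keepdims=True)
    return 1.0 - a_n @ b_n.T                       # (num_samples, k) distances

def kmeans_cosine(Y: np.ndarray, k: int, iters: int = 100) -> np.ndarray:
    rng = np.random.default_rng(0)
    centers = Y[rng.choice(len(Y), size=k, replace=False)]   # k seed sentences
    for _ in range(iters):
        assign = cosine_dist(Y, centers).argmin(axis=1)      # nearest center C_j
        new_centers = np.stack([
            Y[assign == j].mean(axis=0) if np.any(assign == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):                # centers stopped moving
            break
        centers = new_centers
    return assign

labels = kmeans_cosine(np.random.randn(20, 64), k=3)         # cluster per sentence vector
```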
thus, the loss function of the statement iteration interaction module is defined as:
L SEN =L sim +L Dis
finally, updating the total loss function of the model:
L=L PFSRG +L SRGSR =L LM1 L SR2 L TM3 L SEN
wherein lambda is 1 、λ 2 、λ 3 Reconstruction loss L as a table SR Table-text content matching loss L TM The scale factor hyper-parameters of statement iteration interaction loss are used to control their relative importance.
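Combining the four losses is then a weighted sum, sketched below; the numeric lambda defaults are illustrative assumptions, since the patent leaves them as tunable hyperparameters.

```python
# Sketch of the combined objective L = L_LM + λ1·L_SR + λ2·L_TM + λ3·L_SEN.
def total_loss(l_lm: float, l_sr: float, l_tm: float, l_sen: float,
               lam1: float = 1.0, lam2: float = 0.5, lam3: float = 0.5) -> float:
    return l_lm + lam1 * l_sr + lam2 * l_tm + lam3 * l_sen
```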
Those of ordinary skill in the art will appreciate that all or some of the steps in the methods of the above embodiments may be implemented by a program to instruct related hardware, where the program may be stored in a computer readable storage medium, where the program may be executed to implement the steps of the method, where the storage medium includes: ROM/RAM, magnetic disks, optical disks, etc.
Finally, it is noted that the above embodiments are only for illustrating the technical solution of the present invention and not for limiting the same, and although the present invention has been described in detail with reference to the preferred embodiments, it should be understood by those skilled in the art that modifications and equivalents may be made thereto without departing from the spirit and scope of the present invention, which is intended to be covered by the claims of the present invention.

Claims (5)

1. A student comment generation method based on multi-level semantic mining, characterized in that the method specifically comprises the following steps:
S1: obtain student data, generate table-descriptive text through a text generation method, and compute the conditional probability of the predicted text;
S2: align the structured data with the reference natural-language text by considering sequence-level semantic similarity;
S3: extract the table information contained in the model's predicted text, match and compare it against the input table data, and check the accuracy and reliability of the predicted text;
S4: obtain fluent, coherent, and accurate comment text through semantic-similarity prediction and sentence-order prediction.
2. The student comment generation method based on multi-level semantic mining according to claim 1, characterized in that in step S1 the conditional probability of mapping the original table to the predicted text in an autoregressive manner is:

P(Y \mid S) = \prod_{t=1}^{L} P(y_t \mid y_{<t}, S)

where P(Y|S) is the probability of the generated comment obtained with the structured student data as input; y_t is the target text at the t-th time step and y_{<t} the target text before step t; each piece of student information in the input data is represented as K attribute-value pairs, the structured student data is written S = {(a_i, v_i)}_{i=1}^{K} with i indexing the attribute-value pairs, and the comment text of corresponding length L is written Y = y_1, y_2, ..., y_L; the training objective of the model's output part is to maximize the likelihood of the reference text, i.e., to minimize the negative log-likelihood, and the comment-generation loss L_LM is:

L_{LM} = -\sum_{t=1}^{L} \log P(y_t \mid y_{<t}, S)
3. The student comment generation method based on multi-level semantic mining according to claim 1, characterized in that in step S2 aligning the structured data with the reference natural-language text by considering sequence-level semantic similarity specifically includes:

encoding the input form into a sequence, splicing the sequence with the teacher comment text into a new text X, and matching X against the generated text Y; regarding the generated text Y = (y_1, y_2, ..., y_n) and the joint text Y' = (y'_1, y'_2, ..., y'_m) as discrete distributions

\mu = \sum_{i=1}^{n} u_{y_i} \delta_{y_i}, \qquad \sigma = \sum_{j=1}^{m} u_{y'_j} \delta_{y'_j},

the text-alignment loss is

L_{TM} = \min_{T \in U(\mu, \sigma)} \sum_{i=1}^{n} \sum_{j=1}^{m} T_{ij} \, d(y_i, y'_j), \qquad d(y_i, y'_j) = \|y_i - y'_j\|^2

where \delta_{y_i} and \delta_{y'_j} are the point masses of the discrete distributions of y and y', u_{y_i} and u_{y'_j} their probability weights, U(\mu, \sigma) the set of joint distributions of \mu and \sigma, and d(y_i, y'_j) the cost of transporting y to y'; y_i is the generated text at the i-th time step, y'_j the joint text at the j-th time step, m the length of the joint text, and n the length of the generated text.
4. The student comment generation method based on multi-level semantic mining according to claim 1, characterized in that step S3 extracts the table information contained in the model's predicted text, matches and compares it against the input table data, and checks the accuracy and reliability of the predicted text, specifically comprising the following steps:

extract the table information contained in the text and match it against the input table data: from the model-generated text y_{1:T}, extract all candidate fields f_i and corresponding values v_i, represented as records \hat{R} = {(f_i, v_i)}; the original input form records R serve as structured pseudo labels for the information-extraction module to learn from, with loss

L_{SR} = -\sum_{i=1}^{N} \sum_{j=1}^{T_i} \sum_{k=1}^{K} r_{i,j,k} \log \hat{r}_{i,j,k}

where N is the number of sentences, T_i the length of each sentence, K the number of tags, r_{i,j,k} the original input-form record label, and \hat{r}_{i,j,k} the predicted record.
5. The student comment generation method based on multi-level semantic mining according to claim 1, characterized in that step S4 obtains fluent, coherent, and accurate comment text through semantic-similarity prediction and sentence-order prediction, specifically comprising the following steps:

cosine similarity is used to predict the semantic similarity between every pair of sentences; the closer the result is to 1, the closer the semantics of the two sentences:

Sim(y_i, y_j) = \frac{\sum_{k=1}^{N} h_k^{y_i} h_k^{y_j}}{\sqrt{\sum_{k=1}^{N} (h_k^{y_i})^2} \sqrt{\sum_{k=1}^{N} (h_k^{y_j})^2}}

where h_k^{y_i} and h_k^{y_j} are the representations of the k-th words of the two sentences (y_i, y_j), N is the maximum total number of words in a sentence, and L_Sim is the semantic-similarity prediction loss;
for sentence-order prediction, a special character "<sen>" is first inserted at the end of each sentence; its corresponding representation h_i^{<sen>} is used to infer the discourse relations with other sentences. Known ordered text first guides model training, with existing teacher-written comments as the reference text of the task and shuffled teacher-written comments as input:

q_{ij} = \sigma\!\left(W_o \left[ h_i^{<sen>}; h_j^{<sen>} \right]\right), \qquad L_D = -\sum_{i \neq j} \left( g_{ij} \log q_{ij} + (1 - g_{ij}) \log (1 - q_{ij}) \right)

where L_D is the order-prediction loss, q_ij the predicted score for judging the order, g_ij a label that is 1 when y_i precedes y_j and 0 otherwise, and W_o a parameter matrix;
the sentence order of the predicted text is judged with a K-means clustering algorithm: first select k sentences as cluster centers, compute the distances from all samples to the cluster centers, assign each sample to a cluster by distance, update the cluster centers by averaging, and repeat until the centers no longer change; the distance d(y, C_j) between each sample y and cluster center C_j is also computed with the cosine-distance formula, and the cluster centers C_j are then updated through an argmin function, so the loss function of the K-means clustering algorithm is:

C_j = \arg\min_{C} \sum_{y \in C_j} d(y, C), \qquad L_{Dis} = \sum_{j=1}^{k} \sum_{y \in C_j} d(y, C_j)
the loss function of the sentence iterative-interaction module is:

L_{SEN} = L_{Sim} + L_{Dis}
and the updated total loss of the model is:

L = L_{LM} + \lambda_1 L_{SR} + \lambda_2 L_{TM} + \lambda_3 L_{SEN}

where \lambda_1, \lambda_2, and \lambda_3 are the scale-factor hyperparameters of the table-reconstruction loss L_SR, the table-text content matching loss L_TM, and the sentence iterative-interaction loss L_SEN, used to control their relative importance.
CN202311550403.9A (priority date 2023-11-20, filing date 2023-11-20) — Student comment generation method based on multi-level semantic mining — Pending — CN117556789A (en)

Priority Applications (1)

    • CN202311550403.9A — priority date 2023-11-20, filing date 2023-11-20 — Student comment generation method based on multi-level semantic mining

Applications Claiming Priority (1)

    • CN202311550403.9A — priority date 2023-11-20, filing date 2023-11-20 — Student comment generation method based on multi-level semantic mining

Publications (1)

    • CN117556789A — published 2024-02-13

Family ID: 89820108

Family Applications (1)

    • CN202311550403.9A — Student comment generation method based on multi-level semantic mining

Country Status (1)

    • CN — CN117556789A (en)


Legal Events

PB01 — Publication
SE01 — Entry into force of request for substantive examination