CN110991160A - Intelligent automatic creation system for study-abroad documents - Google Patents

Intelligent automatic creation system for study-abroad documents

Info

Publication number
CN110991160A
CN110991160A (application CN201911353042.2A)
Authority
CN
China
Prior art keywords
training
model
study
data
document
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201911353042.2A
Other languages
Chinese (zh)
Inventor
和逸伦
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Suzhou Tomorrow Singularity Education Technology Co Ltd
Original Assignee
Suzhou Tomorrow Singularity Education Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Suzhou Tomorrow Singularity Education Technology Co Ltd
Priority to CN201911353042.2A
Publication of CN110991160A
Legal status: Pending

Landscapes

  • Machine Translation (AREA)

Abstract

The invention discloses an intelligent automatic creation system for study-abroad documents, which relates to the technical field of natural language processing applications and comprises data preprocessing, model construction, model training, auxiliary labeling and new-document generation, wherein the data preprocessing comprises loading data, converting data and dividing the data into mini-batches, and the model construction comprises an input layer, an LSTM layer, an output layer, a training error, a loss rate (loss) and an optimizer. With the disclosed system, a user who intends to study abroad only needs to input personalized data, such as undergraduate school and major, target school and major, university grades, English scores, personal abilities, skills, talents and hobbies, to quickly generate a high-quality application document. This greatly reduces application failures caused by poor document quality during the application process and has promising prospects for safeguarding the many university students who wish to study abroad.

Description

Intelligent automatic creation system for study-abroad documents
Technical Field
The invention relates to the technical field of natural language processing applications, in particular to an intelligent automatic creation system for study-abroad documents.
Background
At present there are general-purpose creation platforms such as the Baidu AI intelligent creation platform and the Alibaba creation platform, but their technical systems are oriented toward news writing, shopping content, hot-topic tracking and the like, and no vertically customized solution exists for creating study-abroad application documents; the industry therefore urgently needs a dedicated system capable of intelligent automatic creation of study-abroad documents. The existing technology lacks excellent machine-learning algorithms and large-scale training data for information extraction, student-information mining and the interpretation of schools and majors, and cannot meet the demands of document creation in knowledge extraction, knowledge-graph construction and strategy training. The intelligent automatic creation system for study-abroad documents is therefore provided.
Disclosure of Invention
The purpose of the invention is to provide an intelligent automatic creation system for study-abroad documents that can quickly generate a high-quality application document, greatly reducing application failures caused by poor document quality during the application process and safeguarding university students who wish to study abroad.
In order to achieve the above purpose, the invention provides the following technical scheme: an intelligent automatic creation system for study-abroad documents, comprising data preprocessing, model construction, model training, auxiliary labeling and new-document generation, wherein the data preprocessing comprises loading data, converting data and dividing the data into mini-batches, and the model construction comprises an input layer, an LSTM layer, an output layer, a training error, a loss rate (loss) and an optimizer.
Preferably, the most important function of the data preprocessing is to establish a dictionary and a reverse dictionary: a text file is used as input to train an RNN model, which is then used to generate text similar to the training data, and a dictionary (word -> ID) and a reverse dictionary (ID -> word) for every word are obtained from the training sample (100,000 study-abroad documents); each article is turned into a vector of IDs through the dictionary, the ID vector is then converted into word vectors via an embedding lookup (embedding_lookup), and the training labels (train_label) are obtained by shifting the training data (train_data) back one position.
Preferably, in the model construction, the LSTM basic model is generated with the tf.nn.rnn_cell.BasicLSTMCell provided by TensorFlow, and finally sequence_loss_by_example is used to obtain the loss function as the training target; the network model has 512 LSTM cells, and model parameters, constants and training parameters are set to train the model. At each step of training, 3 symbols are retrieved from the training data and converted into integers to form an input vector; after the symbols are converted into an integer vector in the input-dictionary format, optimization is performed, and the accuracy and loss during training are accumulated to monitor the training process. Typically 50,000 iterations are sufficient to reach acceptable accuracy; a prediction and accuracy instance is recorded once per training interval (every 1,000 steps), the loss and optimizer are designed accordingly, and the accuracy of the LSTM can be improved by adding layers.
Preferably, the model training processes all the study-abroad documents and finally converts them, through build_dataset(), into a dictionary, a document vector and a reverse dictionary, yielding the preprocessed document set; a 2-layer LSTM framework is adopted, each layer with 128 hidden-layer nodes and batch_size set to 64, and notably the training data are shuffled at the end of every training epoch. Generating the output looks simple, but in practice the LSTM produces a 112-element prediction probability vector for the next symbol and normalizes it with the softmax() function, and the index of the element with the highest probability value is the predicted symbol's index in the reverse dictionary.
Preferably, the auxiliary labeling annotates feature-vector information of each category, such as major, school, rules drawn from past experience, the current state of the profession, the profession's development history and direction, the student's personal history, personal interests and hobbies, and personal cultivation; this further labeling improves the construction efficiency of the training data set by 8 times on average and helps the automatic document-writing system better understand the composition, internal logic, grammar and polishing requirements of excellent documents, so that the model's understanding is optimized more quickly and accurately. In natural language processing many tasks can be converted into sequence labeling tasks, i.e. classifying word/character sequences, such as Named Entity Recognition (NER), part-of-speech tagging and event extraction; named entity recognition is described here. NER refers to recognizing entities with specific meanings in text, mainly including school names, major names, personal hobbies, undergraduate school and major, internship experience, English scores, project experience and the like; NER is an important basic tool in application fields such as information extraction, question-answering systems, syntactic analysis and machine translation, and serves as an important step of structured information extraction. For the labeling model, a CRF model is adopted to perform the sequence labeling task, and in the labeling part a CRF layer implements the sequence labeling.
Preferably, the new document is generated from the series of parameters saved during the long model training; the parameters are used to generate text: when a character is input, the model predicts the next character, and feeding the new character back into the model generates characters continuously, thereby forming a text. To reduce noise, the 5 most probable predictions are kept and one is selected at random; for example, when h is input and the five most probable results are [o, e, i, u, b], one of the five is chosen at random as the new character, and this random factor reduces noise in the generation. The first 32 predicted values of the sample-generated document are taken; if another sequence is input, i.e. customized according to the user's personalized information, another study-abroad document is generated automatically.
Preferably, the intelligent automatic creation system for study-abroad documents comprises the following steps:
step one, preprocessing the data, including loading data, converting data, dividing the data into mini-batches, and establishing the dictionary and reverse-dictionary functions;
step two, constructing the model from the data, the model construction comprising an input layer, an LSTM layer, an output layer, a training error, a loss rate (loss) and an optimizer;
step three, training the model according to the constructed model, the model training comprising a two-layer LSTM framework and training on the study-abroad documents;
step four, performing auxiliary labeling on the feature-vector information of each category, including named entity recognition and a sequence labeling task performed with a CRF (conditional random field) model;
and step five, finally generating a new document, including taking the first 32 predicted values of the generated study-abroad document.
Compared with the prior art, the invention has the following beneficial effects: with the intelligent automatic creation system for study-abroad documents, a user who intends to study abroad only needs to input personalized data, such as undergraduate school and major, target school and major, university grades, English scores, personal abilities, skills, talents and hobbies, to quickly generate a high-quality application document; this greatly reduces application failures caused by poor document quality during the application process, safeguards the many university students who wish to study abroad, and has promising application prospects.
Drawings
FIG. 1 is a flow chart of the intelligent automatic creation system for study-abroad documents of the present invention.
Detailed Description
In order to make the technical means, creative features, objectives and effects of the invention easy to understand, the invention is further described below in connection with specific embodiments.
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Referring to fig. 1, the present invention provides a technical solution: an intelligent automatic creation system for study-abroad documents, comprising data preprocessing, model construction, model training, auxiliary labeling and new-document generation, wherein the data preprocessing comprises loading data, converting data and dividing the data into mini-batches, and the model construction comprises an input layer, an LSTM layer, an output layer, a training error, a loss rate (loss) and an optimizer.
As shown at S10 in fig. 1, the most important function of the data preprocessing is to establish a dictionary and a reverse dictionary: an RNN model is trained using a text file as input and then used to generate text similar to the training data, and a dictionary (word -> ID) and a reverse dictionary (ID -> word) for every word are obtained from the training sample (100,000 study-abroad documents); each article is turned into a vector of IDs through the dictionary, the ID vector is then converted into word vectors via an embedding lookup (embedding_lookup), and the training labels (train_label) are obtained by shifting the training data (train_data) back one position.
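As an illustration of this step, the following is a minimal Python sketch; the build_dataset name follows the description, while the sample corpus, vocabulary size and <UNK> handling are assumptions added for the example.

```python
# A minimal sketch of the dictionary / reverse-dictionary preprocessing.
import collections

def build_dataset(words, vocab_size=5000):
    """Build a word->ID dictionary and an ID->word reverse dictionary."""
    counts = [("<UNK>", -1)]
    counts.extend(collections.Counter(words).most_common(vocab_size - 1))
    dictionary = {word: i for i, (word, _) in enumerate(counts)}
    reverse_dictionary = {i: word for word, i in dictionary.items()}
    # Turn the article into a vector of IDs; unknown words map to <UNK>.
    data = [dictionary.get(word, 0) for word in words]
    return data, dictionary, reverse_dictionary

words = "I want to study computer science abroad".split()
train_data, dictionary, reverse_dictionary = build_dataset(words, vocab_size=10)
# The training labels are the training data shifted back by one position.
train_label = train_data[1:] + train_data[:1]
```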
As shown at S20 in fig. 1, in the model construction, the LSTM basic model is generated with the tf.nn.rnn_cell.BasicLSTMCell provided by TensorFlow, and finally sequence_loss_by_example is used to obtain the loss function as the training target; the network model has 512 LSTM cells, and model parameters, constants and training parameters are set to train the model. At each step of training, 3 symbols are retrieved from the training data and converted into integers to form an input vector; after the symbols are converted into an integer vector in the input-dictionary format, optimization is performed, and the accuracy and loss during training are accumulated to monitor the training process. Typically 50,000 iterations are sufficient to reach acceptable accuracy; a prediction and accuracy instance is recorded once per training interval (every 1,000 steps), the loss and optimizer are designed accordingly, and the accuracy of the LSTM can be improved by adding layers.
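A sketch of this construction is given below, assuming the TensorFlow 1.x API named in the description (tf.nn.rnn_cell.BasicLSTMCell and sequence_loss_by_example from tf.contrib.legacy_seq2seq); the tensor shapes and the choice of the Adam optimizer are assumptions, since the patent does not reproduce its actual code.

```python
# Model construction sketch (TensorFlow 1.x assumed).
import tensorflow as tf

lstm_size, num_steps, batch_size, vocab_size = 512, 50, 64, 5000

inputs = tf.placeholder(tf.int32, [batch_size, num_steps])
targets = tf.placeholder(tf.int32, [batch_size, num_steps])

# Input layer: look up word vectors for the integer IDs.
embedding = tf.get_variable("embedding", [vocab_size, lstm_size])
emb_inputs = tf.nn.embedding_lookup(embedding, inputs)

# LSTM layer: the basic cell named in the description, with 512 units.
cell = tf.nn.rnn_cell.BasicLSTMCell(lstm_size)
outputs, state = tf.nn.dynamic_rnn(cell, emb_inputs, dtype=tf.float32)

# Output layer: project the hidden states onto vocabulary logits.
logits = tf.layers.dense(tf.reshape(outputs, [-1, lstm_size]), vocab_size)

# Loss: sequence_loss_by_example as the training target.
loss = tf.contrib.legacy_seq2seq.sequence_loss_by_example(
    [logits], [tf.reshape(targets, [-1])],
    [tf.ones([batch_size * num_steps])])
cost = tf.reduce_sum(loss) / batch_size
train_op = tf.train.AdamOptimizer(0.001).minimize(cost)  # optimizer assumed
```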
As shown at S30 in fig. 1, the model training processes all the study-abroad documents and finally converts them, through build_dataset(), into a dictionary, a document vector and a reverse dictionary, yielding the preprocessed document set; a 2-layer LSTM framework is adopted, each layer with 128 hidden-layer nodes and batch_size set to 64, and notably the training data are shuffled at the end of every training epoch. Generating the output looks simple, but in practice the LSTM produces a 112-element prediction probability vector for the next symbol and normalizes it with the softmax() function, and the index of the element with the highest probability value is the predicted symbol's index in the reverse dictionary.
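The two-layer stacking, per-epoch shuffle and reverse-dictionary lookup described here can be sketched as follows; the MultiRNNCell wiring and the NumPy batching helper are assumptions about how the stated configuration might be assembled.

```python
# Sketch of the 2-layer, 128-unit training configuration (TensorFlow 1.x assumed).
import numpy as np
import tensorflow as tf

lstm_size, num_layers, batch_size = 128, 2, 64

# Stack two 128-unit LSTM cells, as the description specifies.
cells = [tf.nn.rnn_cell.BasicLSTMCell(lstm_size) for _ in range(num_layers)]
stacked_cell = tf.nn.rnn_cell.MultiRNNCell(cells)

def epoch_batches(data, labels, batch_size=64):
    """Shuffle the training data once per epoch, then yield mini-batches.
    `data` and `labels` are NumPy arrays of equal length."""
    order = np.random.permutation(len(data))
    for start in range(0, len(data) - batch_size + 1, batch_size):
        idx = order[start:start + batch_size]
        yield data[idx], labels[idx]

def predicted_symbol(prob_vector, reverse_dictionary):
    """The index of the largest softmax probability is the predicted
    symbol's index in the reverse dictionary."""
    return reverse_dictionary[int(np.argmax(prob_vector))]
```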
As shown at S40 in fig. 1, the auxiliary labeling annotates feature-vector information of each category, such as major, school, rules drawn from past experience, the current state of the profession, the profession's development history and direction, the student's personal history, personal interests and hobbies, and personal cultivation; this further labeling improves the construction efficiency of the training data set by 8 times on average and helps the automatic document-writing system better understand the composition, internal logic, grammar and polishing requirements of excellent documents, so that the model's understanding is optimized more quickly and accurately. In natural language processing many tasks can be converted into sequence labeling tasks, i.e. classifying word/character sequences, such as Named Entity Recognition (NER), part-of-speech tagging and event extraction; named entity recognition is described here. NER refers to recognizing entities with specific meanings in text, mainly including school names, major names, personal hobbies, undergraduate school and major, internship experience, English scores, project experience and the like; NER is an important basic tool in application fields such as information extraction, question-answering systems, syntactic analysis and machine translation, and serves as an important step of structured information extraction. For the labeling model, a CRF model is adopted to perform the sequence labeling task, and in the labeling part a CRF layer implements the sequence labeling.
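For the labeling component, a plausible sketch is a bidirectional LSTM feeding a CRF layer, matching the Bi-LSTM and CRF terms defined later in this description; it assumes TensorFlow 1.x (tf.contrib.crf), and the tag count and dimensions are illustrative only.

```python
# Bi-LSTM + CRF sequence labeling sketch (TensorFlow 1.x assumed).
import tensorflow as tf

num_tags, hidden_size, vocab_size, emb_dim = 9, 128, 5000, 100

word_ids = tf.placeholder(tf.int32, [None, None])   # [batch, seq_len]
tag_ids = tf.placeholder(tf.int32, [None, None])    # gold NER tags
seq_lens = tf.placeholder(tf.int32, [None])         # true sequence lengths

embedding = tf.get_variable("ner_embedding", [vocab_size, emb_dim])
emb = tf.nn.embedding_lookup(embedding, word_ids)

# Bidirectional LSTM encoder over the sentence.
fw = tf.nn.rnn_cell.BasicLSTMCell(hidden_size)
bw = tf.nn.rnn_cell.BasicLSTMCell(hidden_size)
(out_fw, out_bw), _ = tf.nn.bidirectional_dynamic_rnn(
    fw, bw, emb, sequence_length=seq_lens, dtype=tf.float32)
scores = tf.layers.dense(tf.concat([out_fw, out_bw], axis=-1), num_tags)

# CRF layer: maximize the log-likelihood of the gold tag sequences.
log_lik, transition = tf.contrib.crf.crf_log_likelihood(scores, tag_ids, seq_lens)
crf_loss = tf.reduce_mean(-log_lik)
# Decoding at prediction time: tf.contrib.crf.crf_decode(scores, transition, seq_lens)
```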
As shown at S50 in fig. 1, the new document is generated from the series of parameters saved during the long model training; the parameters are used to generate text: when a character is input, the model predicts the next character, and feeding the new character back into the model generates characters continuously, thereby forming a text. To reduce noise, the 5 most probable predictions are kept and one is selected at random; for example, when h is input and the five most probable results are [o, e, i, u, b], one of the five is chosen at random as the new character, and this random factor reduces noise in the generation. The first 32 predicted values of the sample-generated document are taken; if another sequence is input, i.e. customized according to the user's personalized information, another study-abroad document is generated automatically.
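The top-5 random selection can be sketched in a few lines of NumPy; the function name pick_top_n and the example probability vector are assumptions for illustration.

```python
# Sketch of the top-5 sampling used to reduce noise during generation.
import numpy as np

def pick_top_n(preds, reverse_dictionary, top_n=5):
    """Keep the 5 most probable next symbols, renormalize their
    probabilities, and pick one at random as the new character."""
    preds = np.asarray(preds, dtype=np.float64).squeeze()
    keep = np.argsort(preds)[-top_n:]         # indices of the top-5 predictions
    probs = preds[keep] / preds[keep].sum()   # renormalize over the top-5
    choice = np.random.choice(keep, p=probs)
    return reverse_dictionary[int(choice)]

# Example: after inputting 'h', suppose the five most probable results
# are [o, e, i, u, b]; one of them is returned at random.
reverse_dictionary = {0: "o", 1: "e", 2: "i", 3: "u", 4: "b", 5: "x"}
preds = [0.30, 0.25, 0.20, 0.12, 0.08, 0.05]
print(pick_top_n(preds, reverse_dictionary))
```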
The deep neural network platform TensorFlow used in the invention was originally developed by researchers and engineers of the Google Brain team (part of Google's machine intelligence research organization) for machine learning and deep neural network research.
The recurrent neural network RNN:
RNN is a very popular model that has proven powerful in many NLP tasks. The RNN-based language model (RNNLM) has two applications: first, scoring each sequence by its likelihood of occurring in the real world, which in effect measures grammatical and semantic correctness (such a language model is typically part of a machine translation system); second, generating new text.
The long short-term memory model, LSTM, is short for Long Short-Term Memory;
the LSTM is a special RNN model proposed to solve the vanishing-gradient problem of the RNN model; traditional RNNs are trained with BPTT, and over long time spans the residual error to be propagated back shrinks exponentially, so the network weights update slowly and the long-term memory effect of the RNN cannot be realized; a storage unit is therefore needed to store memory, and the LSTM model was proposed;
Bi-LSTM is short for Bi-directional Long Short-Term Memory Units and refers to a bidirectional LSTM; CRF is short for Conditional Random Field; the intelligent automatic creation system for study-abroad documents refers to the Singularity study-abroad intelligent document self-creation system.
1. Parameter optimization
Before model training, some parameters are initialized, mainly: batch_size, the number of sequences in a single batch, set to 64; num_steps, the number of characters in a single sequence, set to 50; lstm_size, the number of hidden-layer nodes, set to 128; num_layers, the number of LSTM layers, set to 3; learning_rate, set to 0.001; and keep_prob, the proportion of nodes retained by the dropout layer during training, set to 80%.
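Collected in one place, these initializations might look as follows; this is a sketch only, as the patent does not show its configuration code.

```python
# Initial hyperparameters as listed above.
batch_size = 64        # number of sequences in a single batch
num_steps = 50         # number of characters in a single sequence
lstm_size = 128        # number of hidden-layer nodes
num_layers = 3         # number of LSTM layers
learning_rate = 0.001
keep_prob = 0.8        # proportion of nodes retained by dropout during training
```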
2. Optimization of training models
An RNN can run into exploding and vanishing gradients; LSTM solves the vanishing-gradient problem, but gradients can still explode, so we adopt gradient clipping to prevent gradient explosion. That is, a threshold is set, and whenever the gradient exceeds this threshold it is reset to the threshold size, which ensures the gradient never becomes very large. In addition, the optimization algorithm applies this clipping and gradually decreases the learning rate.
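Under TensorFlow 1.x, the described clipping is conventionally done with tf.clip_by_global_norm; the sketch below assumes a threshold of 5.0 and a stand-in loss, since the patent states the technique but not the values.

```python
# Gradient-clipping sketch (TensorFlow 1.x assumed).
import tensorflow as tf

learning_rate, grad_clip = 0.001, 5.0  # clip threshold is an assumption
w = tf.get_variable("w", [10, 10])
cost = tf.reduce_sum(tf.square(w))     # stand-in for the model's training loss

# Reset gradients that exceed the threshold back to the threshold size.
tvars = tf.trainable_variables()
grads, _ = tf.clip_by_global_norm(tf.gradients(cost, tvars), grad_clip)
train_op = tf.train.AdamOptimizer(learning_rate).apply_gradients(zip(grads, tvars))
```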
3. Personalized assisted labeling
Unlike traditional general-purpose labeling schemes, the system adopts a labeling scheme personalized for study-abroad documents, annotating feature-vector information of each category such as major, school, rules drawn from past experience, the current state of the profession, the profession's development history and direction, the student's personal history, personal interests and hobbies, and personal cultivation; this further labeling improves the construction efficiency of the training data set by 8 times on average and helps the automatic document-writing system better understand the composition, internal logic, grammar and polishing requirements of excellent documents, so that the model's understanding is optimized more quickly and accurately.
The main steps of the intelligent automatic creation system for study-abroad documents are as follows:
step one, preprocessing the data, including loading data, converting data, dividing the data into mini-batches, and establishing the dictionary and reverse-dictionary functions;
step two, constructing the model from the data, the model construction comprising an input layer, an LSTM layer, an output layer, a training error, a loss rate (loss) and an optimizer;
step three, training the model according to the constructed model, the model training comprising a two-layer LSTM framework and training on the study-abroad documents;
step four, performing auxiliary labeling on the feature-vector information of each category, including named entity recognition and a sequence labeling task performed with a CRF (conditional random field) model;
and step five, finally generating a new document, including taking the first 32 predicted values of the generated study-abroad document.
Although embodiments of the present invention have been shown and described, it will be appreciated by those skilled in the art that changes, modifications, substitutions and alterations can be made in these embodiments without departing from the principles and spirit of the invention, the scope of which is defined in the appended claims and their equivalents.

Claims (7)

1. An intelligent automatic creation system for study-abroad documents, comprising data preprocessing, model construction, model training, auxiliary labeling and new-document generation, characterized in that: the data preprocessing comprises loading data, converting data and dividing the data into mini-batches, and the model construction comprises an input layer, an LSTM layer, an output layer, a training error, a loss rate (loss) and an optimizer.
2. The intelligent automatic creation system for study-abroad documents according to claim 1, characterized in that: the most important function of the data preprocessing is to establish a dictionary and a reverse dictionary: a text file is used as input to train an RNN model, which is then used to generate text similar to the training data, and a dictionary (word -> ID) and a reverse dictionary (ID -> word) for every word are obtained from the training sample (100,000 study-abroad documents); each article is turned into a vector of IDs through the dictionary, the ID vector is then converted into word vectors via an embedding lookup (embedding_lookup), and the training labels (train_label) are obtained by shifting the training data (train_data) back one position.
3. The intelligent automatic creation system for study-abroad documents according to claim 1, characterized in that: in the model construction, the LSTM basic model is generated with the tf.nn.rnn_cell.BasicLSTMCell provided by TensorFlow, and finally sequence_loss_by_example is used to obtain the loss function as the training target; the network model has 512 LSTM cells, and model parameters, constants and training parameters are set to train the model; at each step of training, 3 symbols are retrieved from the training data and converted into integers to form an input vector; after the symbols are converted into an integer vector in the input-dictionary format, optimization is performed, and the accuracy and loss during training are accumulated to monitor the training process; typically 50,000 iterations are sufficient to reach acceptable accuracy; a prediction and accuracy instance is recorded once per training interval (every 1,000 steps), the loss and optimizer are designed accordingly, and the accuracy of the LSTM can be improved by adding layers.
4. The intelligent automatic creation system for study-abroad documents according to claim 1, characterized in that: the model training processes all the study-abroad documents and finally converts them, through build_dataset(), into a dictionary, a document vector and a reverse dictionary, yielding the preprocessed document set; a 2-layer LSTM framework is adopted, each layer with 128 hidden-layer nodes and batch_size set to 64, and notably the training data are shuffled at the end of every training epoch; generating the output looks simple, but in practice the LSTM produces a 112-element prediction probability vector for the next symbol and normalizes it with the softmax() function, and the index of the element with the highest probability value is the predicted symbol's index in the reverse dictionary.
5. The intelligent automatic creation system for study-abroad documents according to claim 1, characterized in that: the auxiliary labeling annotates feature-vector information of each category, such as major, school, rules drawn from past experience, the current state of the profession, the profession's development history and direction, the student's personal history, personal interests and hobbies, and personal cultivation; this further labeling improves the construction efficiency of the training data set by 8 times on average and helps the automatic document-writing system better understand the composition, internal logic, grammar and polishing requirements of excellent documents, so that the model's understanding is optimized more quickly and accurately; in natural language processing many tasks can be converted into sequence labeling tasks, i.e. classifying word/character sequences, such as Named Entity Recognition (NER), part-of-speech tagging and event extraction, with named entity recognition described here; NER refers to recognizing entities with specific meanings in text, mainly including school names, major names, personal hobbies, undergraduate school and major, internship experience, English scores, project experience and the like; NER is an important basic tool in application fields such as information extraction, question-answering systems, syntactic analysis and machine translation, and serves as an important step of structured information extraction; for the labeling model, a CRF model is adopted to perform the sequence labeling task, and in the labeling part a CRF layer implements the sequence labeling.
6. The intelligent automatic creation system for study-abroad documents according to claim 1, characterized in that: the new document is generated from the series of parameters saved during the long model training; the parameters are used to generate text: when a character is input, the model predicts the next character, and feeding the new character back into the model generates characters continuously, thereby forming a text; to reduce noise, the 5 most probable predictions are kept and one is selected at random; for example, when h is input and the five most probable results are [o, e, i, u, b], one of the five is chosen at random as the new character, and this random factor reduces noise in the generation; the first 32 predicted values of the sample-generated document are taken, and if another sequence is input, i.e. customized according to the user's personalized information, another study-abroad document is generated automatically.
7. The intelligent automatic creation system for study-abroad documents according to claim 1, characterized in that the system comprises the following steps:
step one, preprocessing the data, including loading data, converting data, dividing the data into mini-batches, and establishing the dictionary and reverse-dictionary functions;
step two, constructing the model from the data, the model construction comprising an input layer, an LSTM layer, an output layer, a training error, a loss rate (loss) and an optimizer;
step three, training the model according to the constructed model, the model training comprising a two-layer LSTM framework and training on the study-abroad documents;
step four, performing auxiliary labeling on the feature-vector information of each category, including named entity recognition and a sequence labeling task performed with a CRF (conditional random field) model;
and step five, finally generating a new document, including taking the first 32 predicted values of the generated study-abroad document.
CN201911353042.2A 2019-12-25 2019-12-25 Intelligent automatic creation system for study-abroad documents Pending CN110991160A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911353042.2A CN110991160A (en) 2019-12-25 2019-12-25 Intelligent automatic creation system for study-abroad documents

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911353042.2A CN110991160A (en) 2019-12-25 2019-12-25 Intelligent automatic creation system for study-abroad documents

Publications (1)

Publication Number Publication Date
CN110991160A (en) 2020-04-10

Family

ID=70076495

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911353042.2A Pending CN110991160A (en) 2019-12-25 2019-12-25 Intelligent automatic creation system for study-abroad documents

Country Status (1)

Country Link
CN (1) CN110991160A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112818651A (en) * 2021-01-21 2021-05-18 北京明略软件系统有限公司 Intelligent recommendation writing method and system based on enterprise WeChat


Similar Documents

Publication Publication Date Title
CN108984745B (en) Neural network text classification method fusing multiple knowledge maps
CN107729309B (en) Deep learning-based Chinese semantic analysis method and device
CN111310471B (en) Travel named entity identification method based on BBLC model
CN110489523B (en) Fine-grained emotion analysis method based on online shopping evaluation
CN107871158A (en) A kind of knowledge mapping of binding sequence text message represents learning method and device
CN111966812B (en) Automatic question answering method based on dynamic word vector and storage medium
CN107145484A (en) A kind of Chinese word cutting method based on hidden many granularity local features
CN110263325A (en) Chinese automatic word-cut
CN111738002A (en) Ancient text field named entity identification method and system based on Lattice LSTM
CN110276069A (en) A kind of Chinese braille mistake automatic testing method, system and storage medium
CN116070602B (en) PDF document intelligent labeling and extracting method
CN111159412A (en) Classification method and device, electronic equipment and readable storage medium
CN112699685B (en) Named entity recognition method based on label-guided word fusion
CN113761868B (en) Text processing method, text processing device, electronic equipment and readable storage medium
CN111222318A (en) Trigger word recognition method based on two-channel bidirectional LSTM-CRF network
CN112836051B (en) Online self-learning court electronic file text classification method
CN111400494A (en) Sentiment analysis method based on GCN-Attention
CN111709225B (en) Event causal relationship discriminating method, device and computer readable storage medium
CN113505589A (en) BERT model-based MOOC learner cognitive behavior identification method
CN113051922A (en) Triple extraction method and system based on deep learning
CN110222338A (en) A kind of mechanism name entity recognition method
CN116245097A (en) Method for training entity recognition model, entity recognition method and corresponding device
CN113312918B (en) Word segmentation and capsule network law named entity identification method fusing radical vectors
CN111428502A (en) Named entity labeling method for military corpus
CN113312498B (en) Text information extraction method for embedding knowledge graph by undirected graph

Legal Events

Date Code Title Description
PB01 Publication
WD01 Invention patent application deemed withdrawn after publication (application publication date: 20200410)