CN109086270B - Automatic poetry making system and method based on ancient poetry corpus vectorization - Google Patents

Automatic poetry making system and method based on ancient poetry corpus vectorization

Info

Publication number
CN109086270B
CN109086270B (application CN201810817519.7A)
Authority
CN
China
Prior art keywords
corpus
poetry
word
words
vector
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201810817519.7A
Other languages
Chinese (zh)
Other versions
CN109086270A (en)
Inventor
铉静
何伟东
李良炎
何中市
吴琼
郭飞
张航
周泽寻
杜井龙
王路路
陈定定
许祥娟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chongqing University
Original Assignee
Chongqing University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chongqing University filed Critical Chongqing University
Priority to CN201810817519.7A priority Critical patent/CN109086270B/en
Publication of CN109086270A publication Critical patent/CN109086270A/en
Application granted granted Critical
Publication of CN109086270B publication Critical patent/CN109086270B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS › G06 COMPUTING; CALCULATING OR COUNTING › G06F ELECTRIC DIGITAL DATA PROCESSING › G06F40/00 Handling natural language data › G06F40/20 Natural language analysis
        • G06F40/205 Parsing › G06F40/211 Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
        • G06F40/205 Parsing › G06F40/216 Parsing using statistical methods
        • G06F40/279 Recognition of textual entities
    • G PHYSICS › G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS › G06N3/00 Computing arrangements based on biological models › G06N3/02 Neural networks › G06N3/04 Architecture, e.g. interconnection topology › G06N3/045 Combinations of networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Evolutionary Computation (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Biophysics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Probability & Statistics with Applications (AREA)
  • Machine Translation (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses an automatic poetry making system based on ancient poetry corpus vectorization, and a method thereof. The beneficial effects of the invention are: the machine can fully learn the meaning and mood in poetry, and when a poem is needed the required classical poem is obtained simply by inputting keywords into the trained neural network; the machine acquires the ability to compose poetry by learning from the experience of predecessors, and the result carries artistic aesthetic feeling while satisfying the rules of poetry.

Description

Automatic poetry making system and method based on ancient poetry corpus vectorization
Technical Field
The invention relates to the technical field of automatic poetry composition by computer, in particular to an automatic poetry making system and method based on ancient poetry corpus vectorization.
Background
With the continuous advance of computer technology and hardware computing power, artificial intelligence has come ever closer to people's expectations; for example, AlphaGo can surpass the human world champion at Go through computation. In creative and artistic fields, however, artificial intelligence is still not competent for the related work. Chinese classical poetry is a language art whose artistic value and literary achievement have long endured. Ancient poetry is at once regular and abstract: each poetic form prescribes its rhyme scheme and its level-and-oblique (平仄) tonal pattern, and matching each couplet takes even a practiced writer considerable time. These strict rules give ancient poetry its beauty of sound and rhythm, while the breadth and liveliness of Chinese culture mean that the meaning of every character can carry multiple contents and be understood differently by different people. The creation of ancient poetry therefore requires studying and absorbing the outstanding poems of predecessors before poetry rich in aesthetic feeling and mood can be made.
For computers and artificial intelligence, rule-governed work is easy to complete, but abstract creation and artistic aesthetics are the difficulties of machine-composed poetry: 1. how to vectorize natural language into a form the machine can read and understand while preserving the information it carries to the greatest extent; 2. what method can be used to compute with these vectors so that the computer simulates human processing of natural language; 3. how to construct a neural network model that represents the relationships in the text data appropriately at the least computational cost; 4. how to solve the training problem through network design, optimization methods and hyper-parameters so as to improve the final effect of the model; 5. if a picture is input for poetry composition, how to locate the scenes and themes in the picture and identify the names of the objects; 6. how to preserve the mood of the sample originally generated by the machine while examining and replacing characters for level/oblique and rhyme conformity. At present, parameters such as the learning rate of a neural network and the model construction require experience accumulated through continued practice before a parameter model suited to solving the problem is obtained.
Disclosure of Invention
In order to realize the goal of automatic poetry composition by machine, the invention provides an automatic poetry making system and method based on ancient poetry corpus vectorization. Every character of historically outstanding poems is converted into a corpus vector, and the relationships between corpus vectors are established at the same time, so that the machine can fully learn the meaning and mood in the poetry. When a poem is needed, the required classical poem is obtained by inputting key characters and words directly into the trained neural network; the ability to compose poetry is acquired through learning from the experience of predecessors, and the result satisfies the rules of poetry while possessing artistic aesthetic appeal.
In order to achieve the purpose, the invention adopts the following specific technical scheme:
an automatic poetry making system based on ancient poetry corpus vectorization comprises a corpus processing mechanism, a corpus vector library, an LSTM network model and a poetry screening mechanism;
the corpus processing mechanism is used for converting characters into corpus vectors and for operating on corpus vectors;
the corpus vector library is used for storing corpus vectors;
the LSTM network model is used for generating poetry drafts;
the poetry screening mechanism is used for checking the rhyme and the level-and-oblique (平仄) tonal pattern of the poetry draft;
the corpus processing mechanism is connected with the corpus vector library in a bidirectional mode, and the corpus processing mechanism, the LSTM network model and the poetry screening mechanism are connected in sequence.
Through this design, the automatic poetry making system learns the linguistic relationships and habits among the words of poetry through the LSTM network model. When a poem is needed, keywords are input into the system; the corpus processing mechanism recognizes and decomposes the keywords, the LSTM network model produces a poetry draft, and the poetry screening mechanism then selects the content that conforms to the rhyme and tonal rules as the finalized poem, yielding the final poetry result.
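As a concrete illustration of the flow just described, the following minimal sketch strings the three mechanisms together. Every name in it (the processor, poet and screener objects and their expand, generate and select methods) is assumed for illustration only and is not defined by the patent.

    def compose_poem(keywords, processor, poet, screener):
        # corpus processing mechanism: keywords -> poetry alternative words
        candidates = processor.expand(keywords)
        # LSTM network model: alternative words -> poetry drafts
        drafts = poet.generate(candidates)
        # poetry screening mechanism: enforce rhyme and level/oblique rules
        return screener.select(drafts)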
Further, the LSTM network model is a network model composed of two LSTM layers in series; its optimizer is stochastic gradient descent and its loss function is cross entropy.
A network model with two LSTM layers in series identifies the relationships between characters more accurately, but because it yields a larger volume of data, the amount of data retained after calculation should be reduced appropriately.
Preferably, the LSTM network model discards 20% of the total data after calculation (a dropout rate of 0.2), the learning rate is 0.01, and the number of iterations is 700.
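A minimal PyTorch sketch of this preferred configuration follows: two stacked LSTM layers with 20% dropout applied between them, stochastic gradient descent with learning rate 0.01, and a cross-entropy loss, to be run for 700 iterations. The 200-dimensional input matches the corpus vectors of step S1.3 below; the hidden size and vocabulary size are assumptions.

    import torch
    import torch.nn as nn

    class PoetryLSTM(nn.Module):
        def __init__(self, vocab_size, embed_dim=200, hidden_dim=256):
            super().__init__()
            # num_layers=2 gives the two serial LSTM layers; dropout=0.2
            # discards 20% of the activations between the two layers
            self.lstm = nn.LSTM(embed_dim, hidden_dim, num_layers=2,
                                dropout=0.2, batch_first=True)
            self.out = nn.Linear(hidden_dim, vocab_size)

        def forward(self, x):               # x: (batch, seq_len, embed_dim)
            h, _ = self.lstm(x)
            return self.out(h[:, -1, :])    # score for the next character

    model = PoetryLSTM(vocab_size=6000)     # vocabulary size is an assumption
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    loss_fn = nn.CrossEntropyLoss()
    # training loop: for step in range(700): forward, loss_fn, backward, step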
An automatic poetry method based on ancient poetry corpus vectorization comprises the following steps:
s1, inputting ancient poems to a corpus processing mechanism, wherein the corpus processing mechanism converts the characters of the ancient poems into corpus vectors and stores the corpus vectors into a corpus vector library;
s2, building an LSTM network model;
s3, inputting the corpus training set to the LSTM network model to complete the training of the LSTM network model;
s4, image words are input to a corpus processing mechanism, and the corpus processing mechanism calculates to obtain poetry alternative words according to the corpus vectors corresponding to each image word in a corpus vector library;
s5, the corpus processing mechanism inputs the poetry alternative words into the LSTM network model to obtain poetry draft;
S6, the poetry screening mechanism selects, from the poetry drafts and according to the rhyme and tonal rules of the poetic form, the poem that best conforms to the rules of poetry, obtaining the finalized poem, which is the automatically composed result.
Through this design, a large number of outstanding ancient poems enter the corpus processing mechanism and every character of every poem is vectorized, for example with a skip-gram model, to obtain corpus vectors, so that the computer can recognize the content associated with each character; the LSTM network model can then process the connection relationships between characters, achieving the purpose of understanding the meaning of each character and analysing the relationships between characters. The training process of the LSTM network model is the process of learning the characters; once training is complete, the model can compose simple poems, after which the regular elements such as rhyme and the level-and-oblique tonal pattern are processed on the poetry draft, finally completing the finalized poem.
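Since step S1 amounts to character-level skip-gram training with hierarchical softmax (the Huffman-tree procedure detailed in steps S1.1 to S1.7 below), the same kind of corpus vectors can also be obtained with an off-the-shelf implementation. The sketch below uses gensim under that reading; the file name and the character-level tokenization are assumptions.

    from gensim.models import Word2Vec

    # each poem becomes a list of its characters (character-level "words")
    poems = [list(line.strip())
             for line in open("ancient_poems.txt", encoding="utf-8")]

    model = Word2Vec(
        poems,
        vector_size=200,  # n = 200, within the claimed range [180, 220]
        sg=1,             # skip-gram
        hs=1,             # hierarchical softmax, i.e. the Huffman-tree method
        window=2,         # adjacent context characters
        min_count=1,
    )
    vector = model.wv["山"]  # the 200-dimensional corpus vector of one character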
Further described, the specific content of step S1 is as follows:
S1.1, ancient poems are input to the corpus processing mechanism, which segments every character appearing in the ancient poems and records them as m distinct, non-repeating characters, a character that appears more than once being recorded as the same non-repeating character;
S1.2, for each non-repeating character, its frequency of occurrence and the characters adjacent to it in context in each poem are counted;
S1.3, the corpus processing mechanism assigns each non-repeating character a random n-dimensional vector; this n-dimensional vector is the corpus vector of that character and is stored correspondingly in the corpus vector library, where n ∈ [180, 220] and n is an integer;
S1.4, a Huffman tree is constructed comprising end nodes and intermediate nodes: each end node is a child of an intermediate node and each intermediate node has exactly 2 child nodes; each end node points to the corpus vector of one non-repeating character in the corpus vector library and records as its node value the number of occurrences of that character; each intermediate node records as its node value the sum of the node values of its children; end nodes with larger node values lie closer to the root node, and the root node is the intermediate node with the largest node value;
S1.5, the selection probability, on the Huffman tree, of the characters adjacent in context to a corpus vector x is:

p(context|x) = ∏_i p_i

where p_i is the probability of selecting the first child at the i-th intermediate node on the path through the Huffman tree; writing d_i = 0 when the path takes the first child at node i and d_i = 1 otherwise,

p_i = σ(xᵀθ_i)^(1−d_i) · (1 − σ(xᵀθ_i))^(d_i),  with σ(z) = 1/(1 + e^(−z)),

where x is the corpus vector input at the node and θ_i is the weight of the corpus vector at the i-th intermediate node;

S1.6, using the gradient method (descent on the negative log-likelihood, equivalently ascent on L = Σ_i log p_i), partial derivatives are computed with respect to θ_i and x in turn:

first, the partial derivative with respect to θ_i, with the corresponding update at learning rate η:

∂L/∂θ_i = (1 − d_i − σ(xᵀθ_i)) · x,  θ_i ← θ_i + η(1 − d_i − σ(xᵀθ_i)) · x;

after each new θ_i has been updated into p(context|x), the partial derivative with respect to x is computed:

∂L/∂x = Σ_i (1 − d_i − σ(xᵀθ_i)) · θ_i,  x ← x + η Σ_i (1 − d_i − σ(xᵀθ_i)) · θ_i,

and the new x is stored back into the corpus vector library;

S1.7, an un-updated corpus vector x is reselected and the procedure returns to step S1.5 until every corpus vector x in the corpus vector library has been updated once, yielding the new corpus vector library.
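A minimal NumPy sketch of steps S1.5 and S1.6 as reconstructed above: for one (centre character, context character) pair, walk the Huffman path, update each path weight θ_i first, then update the corpus vector x. The data structures are assumptions.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def train_pair(x, path, lr=0.025):
        """x: corpus vector of the centre character, shape (n,).
        path: list of (theta, d) pairs along the Huffman path to one
        context character, where theta is the weight vector of the i-th
        intermediate node and d is 0 for the first child, 1 otherwise."""
        x_grad = np.zeros_like(x)
        for theta, d in path:
            g = lr * (1.0 - d - sigmoid(x @ theta))
            x_grad += g * theta   # accumulate the update for x ...
            theta += g * x        # ... and update theta_i in place
        x += x_grad               # finally update the corpus vector
        return x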
Through this design, the n-dimensional vector of each non-repeating character is initially random, but after the partial-derivative updates the corpus vectors correspond one-to-one with the content of the Huffman tree; that is, each corpus vector encodes the character's frequency in the input ancient poems and the characters adjacent to it in each poem. The larger n is, the richer the encoded information, but the greater the amount of calculation; computing partial derivatives with respect to both x and θ_i records the paths in the Huffman tree more accurately, making the calculation more precise.
Further, the corpus training set is a set composed of 80% of the corpus vectors in the corpus vector library, ordered according to the word order of the corresponding ancient poems;
the corpus training set is divided in a 9:1 ratio into a training corpus and a verification corpus, where the training corpus is used to train the LSTM network model and adjust its parameter settings, and the verification corpus is used to verify and calibrate the model after training and adjustment.
The data of the corpus vector library are thus divided into training and verification corpora: the training corpus is input for learning during training, and the verification corpus is input after learning to check the learning effect, until a good learning effect is achieved.
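A short sketch of the data preparation just described, assuming `corpus_vectors` is the list of corpus vectors ordered by word order in the poems:

    def split_corpus(sequence):
        train_set = sequence[: int(0.8 * len(sequence))]  # 80% for training
        cut = int(0.9 * len(train_set))                   # 9:1 inside it
        return train_set[:cut], train_set[cut:]           # train, verification

    training_corpus, verification_corpus = split_corpus(corpus_vectors)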
Further, the image words in step S4 are obtained by inputting images to the image feature extraction model, and the specific method is as follows:
s4.1, inputting an image to an image feature extraction model, wherein the image feature extraction model extracts image words from the image;
S4.2, the corpus processing mechanism matches each image word, one by one, to the corresponding corpus vector in the corpus vector library; these corpus vectors are the poetry alternative words.
The image words can also be keywords entered manually, in which case the corpus processing mechanism recognizes them and matches them to the corresponding corpus vectors in the corpus vector library. Alternatively, an image feature extraction model can be added: it extracts the scenes in an image and converts them into words, so that merely inputting an image into the model yields the words of the image's key scenes, which the corpus processing mechanism then processes into poetry alternative words.
Further described, the image feature extraction model is an improved VGG-16 convolutional neural network model, and comprises a convolutional layer group 1, a pooling layer, a convolutional layer group 2, a pooling layer, a convolutional layer group 3, a pooling layer, a convolutional layer group 4, a pooling layer, a convolutional layer group 5, a pooling layer, 2 convolutional layers, a Bounding-box layer and a Softmax layer, which are connected in sequence, wherein the convolutional layer group 1 and the convolutional layer group 2 are both composed of 2 convolutional layers connected in series, the convolutional layer group 3, the convolutional layer group 4 and the convolutional layer group 5 are respectively composed of 3 convolutional layers connected in series, and each convolutional layer is connected with the Bounding-box layer.
The traditional VGG-16 convolutional neural network consists of, in sequence, 2 convolutional layers, a pooling layer, 2 convolutional layers, a pooling layer, and then three groups of 3 convolutional layers each followed by a pooling layer, ending in 3 fully connected layers and a Softmax layer. On the basis of traditional VGG-16, the improved model replaces the 3 fully connected layers with 2 convolutional layers and a Bounding-box layer, and connects every convolutional layer directly to the Bounding-box layer, forming a fully convolutional network whose per-layer parameters are adjusted through the Bounding-box layer. In addition, when the input image is large and more scenes need to be extracted, convolutional layers can be added before the Bounding-box layer accordingly.
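A PyTorch sketch of the modified network as described: five convolutional groups of 2, 2, 3, 3 and 3 layers, each followed by 2×2 pooling, then the 2 convolutional layers that replace the fully connected ones. The Bounding-box layer is the patent's own component; modelling it here as a 1×1 convolutional head, and the channel widths, are assumptions.

    import torch.nn as nn

    def conv_group(in_ch, out_ch, layers):
        block = []
        for i in range(layers):  # 3x3 kernels, as in the embodiment
            block += [nn.Conv2d(in_ch if i == 0 else out_ch, out_ch, 3, padding=1),
                      nn.ReLU(inplace=True)]
        block.append(nn.MaxPool2d(2))  # 2x2 pooling after each group
        return block

    trunk = nn.Sequential(
        *conv_group(3, 64, 2),     # convolutional layer group 1
        *conv_group(64, 128, 2),   # group 2
        *conv_group(128, 256, 3),  # group 3
        *conv_group(256, 512, 3),  # group 4
        *conv_group(512, 512, 3),  # group 5
        nn.Conv2d(512, 512, 3, padding=1), nn.ReLU(inplace=True),
        nn.Conv2d(512, 512, 3, padding=1), nn.ReLU(inplace=True),
    )
    bbox_head = nn.Conv2d(512, 4, 1)  # assumed stand-in for the Bounding-box layer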
Further, the image words of step S4 can instead be obtained from a single input word A: the corpus processing mechanism obtains subsequent associated words by association calculation from word A, and word A together with the subsequent associated words forms a word string, which constitutes the poetry alternative words;
the method for computing a subsequent associated word is to find, from the corpus vector of the preceding word, the word in the corpus vector library with the highest matching degree, where the matching degree is the cosine similarity

cos(a, b) = (a · b) / (‖a‖ ‖b‖)

where a is the corpus vector of the preceding word and b is the corpus vector of any word in the corpus vector library; the word whose corpus vector b maximizes cos(a, b) is taken as the next word.
A word A is input; the corpus processing mechanism computes the word B with the highest matching degree to A in the corpus vector library, then the word C with the highest matching degree to B, and so on, finally obtaining a string of matched words that is input into the LSTM network model to obtain the poem. This approach requires only one cue word; all subsequent content is matched and computed entirely by the machine.
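A NumPy sketch of this association chain: starting from one cue word, repeatedly pick the not-yet-used character whose corpus vector is most cosine-similar to the previous one. The `library` dict mapping characters to corpus vectors is an assumed data structure.

    import numpy as np

    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

    def word_string(cue, library, length=8):
        chain = [cue]
        for _ in range(length - 1):
            prev = library[chain[-1]]
            nxt = max((w for w in library if w not in chain),
                      key=lambda w: cos(prev, library[w]))
            chain.append(nxt)
        return chain  # e.g. word_string("月", library) -> 8 associated characters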
Described further, after the ancient poems are input in step S1, characters and words with the same or similar meanings are grouped into classes to build an image word table, and the poetry alternative words of step S4 comprise the input image words together with the same- or similar-meaning words from the image word table.
Because Chinese characters are polysemous and have near-synonyms, the words describing the same thing may differ from poem to poem. The image word table is designed to merge words with the same or similar meaning into one class; when a poem formed from the input words lacks aesthetic appeal, a word can be adjusted accordingly, the replacement being chosen from among the words of the same or similar meaning.
Further, the poetic form for automatic composition is the seven-character regulated poem (qilü), and the level-and-oblique rule is one of the standard tonal templates of that form, prescribing for each of the 56 character positions whether the character must carry a level tone (平), must carry an oblique tone (仄), or may carry either (中);
where "level" indicates that the tone of the character is level and "oblique" that it is oblique;
the application of the level-and-oblique rule in step S6 is as follows: the poetry screening mechanism compares the poetry draft with the level-and-oblique rule character by character; where a character does not conform, it is replaced by a character of the same or similar meaning from the image word table, and the comparison is repeated until the poetry draft conforms to the level-and-oblique rule completely.
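A sketch of this screening loop under stated assumptions: the tonal template is a list with one entry per character position ('level', 'oblique' or 'any'), `tone_of` maps a character to its tone class, and `synonyms` stands in for the image word table grouping same- or similar-meaning characters.

    def enforce_meter(draft, template, tone_of, synonyms):
        poem = list(draft)
        for i, required in enumerate(template):
            if required != "any" and tone_of(poem[i]) != required:
                # try same- or similar-meaning replacements until one conforms
                for alt in synonyms.get(poem[i], []):
                    if tone_of(alt) == required:
                        poem[i] = alt
                        break
        return "".join(poem)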
The invention has the beneficial effects that:
1. The recurrent neural network is modelled on the features of the human brain and the connections between neurons, and its learning of natural language closely resembles human learning of natural language; therefore, after the LSTM network model is introduced and trained on a big-data corpus, the machine obtains a good generative model and can handle the logic, metre and imagery relationships of poetry.
2. The convolutional neural network is outstanding at object recognition; it can extract most of the required scene features and so provides rich keywords and image themes for poetry creation.
3. Because the word vectors are computed from the word frequencies and co-occurrences of the poetry corpus, and the co-occurrence of words reflects the relationships between them, the cosine computed between word vectors reflects how close those relationships are; it can therefore be used for selecting rhyming characters, level/oblique replacement, word-cloud expansion and the like, and, combined with a classification table of ancient poetic vocabulary, is convenient and fast to apply.
4. The image word table can be used in the keyword input step of machine generation, and expansion through the image word table overcomes the inconsistent themes and random topic jumps seen in most machine poetry systems.
5. The invention uses the word-string technique to let the machine simulate the human way of thinking, a beneficial practice in cognitive engineering that realizes, to a certain extent, the artistic creative intelligence of the human writing mechanism in the machine poetry task.
Drawings
FIG. 1 is a block diagram of the system architecture of the present invention;
FIG. 2 is a schematic structural diagram of an LSTM network model of an embodiment;
FIG. 3 is a flow chart of a method of the present invention;
fig. 4 is a detailed flowchart of step S1;
FIG. 5 is a schematic diagram of the Huffman tree of an embodiment;
FIG. 6 is a schematic diagram of the improved VGG-16 convolutional neural network model structure of the present invention;
FIG. 7 is a schematic structural diagram of an improved VGG-16 convolutional neural network model of an embodiment.
Detailed Description
The invention is described in further detail below with reference to the figures and the embodiments.
As shown in fig. 1, an automatic poetry making system based on ancient poetry corpus vectorization comprises a corpus processing mechanism, a corpus vector library, an LSTM network model and a poetry screening mechanism;
the corpus processing mechanism is connected with the corpus vector library in a bidirectional mode, and the corpus processing mechanism, the LSTM network model and the poetry screening mechanism are connected in sequence.
The LSTM network model in this embodiment is preferably a network model composed of two LSTM layers in series, as shown in fig. 2, where the two dotted boxes in the upper and lower parts of the figure each represent one LSTM layer, each A_{i,j} in the figure represents a neuron, X_1 and X_2 are the corpus vectors of two consecutively connected characters in an ancient poem, and the output h is the connection relation of the two characters; that is, each time one two-character word is input, the LSTM network model learns the relationships within that word;
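How such two-character inputs might be assembled from a poem is sketched below; treating the following character as the prediction target is an assumption about the training objective, and the data layout is illustrative only.

    def bigram_pairs(poem, vectors):
        """poem: a string of characters; vectors: dict char -> corpus vector."""
        samples = []
        for i in range(len(poem) - 2):
            x = (vectors[poem[i]], vectors[poem[i + 1]])  # X1, X2
            target = poem[i + 2]                          # assumed target
            samples.append((x, target))
        return samples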
preferably, the optimizer of the LSTM network model is stochastic gradient descent and the loss function is cross entropy; the model discards 20% of the total data after calculation (a dropout rate of 0.2), the learning rate is 0.01, and the number of iterations is 700.
As shown in fig. 3, an automatic poetry method based on ancient poetry corpus vectorization adopts the following steps:
s1, inputting ancient poems to a corpus processing mechanism, wherein the corpus processing mechanism converts the characters of the ancient poems into corpus vectors and stores the corpus vectors into a corpus vector library;
s2, building an LSTM network model;
s3, inputting the corpus training set to the LSTM network model to complete the training of the LSTM network model;
s4, image words are input to a corpus processing mechanism, and the corpus processing mechanism calculates to obtain poetry alternative words according to the corpus vectors corresponding to each image word in a corpus vector library;
s5, the corpus processing mechanism inputs the poetry alternative words into the LSTM network model to obtain poetry draft;
S6, the poetry screening mechanism selects, from the poetry drafts and according to the rhyme and tonal rules of the poetic form, the poem that best conforms to the rules of poetry, obtaining the finalized poem, which is the automatically composed poetry result.
The specific content of step S1 is as shown in fig. 4:
S1.1, ancient poems are input to the corpus processing mechanism, which segments every character appearing in the ancient poems and records them as m distinct, non-repeating characters, a character that appears more than once being recorded as the same non-repeating character;
S1.2, for each non-repeating character, its frequency of occurrence and the characters adjacent to it in context in each poem are counted;
S1.3, the corpus processing mechanism assigns each non-repeating character a random n-dimensional vector; this n-dimensional vector is the corpus vector of that character and is stored in the corpus vector library, where n ∈ [180, 220] and n is an integer, n preferably being 200;
S1.4, a Huffman tree is constructed comprising end nodes and intermediate nodes: each end node is a child of an intermediate node and each intermediate node has exactly 2 child nodes; each end node points to the corpus vector of one non-repeating character in the corpus vector library and records as its node value the number of occurrences of that character; each intermediate node records as its node value the sum of the node values of its children; end nodes with larger node values lie closer to the root node, and the root node is the intermediate node with the largest node value;
Preferably, this embodiment selects lines from two seven-character quatrains: from Li Bai's "Viewing the Waterfall at Mount Lu", the couplet 飞流直下三千尺，疑是银河落九天 ("Its torrent dashes straight down three thousand feet, as if the Silver River were falling from the ninth heaven"), and from Du Fu's "Jueju", the couplet 窗含西岭千秋雪，门泊东吴万里船 ("The window frames the thousand-autumn snow of the Western Ridge; at the gate moors a boat from Eastern Wu, ten thousand li away"). A Huffman tree is built from them: the character 千 ("thousand") appears 2 times while every other character appears only once, so the end node of 千 lies closer to the root node than the other end nodes; its node value is 2 and all the others are 1, finally forming the Huffman tree shown in fig. 5.
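This construction can be sketched with a binary heap, using the character frequencies of the two couplets; the tie-breaking counter is an implementation detail to keep heap comparisons well defined.

    import heapq
    from collections import Counter
    from itertools import count

    def build_huffman(text):
        counts = Counter(text)  # here: 千 -> 2, every other character -> 1
        tie = count()           # avoids comparing node structures on ties
        heap = [(freq, next(tie), ch) for ch, freq in counts.items()]
        heapq.heapify(heap)
        while len(heap) > 1:
            fa, _, a = heapq.heappop(heap)  # two smallest node values
            fb, _, b = heapq.heappop(heap)
            heapq.heappush(heap, (fa + fb, next(tie), (a, b)))  # intermediate node
        return heap[0]          # the root carries the largest node value

    root = build_huffman("飞流直下三千尺疑是银河落九天窗含西岭千秋雪门泊东吴万里船")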
S1.5, the selection probability, on the Huffman tree, of the characters adjacent in context to a corpus vector x is:

p(context|x) = ∏_i p_i

where p_i is the probability of selecting the first child at the i-th intermediate node on the path through the Huffman tree; writing d_i = 0 when the path takes the first child at node i and d_i = 1 otherwise,

p_i = σ(xᵀθ_i)^(1−d_i) · (1 − σ(xᵀθ_i))^(d_i),  with σ(z) = 1/(1 + e^(−z)),

where x is the corpus vector input at the node and θ_i is the weight of the corpus vector at the i-th intermediate node;

S1.6, using the gradient method (descent on the negative log-likelihood, equivalently ascent on L = Σ_i log p_i), partial derivatives are computed with respect to θ_i and x in turn:

first, the partial derivative with respect to θ_i, with the corresponding update at learning rate η:

∂L/∂θ_i = (1 − d_i − σ(xᵀθ_i)) · x,  θ_i ← θ_i + η(1 − d_i − σ(xᵀθ_i)) · x;

after each new θ_i has been updated into p(context|x), the partial derivative with respect to x is computed:

∂L/∂x = Σ_i (1 − d_i − σ(xᵀθ_i)) · θ_i,  x ← x + η Σ_i (1 − d_i − σ(xᵀθ_i)) · θ_i,

and the new x is stored back into the corpus vector library;

S1.7, an un-updated corpus vector x is reselected and the procedure returns to step S1.5 until every corpus vector x in the corpus vector library has been updated once, yielding the new corpus vector library.
The corpus training set adopted in this embodiment is a set composed of 80% of the corpus vectors in the corpus vector library, ordered according to the word order of the corresponding ancient poems;
the corpus training set is divided in a 9:1 ratio into a training corpus and a verification corpus, where the training corpus is used to train the LSTM network model and adjust its parameter settings, and the verification corpus is used to verify and calibrate the model after training and adjustment.
In this embodiment, poetry is composed from an input image; that is, the image words of step S4 are obtained by inputting an image into the image feature extraction model, as follows:
S4.1, an image is input to the image feature extraction model, which extracts image words from the image;
S4.2, the corpus processing mechanism matches each image word, one by one, to the corresponding corpus vector in the corpus vector library; these corpus vectors are the poetry alternative words.
As shown in fig. 6, the image feature extraction model is an improved VGG-16 convolutional neural network model, and includes a convolutional layer group 1, a pooling layer (Pool), a convolutional layer group 2, a pooling layer, a convolutional layer group 3, a pooling layer, a convolutional layer group 4, a pooling layer, a convolutional layer group 5, a pooling layer, 2 convolutional layers, a Bounding-box layer, and a Softmax layer, which are connected in sequence, where the convolutional layer group 1 and the convolutional layer group 2 are each composed of 2 convolutional layers (Conv) connected in series, the convolutional layer group 3, the convolutional layer group 4, and the convolutional layer group 5 are each composed of 3 convolutional layers connected in series, and each convolutional layer is connected to the Bounding-box layer.
The preferred improved VGG-16 convolutional neural network model of this embodiment has the structure shown in fig. 7. The dotted portion of the figure is the convolutional trunk of the traditional VGG-16 structure, i.e., in sequence, 2 convolutional layers, a pooling layer, 2 convolutional layers, a pooling layer, 3 convolutional layers, a pooling layer, 3 convolutional layers, a pooling layer, 3 convolutional layers and a pooling layer; this is followed by 6 convolutional layers in sequence and finally a Bounding-box layer and a Softmax layer. Compared with the structure of fig. 6, the structure of fig. 7 adds 4 convolutional layers before the Bounding-box layer in order to obtain more image features. The convolution kernel of every convolutional layer is 3×3 and every pooling layer is 2×2.
Example two: the image words of step S4 are obtained from a single input word A: the corpus processing mechanism obtains subsequent associated words by association calculation from word A, and word A together with the subsequent associated words forms a word string, which constitutes the poetry alternative words;
the method for computing a subsequent associated word is to find, from the corpus vector of the preceding word, the word in the corpus vector library with the highest matching degree, where the matching degree is the cosine similarity

cos(a, b) = (a · b) / (‖a‖ ‖b‖)

where a is the corpus vector of the preceding word and b is the corpus vector of any word in the corpus vector library; the word whose corpus vector b maximizes cos(a, b) is taken as the next word.
After the ancient poems are input in step S1, words with the same or similar meanings are grouped into classes to build an image word table, and the poetry alternative words of step S4 comprise the input image words together with the same- or similar-meaning words from the image word table.
The poetic form for automatic composition is the seven-character regulated poem (qilü), and the level-and-oblique rule is one of the standard tonal templates of that form, prescribing for each of the 56 character positions whether the character must carry a level tone (平), must carry an oblique tone (仄), or may carry either (中);
where "level" indicates that the tone of the character is level and "oblique" that it is oblique;
the application of the level-and-oblique rule in step S6 is as follows: the poetry screening mechanism compares the poetry draft with the level-and-oblique rule character by character; where a character does not conform, it is replaced by a character of the same or similar meaning from the image word table, and the comparison is repeated until the poetry draft conforms to the level-and-oblique rule completely.

Claims (6)

1. An automatic poetry method based on ancient poetry corpus vectorization is characterized by comprising the following steps:
s1, inputting ancient poems to a corpus processing mechanism, wherein the corpus processing mechanism converts the characters of the ancient poems into corpus vectors and stores the corpus vectors into a corpus vector library;
s2, building an LSTM network model;
s3, inputting the corpus training set to the LSTM network model to complete the training of the LSTM network model;
s4, image words are input to a corpus processing mechanism, and the corpus processing mechanism calculates to obtain poetry alternative words according to the corpus vectors corresponding to each image word in a corpus vector library;
s5, the corpus processing mechanism inputs the poetry alternative words into the LSTM network model to obtain poetry draft;
S6, the poetry screening mechanism selects, from the poetry drafts and according to the rhyme and tonal rules of the poetic form, the poems that best conform to the rules of poetry, obtaining the finalized poems, which are the automatic poetry composition results;
the specific content of step S1 is as follows:
S1.1, ancient poems are input to the corpus processing mechanism, which segments every character appearing in the ancient poems and records them as m distinct, non-repeating characters, a character that appears more than once being recorded as the same non-repeating character;
S1.2, for each non-repeating character, its frequency of occurrence and the characters adjacent to it in context in each poem are counted;
S1.3, the corpus processing mechanism assigns each non-repeating character a random n-dimensional vector; this n-dimensional vector is the corpus vector of that character and is stored correspondingly in the corpus vector library, where n ∈ [180, 220] and n is an integer;
S1.4, a Huffman tree is constructed comprising end nodes and intermediate nodes: each end node is a child of an intermediate node and each intermediate node has exactly 2 child nodes; each end node points to the corpus vector of one non-repeating character in the corpus vector library and records as its node value the number of occurrences of that character; each intermediate node records as its node value the sum of the node values of its children; end nodes with larger node values lie closer to the root node, and the root node is the intermediate node with the largest node value;
S1.5, the characters adjacent in context to any corpus vector x are selected on the Huffman tree with the following selection probability:

p(context|x) = ∏_i p_i

where p_i is the probability of selecting the first child at the i-th intermediate node in the Huffman tree; writing d_i = 0 when the path takes the first child at node i and d_i = 1 otherwise,

p_i = σ(xᵀθ_i)^(1−d_i) · (1 − σ(xᵀθ_i))^(d_i),  with σ(z) = 1/(1 + e^(−z)),

where x is the corpus vector input at the node and θ_i is the weight of the corpus vector at the i-th intermediate node;
S1.6, using the gradient method (descent on the negative log-likelihood, equivalently ascent on L = Σ_i log p_i), partial derivatives are computed with respect to θ_i and x:

first, the partial derivative with respect to θ_i, with the corresponding update at learning rate η:

∂L/∂θ_i = (1 − d_i − σ(xᵀθ_i)) · x,  θ_i ← θ_i + η(1 − d_i − σ(xᵀθ_i)) · x;

after each new θ_i has been updated into p(context|x), the partial derivative with respect to x is computed:

∂L/∂x = Σ_i (1 − d_i − σ(xᵀθ_i)) · θ_i,  x ← x + η Σ_i (1 − d_i − σ(xᵀθ_i)) · θ_i,

and the new x is updated into the corpus vector library correspondingly;
s1.7, reselecting an un-updated corpus vector x and returning to the step S1.5 until each corpus vector x in the corpus vector library is updated once, so as to obtain a new corpus vector library.
2. The automatic poetry method based on ancient poetry corpus vectorization as claimed in claim 1, characterized in that: the corpus training set is a set composed of 80% of the corpus vectors in the corpus vector library, ordered according to the word order of the corresponding ancient poems;
the corpus training set is divided in a 9:1 ratio into a training corpus and a verification corpus, where the training corpus is used to train the LSTM network model and adjust its parameter settings, and the verification corpus is used to verify and calibrate the model after training and adjustment.
3. The automatic poetry method based on ancient poetry corpus vectorization as claimed in claim 1, characterized in that: the image words of step S4 are obtained by inputting an image into the image feature extraction model, and the specific method is as follows:
s4.1, inputting an image to an image feature extraction model, wherein the image feature extraction model extracts image words from the image;
S4.2, the corpus processing mechanism matches each image word, one by one, to the corresponding corpus vector in the corpus vector library; these corpus vectors are the poetry alternative words.
4. The ancient poetry corpus vectorization-based automatic poetry method as claimed in claim 3, wherein: the image feature extraction model is an improved VGG-16 convolutional neural network model and comprises a convolutional layer group 1, a pooling layer, a convolutional layer group 2, a pooling layer, a convolutional layer group 3, a pooling layer, a convolutional layer group 4, a pooling layer, a convolutional layer group 5, a pooling layer, 2 convolutional layers, a Bounding-box layer and a Softmax layer which are sequentially connected, wherein the convolutional layer group 1 and the convolutional layer group 2 are composed of 2 convolutional layers which are connected in series, the convolutional layer group 3, the convolutional layer group 4 and the convolutional layer group 5 are composed of 3 convolutional layers which are connected in series, and each convolutional layer is connected with the Bounding-box layer.
5. The automatic poetry method based on ancient poetry corpus vectorization as claimed in claim 1, characterized in that: the image words of step S4 are obtained from an input word A, the corpus processing mechanism obtains subsequent associated words by association calculation from word A, and word A together with the subsequent associated words forms a word string, which constitutes the poetry alternative words;
the method for computing a subsequent associated word is to find, from the corpus vector of the preceding word, the word in the corpus vector library with the highest matching degree, where the matching degree is the cosine similarity

cos(a, b) = (a · b) / (‖a‖ ‖b‖)

where a is the corpus vector of the preceding word and b is the corpus vector of any word in the corpus vector library; the word whose corpus vector b maximizes cos(a, b) is taken as the next word.
6. The automatic poetry method based on ancient poetry corpus vectorization as claimed in claim 1, characterized in that: after the ancient poems are input in step S1, words with the same or similar meanings are grouped into classes to build an image word table, and the poetry alternative words of step S4 comprise the input image words together with the same- or similar-meaning words from the image word table.
CN201810817519.7A 2018-07-24 2018-07-24 Automatic poetry making system and method based on ancient poetry corpus vectorization Active CN109086270B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810817519.7A CN109086270B (en) 2018-07-24 2018-07-24 Automatic poetry making system and method based on ancient poetry corpus vectorization

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810817519.7A CN109086270B (en) 2018-07-24 2018-07-24 Automatic poetry making system and method based on ancient poetry corpus vectorization

Publications (2)

Publication Number Publication Date
CN109086270A CN109086270A (en) 2018-12-25
CN109086270B true CN109086270B (en) 2022-03-01

Family

ID=64838256

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810817519.7A Active CN109086270B (en) 2018-07-24 2018-07-24 Automatic poetry making system and method based on ancient poetry corpus vectorization

Country Status (1)

Country Link
CN (1) CN109086270B (en)

Families Citing this family (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110309510B (en) * 2019-07-02 2023-05-12 中国计量大学 C-S and GRU-based painting and calligraphy observation method
CN111814488A (en) * 2020-07-22 2020-10-23 网易(杭州)网络有限公司 Poetry generation method and device, electronic equipment and readable storage medium
CN112101006A (en) * 2020-09-14 2020-12-18 中国平安人寿保险股份有限公司 Poetry generation method and device, computer equipment and storage medium
CN112257775B (en) * 2020-10-21 2022-11-15 东南大学 Poetry method by graph based on convolutional neural network and unsupervised language model
CN112434145A (en) * 2020-11-25 2021-03-02 天津大学 Picture-viewing poetry method based on image recognition and natural language processing
CN112883710A (en) * 2021-01-13 2021-06-01 戴宇航 Method for optimizing poems authored by user
CN113051877B (en) * 2021-03-11 2023-06-16 杨虡 Text content generation method and device, electronic equipment and storage medium
CN113553822B (en) * 2021-07-30 2023-06-30 网易(杭州)网络有限公司 Ancient poetry generating model training, ancient poetry generating method, equipment and storage medium
CN116070643B (en) * 2023-04-03 2023-08-15 武昌理工学院 Fixed style translation method and system from ancient text to English

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1889366A (en) * 2006-07-13 2007-01-03 浙江大学 Hafman decoding method
CN104951554A (en) * 2015-06-29 2015-09-30 浙江大学 Method for matching landscape with verses according with artistic conception of landscape
CN105930318A (en) * 2016-04-11 2016-09-07 深圳大学 Word vector training method and system
CN105955964A (en) * 2016-06-13 2016-09-21 北京百度网讯科技有限公司 Method and apparatus for automatically generating poem
CN106569995A (en) * 2016-09-26 2017-04-19 天津大学 Method for automatically generating Chinese poetry based on corpus and metrical rule
CN107102981A (en) * 2016-02-19 2017-08-29 腾讯科技(深圳)有限公司 Term vector generation method and device
CN107291693A (en) * 2017-06-15 2017-10-24 广州赫炎大数据科技有限公司 A kind of semantic computation method for improving term vector model
CN107480132A (en) * 2017-07-25 2017-12-15 浙江工业大学 A kind of classic poetry generation method of image content-based
CN107832292A (en) * 2017-11-02 2018-03-23 合肥工业大学 A kind of conversion method based on the image of neural network model to Chinese ancient poetry

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10380983B2 (en) * 2016-12-30 2019-08-13 Google Llc Machine learning to generate music from text

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1889366A (en) * 2006-07-13 2007-01-03 浙江大学 Hafman decoding method
CN104951554A (en) * 2015-06-29 2015-09-30 浙江大学 Method for matching landscape with verses according with artistic conception of landscape
CN107102981A (en) * 2016-02-19 2017-08-29 腾讯科技(深圳)有限公司 Term vector generation method and device
CN105930318A (en) * 2016-04-11 2016-09-07 深圳大学 Word vector training method and system
CN105955964A (en) * 2016-06-13 2016-09-21 北京百度网讯科技有限公司 Method and apparatus for automatically generating poem
CN106569995A (en) * 2016-09-26 2017-04-19 天津大学 Method for automatically generating Chinese poetry based on corpus and metrical rule
CN107291693A (en) * 2017-06-15 2017-10-24 广州赫炎大数据科技有限公司 A kind of semantic computation method for improving term vector model
CN107480132A (en) * 2017-07-25 2017-12-15 浙江工业大学 A kind of classic poetry generation method of image content-based
CN107832292A (en) * 2017-11-02 2018-03-23 合肥工业大学 A kind of conversion method based on the image of neural network model to Chinese ancient poetry

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
Chinese Song Iambics Generation with Neural Attention-based Model; Qixin Wang et al.; arXiv:1604.06274v2 [cs.CL]; 2016-06-21; pp. 1-7 *
Evaluation of Word Vector Representations by Subspace Alignment; Yulia Tsvetkov et al.; Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing; 2015-09-21; pp. 2049-2054 *
中国古典诗词楹联的计算化研究 (Computational research on Chinese classical poetry and couplets); 周昌乐 (Zhou Changle) et al.; 《心智与计算》(Mind and Computation); Dec. 2012; vol. 6, no. 2; pp. 75-82 *
基于统计抽词和格律的全宋词切分语料库建立 (Building a segmented corpus of the Complete Song Lyrics based on statistical word extraction and metrical rules); 苏劲松 (Su Jinsong) et al.; 《中文信息学报》(Journal of Chinese Information Processing); Mar. 2007; vol. 21, no. 2; pp. 52-57 *

Also Published As

Publication number Publication date
CN109086270A (en) 2018-12-25

Similar Documents

Publication Publication Date Title
CN109086270B (en) Automatic poetry making system and method based on ancient poetry corpus vectorization
CN108614875B (en) Chinese emotion tendency classification method based on global average pooling convolutional neural network
Ushiku et al. Common subspace for model and similarity: Phrase learning for caption generation from images
CN107273913B (en) Short text similarity calculation method based on multi-feature fusion
CN108153864A (en) Method based on neural network generation text snippet
CN111858932A (en) Multiple-feature Chinese and English emotion classification method and system based on Transformer
CN110609849B (en) Natural language generation method based on SQL syntax tree node type
CN111125333B (en) Generation type knowledge question-answering method based on expression learning and multi-layer covering mechanism
CN110825850B (en) Natural language theme classification method and device
CN110765755A (en) Semantic similarity feature extraction method based on double selection gates
CN113779220A (en) Mongolian multi-hop question-answering method based on three-channel cognitive map and graph attention network
CN107679225A (en) A kind of reply generation method based on keyword
Huang et al. C-Rnn: a fine-grained language model for image captioning
CN113378547A (en) GCN-based Chinese compound sentence implicit relation analysis method and device
CN113344036A (en) Image description method of multi-mode Transformer based on dynamic word embedding
CN114254645A (en) Artificial intelligence auxiliary writing system
CN112257775A (en) Poetry method by graph based on convolutional neural network and unsupervised language model
Wang et al. A text-guided generation and refinement model for image captioning
Poghosyan et al. Short-term memory with read-only unit in neural image caption generator
CN112464673B (en) Language meaning understanding method for fusing meaning original information
CN113392629B (en) Human-term pronoun resolution method based on pre-training model
CN115510230A (en) Mongolian emotion analysis method based on multi-dimensional feature fusion and comparative reinforcement learning mechanism
CN114972907A (en) Image semantic understanding and text generation based on reinforcement learning and contrast learning
CN113065324A (en) Text generation method and device based on structured triples and anchor templates
CN111292741A (en) Intelligent voice interaction robot

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant