Method for automatically generating novel text emotion curve and predicting recommendation
Technical Field
The invention belongs to the field of sentiment analysis in computer natural language processing and relates to a method for automatically generating an emotion curve of a novel text and giving predictions and recommendations.
Background
Psychological studies have shown that people tend to prefer stories with familiar patterns and to dislike story lines that run contrary to their own experience. Kurt Vonnegut held that the emotional curve of a story is the core embodiment of a novel's reading value; good novels tend to have similar patterns of emotional variation. To better analyze the emotional changes in a novel text, an emotion curve of the text needs to be generated and comparative analysis carried out.
The problem of emotional curve generation for novel text is still in the exploration phase. Although there are various emotion analysis evaluation methods for paragraphs and short texts, a relatively general combination of text sampling and emotion dictionary mapping is basically used for the task of generating emotion curves, i.e., time series of emotion scores.
Prior related work has not addressed how comparing the emotion curves of novels can further assist text analysis; it analyzes the emotion curves only qualitatively and, for convenience, applies methods such as Manhattan distance to fixed-length curves (that is, time series with the same time resolution). In fact, data of this kind, in which each series has its own "time axis", calls for methods from time series analysis.
There is traditionally no uniform, satisfactory method for regression analysis based on distances, especially distances between curves, which differ from distances in ordinary Euclidean space. While such distance measures may better reflect the relationship between time series, some of them do not satisfy the triangle inequality, so traditional machine learning methods cannot easily be applied to them.
Disclosure of Invention
The purpose of the invention is as follows: the invention mainly aims at the problem that the overall emotion change characteristics of a novel text are not considered in the existing novel text analysis, and provides a method which can comprehensively examine the emotion change similarities and differences among different texts and can give prediction and recommendation of relevant statistics of the novel through a machine learning process.
To solve this technical problem, the invention discloses a method for automatically generating a novel text emotion curve and predicting recommendation. All steps of the method run on a Windows platform; emotion curves are generated for a data set of novel texts from Project Gutenberg (www.gutenberg.org), and download-count predictions and recommendations are given.
The Python spaCy toolkit (spacy.io) used in the present invention is an open-source toolkit for natural language processing written by the "Explosion AI" organization (twitter.
The labMT emotion vocabulary (neo.imm.dtu.dk/wiki/LabMT) used in the present invention is supplementary material provided by Peter Sheridan Dodds et al. with their paper (arxiv.org/abs/1101.5120v3). The labMT emotion vocabulary is drawn from a wide range of data sets, and public emotion scores for its main words were obtained through a crowdsourcing service; each word received more than 50 independent evaluations, so the word emotion scores of the labMT emotion dictionary are broad-based and objective.
The specific implementation steps of the conventional Gaussian process used in the present invention are prior art and are described in detail in the book Gaussian Processes for Machine Learning (MIT Press, 2006) by C. E. Rasmussen et al.
The technique for calculating the dynamic time warping distance between two time series is existing technology from the field of time series analysis, but it has seldom been used in traditional natural language processing. The invention introduces it to the new problem of comparing the emotion curves of novel texts and, in the subsequent application steps, corrects the distance matrix it produces so as to meet the requirements of the subsequent model.
The method mainly comprises the following steps:
Step 1, generating an emotion curve of each novel from the novel text.
Step 2, calculating the dynamic time warping distance matrix between every two emotion curves obtained in Step 1.
Step 3, predicting the download count through an improved Gaussian process, using the dynamic time warping distance matrix obtained in Step 2.
Step 4, using the dynamic time warping distances obtained in Step 2, sorting the corresponding novel texts by distance in ascending order and outputting the titles of the closest novels as recommendations.
The step 1 of the invention comprises the following steps:
Step 1-1, segmenting the training texts and the target text of the novels with the Python natural language processing toolkit spaCy, and removing elements that do not count toward the effective words of the text, such as punctuation marks and personal titles (Mr, Mrs, and the like), to obtain the word list of each text.
Step 1-2, dividing the word list of the text into word windows in sequence and calculating the average emotion score of each word window in turn.
Step 1-3, arranging the emotion scores obtained in Step 1-2 in order to form a time series of emotion scores, and calculating the moving average sequence of that time series. The resulting moving average sequence is used as the emotion curve of the novel.
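The smoothing in Step 1-3 can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name `moving_average`, the `half_width` parameter, and the choice to shrink the averaging neighbourhood at the two ends of the series are all assumptions, since the text does not state the width of the moving average or its boundary handling.

```python
def moving_average(scores, half_width=1):
    """Centered moving average of a list of window emotion scores.

    half_width is a hypothetical parameter: each point is replaced by the
    mean of itself and up to half_width neighbours on each side. Near the
    ends of the series the neighbourhood is simply truncated.
    """
    smoothed = []
    for i in range(len(scores)):
        lo = max(0, i - half_width)
        hi = min(len(scores), i + half_width + 1)
        neighbourhood = scores[lo:hi]
        smoothed.append(sum(neighbourhood) / len(neighbourhood))
    return smoothed


# Toy emotion scores for four consecutive text windows (made-up values).
window_scores = [5.0, 6.0, 7.0, 4.0]
curve = moving_average(window_scores, half_width=1)
print(curve)
```

The output has the same length as the input, so the smoothed curve keeps one point per text window.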
The steps 1-2 of the invention comprise the following steps:
Step 1-2-1, dividing the word list of the text equally into text windows according to the word window size $N_w$.
Step 1-2-2, obtaining an emotion score mapping table for common words from the labMT emotion vocabulary, in the form of a mapping function $h_{avg}(w)$ from words to emotion scores.
Step 1-2-3, counting the words in each text window that appear in the emotion score mapping table, together with their frequencies;
Step 1-2-4, calculating the emotion score $h_{avg}(T)$ of each text window $T$ by the following formula:
$$h_{avg}(T) = \frac{\sum_{i=1}^{N} h_{avg}(w_i)\, f_i(T)}{\sum_{i=1}^{N} f_i(T)}$$
where $w_1, w_2, \ldots, w_N$ are the words in the window that appear in the emotion score mapping table, $N$ is the total number of such words, $h_{avg}(w_i)$ is the emotion score of the $i$-th word $w_i$, $f_i(T)$ is the frequency of $w_i$ in the text window $T$, and $i$ ranges from 1 to $N$.
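The window score of Step 1-2-4 is a frequency-weighted average of word scores, which can be sketched as below. The dictionary `toy_h_avg` stands in for the labMT mapping table and its values are invented for illustration; they are not real labMT scores. Returning `None` for a window with no scored words is likewise an assumption, as the text does not cover that case.

```python
from collections import Counter


def window_score(window_words, h_avg):
    """Frequency-weighted mean emotion score of one text window.

    Only words present in the mapping table h_avg contribute; a window
    without any such word has no defined score and yields None.
    """
    counts = Counter(w for w in window_words if w in h_avg)
    total = sum(counts.values())
    if total == 0:
        return None
    return sum(h_avg[w] * f for w, f in counts.items()) / total


# Hypothetical mapping table (illustrative values only).
toy_h_avg = {"happy": 8.0, "sad": 2.0, "door": 5.0}
window = ["happy", "happy", "sad", "the", "door"]
score = window_score(window, toy_h_avg)
print(score)
```

Words missing from the table, such as "the" in this toy window, are ignored in both the numerator and the denominator.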
The step 1-2-1 comprises the following steps:
Step 1-2-1-1, for the word list of a text and the size $N_w$ of the text windows to be generated, calculating the number of text windows to be divided, $l = L / N_w$, where $L$ is the total length of the word list of the text;
Step 1-2-1-2, calculating the start position $T_{bj}$ and the end position $T_{ej}$ of each text window according to the following formulas:
$T_{bj} = N_w \times j + 1$,
$T_{ej} = N_w \times (j + 1)$,
where $j = 1 \ldots l$;
Step 1-2-1-3, generating the segmented text windows in sequence according to the start position and end position of each text window in the word list.
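A possible reading of the window division is sketched below. Two points are assumptions rather than statements from the text: the number of windows is taken as the integer part of $L / N_w$, so trailing words that do not fill a window are dropped, and the index $j$ runs from 0 to $l - 1$ so that the first window starts at position 1 and the last one ends within the word list.

```python
def split_into_windows(words, n_w):
    """Cut a word list into consecutive, equally sized text windows."""
    l = len(words) // n_w          # number of complete windows (assumed floor)
    windows = []
    for j in range(l):             # assumed 0-based window index
        start = n_w * j + 1        # T_bj, 1-based position of the first word
        end = n_w * (j + 1)        # T_ej, 1-based position of the last word
        windows.append(words[start - 1:end])
    return windows


words = ["w1", "w2", "w3", "w4", "w5", "w6", "w7"]
windows = split_into_windows(words, 3)
print(windows)
```

With seven words and a window size of three, two full windows are produced and the seventh word is left out under the stated assumption.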
The step 2 comprises the following steps:
Step 2-1, for each pair of novel texts, selecting in turn the two emotion score time series of their emotion curves, $s_1 \ldots s_n$ and $t_1 \ldots t_m$, where $s_n$ denotes the $n$-th element of $s_1 \ldots s_n$, $t_m$ denotes the $m$-th element of $t_1 \ldots t_m$, $n$ and $m$ are natural numbers, and the window size is set to $w_1$;
Step 2-2, presetting a matrix DTW (short for dynamic time warping) indexed from $[0, 0]$ to $[n, m]$ and traversed from bottom to top and then from left to right, in which the value DTW$[0, 0]$ at the lower-left corner is 0 and all other values are positive infinity;
Step 2-3, examining the matrix elements at indexes $a$ and $b$ in sequence, from bottom to top and from left to right, where $a$ ranges from 1 to $n$ and $b$ ranges from 1 to $m$; the first row and the first column of the matrix are not considered, and an element is likewise not considered if the difference between $a$ and $b$ is larger than $w_1$. For each element considered, the minimum of its left, lower, and lower-left neighbors is taken, the distance between the corresponding time series elements $s_a$ and $t_b$ is added, and the value of the element under consideration is replaced by this new value;
Step 2-4, returning the value DTW$[n, m]$ at the upper-right corner of the DTW matrix as the dynamic time warping distance between the two target emotion score time series;
Step 2-5, repeating Steps 2-1 to 2-4 until the dynamic time warping distance between every two texts has been obtained, and arranging these distances into a dynamic time warping distance matrix.
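Steps 2-1 to 2-5 can be sketched as a windowed dynamic time warping routine. The element distance $|s_a - t_b|$ is an assumption, because the text only says that the corresponding elements are added. Widening the window to at least $|n - m|$, so that the corner DTW$[n, m]$ stays reachable for series of different lengths, is a common convention and is also an assumption here.

```python
import math


def dtw_distance(s, t, w1):
    """Windowed dynamic time warping distance between two score series."""
    n, m = len(s), len(t)
    w = max(w1, abs(n - m))  # assumed widening so that DTW[n][m] is reachable
    dtw = [[math.inf] * (m + 1) for _ in range(n + 1)]
    dtw[0][0] = 0.0
    for a in range(1, n + 1):
        # Only cells with |a - b| <= w are examined.
        for b in range(max(1, a - w), min(m, a + w) + 1):
            cost = abs(s[a - 1] - t[b - 1])  # assumed element distance
            dtw[a][b] = cost + min(dtw[a - 1][b],      # lower neighbour
                                   dtw[a][b - 1],      # left neighbour
                                   dtw[a - 1][b - 1])  # lower-left neighbour
    return dtw[n][m]


def dtw_matrix(curves, w1):
    """Symmetric matrix of pairwise DTW distances with a zero diagonal."""
    k = len(curves)
    dist = [[0.0] * k for _ in range(k)]
    for i in range(k):
        for j in range(i + 1, k):
            dist[i][j] = dist[j][i] = dtw_distance(curves[i], curves[j], w1)
    return dist


# Toy emotion curves of different lengths (made-up values).
curves = [[1.0, 2.0, 3.0], [1.0, 2.0, 2.0, 3.0], [2.0, 3.0, 4.0]]
D = dtw_matrix(curves, w1=1)
print(D)
```

Because the warping path may repeat elements, the first two toy curves align exactly even though their lengths differ.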
Step 3 of the invention comprises the following steps:
Step 3-1, taking the logarithm of the download counts of the training texts to obtain the log download counts $y$ of the training data.
Step 3-2, calculating the minimum eigenvalue $\lambda_{min}$ of the dynamic time warping distance matrix $K$ generated from the training texts.
Step 3-3, inputting the noise level $\sigma_n^2$ and correcting it using the $\lambda_{min}$ obtained in Step 3-2: if $\lambda_{min} > 0$, no change is made; if $\lambda_{min} < 0$, $-\lambda_{min}$ is added to the noise level $\sigma_n^2$.
Step 3-4, using the corrected noise level $\sigma_n^2$ in a conventional Gaussian process calculation to predict the download count of the novel.
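The noise correction of Steps 3-2 and 3-3 can be sketched in pure Python as follows. The eigenvalue routine is a textbook cyclic Jacobi iteration used here only to keep the example self-contained; it is a stand-in, and in practice a library routine such as `numpy.linalg.eigvalsh` would be the natural choice. The function names are hypothetical, and the case $\lambda_{min} = 0$, which the text does not cover, is left unchanged by assumption.

```python
import math


def min_eigenvalue(K, max_sweeps=50, tol=1e-18):
    """Smallest eigenvalue of a symmetric matrix via cyclic Jacobi rotations."""
    n = len(K)
    A = [[float(x) for x in row] for row in K]
    for _ in range(max_sweeps):
        off = sum(A[i][j] ** 2 for i in range(n) for j in range(n) if i != j)
        if off < tol:  # off-diagonal part has vanished: A is diagonal
            break
        for p in range(n - 1):
            for q in range(p + 1, n):
                if A[p][q] == 0.0:
                    continue
                theta = (A[q][q] - A[p][p]) / (2.0 * A[p][q])
                t = math.copysign(1.0, theta) / (
                    abs(theta) + math.sqrt(theta * theta + 1.0))
                c = 1.0 / math.sqrt(t * t + 1.0)
                s = t * c
                for k in range(n):  # apply the rotation to columns p and q
                    akp, akq = A[k][p], A[k][q]
                    A[k][p], A[k][q] = c * akp - s * akq, s * akp + c * akq
                for k in range(n):  # apply the rotation to rows p and q
                    apk, aqk = A[p][k], A[q][k]
                    A[p][k], A[q][k] = c * apk - s * aqk, s * apk + c * aqk
    return min(A[i][i] for i in range(n))


def corrected_noise(K, noise):
    """Add -lambda_min to the noise level when lambda_min is negative."""
    lam_min = min_eigenvalue(K)
    return noise - lam_min if lam_min < 0 else noise


# Toy distance matrix with eigenvalues -2 and 2, and a made-up noise level.
K = [[0.0, 2.0], [2.0, 0.0]]
sigma2 = corrected_noise(K, 0.1)
print(sigma2)
```

After the correction, the smallest eigenvalue of $K + \sigma_n^2 I$ equals the original noise level, so the matrix passed to the Cholesky decomposition is positive definite whenever that noise level is positive.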
The step 3-4 comprises the following steps:
Step 3-4-1, inputting the dynamic time warping distance matrix $K$, the log download counts $y$ of the novel texts (the target values of the training data), the corrected noise level $\sigma_n^2$, and the dynamic time warping distance matrix $k_*$ from the target to the training data;
Step 3-4-2, calculating the Cholesky decomposition matrix $L_1$ of the matrix $K + \sigma_n^2 I$, where $I$ denotes the identity matrix;
Step 3-4-3, calculating the coefficient matrix $\alpha$ for the kernel function $k_*$:
$\alpha = L_1^T \backslash (L_1 \backslash y)$,
where the operator $A \backslash B$ denotes the solution $X$ of the linear equation $AX = B$;
Step 3-4-4, calculating the target log download count $f_*$:
$f_* = k_*^T \alpha$,
where $f_*$ is the predicted (logarithmic) download count.
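Steps 3-4-1 to 3-4-4 follow the standard Cholesky-based Gaussian process mean prediction, which can be sketched for a single target text as below. The helper names are hypothetical, the toy numbers are invented, and the sketch assumes the noise level has already been corrected so that $K + \sigma_n^2 I$ is positive definite.

```python
import math


def cholesky(A):
    """Lower-triangular L with L L^T = A, for a positive definite A."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            acc = A[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = math.sqrt(acc) if i == j else acc / L[j][j]
    return L


def forward_sub(L, b):
    """Solve L x = b for lower-triangular L."""
    x = []
    for i in range(len(b)):
        x.append((b[i] - sum(L[i][k] * x[k] for k in range(i))) / L[i][i])
    return x


def backward_sub_transposed(L, b):
    """Solve L^T x = b for lower-triangular L."""
    n = len(b)
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (b[i] - sum(L[k][i] * x[k] for k in range(i + 1, n))) / L[i][i]
    return x


def gp_predict(K, y, noise, k_star):
    """Predicted log download count f_* = k_*^T alpha for one target text."""
    n = len(K)
    A = [[K[i][j] + (noise if i == j else 0.0) for j in range(n)]
         for i in range(n)]
    L1 = cholesky(A)
    alpha = backward_sub_transposed(L1, forward_sub(L1, y))  # L1^T \ (L1 \ y)
    return sum(ks * a for ks, a in zip(k_star, alpha))


# Toy inputs: two training novels, corrected noise level 2.1 (made-up values).
K = [[0.0, 2.0], [2.0, 0.0]]
y = [1.0, 3.0]
f_star = gp_predict(K, y, noise=2.1, k_star=[1.0, 1.5])
print(f_star)
```

Solving two triangular systems avoids forming the inverse of $K + \sigma_n^2 I$ explicitly, which is the usual reason for preferring the Cholesky route.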
By generating an emotion curve that adapts to the length of the text, the invention solves the information loss and redundancy that readily arise, owing to technical limitations and downstream purposes, when an emotion curve is generated by sampling the text. The curve generation method can therefore reflect the emotional changes of the novel text more accurately, and this accuracy can be verified in subsequent tasks.
In Step 3, the invention solves the problem of applying the dynamic time warping distance to the prediction of related statistics by means of a modified Gaussian process. The modification, although simple in appearance, is supported by strict theoretical proof and experimental verification. Theoretically, it can be proved that the correction provided in Step 3 ensures that the given matrix is positive definite, which guarantees that the kernel function is usable and resolves the positive-definiteness problem of applying the dynamic time warping distance to Gaussian processes and to kernel methods in general. In 1000 groups of simulation experiments, even for random data, the proportion of negative eigenvalues in the dynamic time warping distance matrix did not exceed 5%; where negative eigenvalues occurred, the ratio of the largest positive eigenvalue to the modulus of the most negative eigenvalue was always above 25. This means that the method ensures usability without greatly affecting the original distance characteristics. Moreover, the improvement fits directly into the framework of the original Gaussian process and is convenient to apply.
The invention pioneers the quantitative prediction of novel-related statistics by exploiting the relationship between the emotion curves of novel texts. Specifically, it describes the topological structure of the emotion curves by introducing the dynamic time warping distance from time series analysis, and it performs regression analysis on novel download counts with an improved Gaussian process.
The improved gaussian process used in the method disclosed by the present invention solves the problems of the prior art. The improvement is proved to be reasonable through theoretical verification and feasibility through experiments, is simple and easy to implement, and can be perfectly fused into an original Gaussian process framework.
Beneficial effects: by generating the emotion curve of a given novel text, the invention provides a useful reference for analyzing the trend of emotional change in the novel. The improved Gaussian process included in the disclosed method can accept a wider range of distance functions as kernel functions, which broadens the applicability of the Gaussian process and indirectly improves the accuracy of the related predictions. Predicting the download count of a target text from the relationship between the emotion curves of novel texts is an entirely new approach, and its predictions show a stronger positive correlation than those of traditional methods. Through the comparison of dynamic time warping distances, the invention also offers a new angle for relevance-based recommendation of novels.
Drawings
The foregoing and other advantages of the invention will become more apparent from the following detailed description of the invention when taken in conjunction with the accompanying drawings.
FIG. 1 is a flow chart of the present invention.
FIG. 2 is an illustration of the invention generating an emotion curve.
FIG. 3 is a prior art generation of an emotion curve.
Detailed Description
The invention is further explained below with reference to the drawings and the embodiments.
As shown in FIG. 1, the invention discloses a method for automatically generating an emotion curve of a novel text and giving a download-count prediction together with recommendations of the novels whose emotion curves are most similar. The method mainly comprises the following steps:
Step 11, segmenting the training texts and the target text of the novels with the Python natural language processing toolkit spaCy, removing punctuation marks by means of spaCy annotations, and removing abbreviated personal titles (Mr, Mrs, and the like) by text template matching, to obtain the word list of each text.
Step 12, dividing the word list of the text equally into text windows according to the text window size $N_w$.
Step 13, obtaining an emotion score mapping table for common words from the labMT emotion vocabulary, in the form of a mapping function $h_{avg}(w)$ from words to emotion scores.
Step 14, counting the words in each text window that appear in the emotion score mapping table, together with their frequencies.
Step 15, calculating the emotion score $h_{avg}(T)$ of each text window $T$, wherein the formula is as follows:
$$h_{avg}(T) = \frac{\sum_{i=1}^{N} h_{avg}(w_i)\, f_i(T)}{\sum_{i=1}^{N} f_i(T)}$$
where $w_1, w_2, \ldots, w_N$ are the words in the window that appear in the emotion score mapping table, $N$ is the total number of such words, $h_{avg}(w_i)$ is the emotion score of the word $w_i$, and $f_i(T)$ is the frequency of $w_i$ in the text window $T$.
Step 16, arranging the emotion scores of the windows in order to form a time series of emotion scores.
Step 17, calculating the moving average sequence of the emotion score time series obtained in Step 16, that is, replacing the emotion score at each point of the original time series with the average of the emotion scores of its adjacent points. The moving average sequence serves as the emotion curve of the novel.
Step 18, calculating the dynamic time warping distance matrix between every two emotion curves.
Step 19, taking the logarithm of the download counts of the training texts to obtain the target values $y$ of the training data.
Step 20, calculating the minimum eigenvalue $\lambda_{min}$ of the dynamic time warping distance matrix $K$ generated from the training texts.
Step 21, inputting the noise level $\sigma_n^2$ and correcting it using the $\lambda_{min}$ obtained in Step 20: if $\lambda_{min} > 0$, no change is made; if $\lambda_{min} < 0$, $-\lambda_{min}$ is added to the noise level $\sigma_n^2$.
Step 22, using the corrected noise level $\sigma_n^2$ in a conventional Gaussian process calculation to predict the download count of the novel.
Step 23, sorting the corresponding novel texts by distance in ascending order according to the dynamic time warping distance matrix, and outputting the titles of the closest novels as recommendations.
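The recommendation in Step 23 amounts to sorting candidate novels by their distance to the target. A minimal sketch follows; the function name `recommend`, the `top_k` parameter, and the placeholder titles are illustrative assumptions, since the text does not say how many titles are output.

```python
def recommend(titles, distances_to_target, top_k=3):
    """Return the top_k titles with the smallest DTW distance to the target."""
    ranked = sorted(zip(distances_to_target, titles))  # ascending by distance
    return [title for _, title in ranked[:top_k]]


# Placeholder titles and made-up distances from one target novel.
titles = ["Novel A", "Novel B", "Novel C", "Novel D"]
distances = [3.2, 0.4, 1.7, 5.0]
print(recommend(titles, distances, top_k=2))
```

The distances for one target correspond to a single row of the dynamic time warping distance matrix, with the target's own entry excluded beforehand.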
Step 12 of the present invention comprises the steps of:
Step 24: for the word list of a text and the size $N_w$ of the text windows to be generated, calculating the number of text windows to be divided, $l = L / N_w$, where $L$ is the total length of the word list of the text.
Step 25: calculating the start position $T_{bj}$ and the end position $T_{ej}$ of each text window according to the following formulas:
$T_{bj} = N_w \times j + 1$,
$T_{ej} = N_w \times (j + 1)$,
where $j = 1 \ldots l$;
Step 26: generating the segmented text windows in sequence according to the start position and end position of each window in the word list.
Step 18 of the present invention comprises the steps of:
Step 27, for each pair of novel texts, selecting in turn the two emotion score time series of their emotion curves, $s_1 \ldots s_n$ and $t_1 \ldots t_m$, where $s_n$ denotes the $n$-th element of $s_1 \ldots s_n$, $t_m$ denotes the $m$-th element of $t_1 \ldots t_m$, $n$ and $m$ are natural numbers, and the window size is set to $w_1$.
Step 28, presetting a matrix DTW (short for dynamic time warping) indexed from $[0, 0]$ to $[n, m]$ and traversed from bottom to top and then from left to right, in which the value DTW$[0, 0]$ at the lower-left corner is 0 and all other values are positive infinity.
Step 29, examining the matrix elements at indexes $a$ and $b$ in sequence, from bottom to top and from left to right, where $a$ ranges from 1 to $n$ and $b$ ranges from 1 to $m$; the first row and the first column of the matrix are not considered, and an element is likewise not considered if the difference between $a$ and $b$ is larger than $w_1$. For each element considered, the minimum of its left, lower, and lower-left neighbors is taken, the distance between the corresponding time series elements $s_a$ and $t_b$ is added, and the value of the element under consideration is replaced by this new value.
Step 30, returning the value DTW$[n, m]$ at the upper-right corner of the DTW matrix as the dynamic time warping distance between the two target emotion score time series.
Step 31, repeating Steps 27 to 30 until the dynamic time warping distance between every two texts has been obtained. The dynamic time warping distances are arranged into a dynamic time warping distance matrix.
The step 22 of the present invention comprises the steps of:
Step 32, inputting the dynamic time warping distance matrix $K$ (that is, the Gaussian process kernel function), the log download counts $y$ of the novel texts (the target values of the training data), the noise level $\sigma_n^2$, and the dynamic time warping distance matrix $k_*$ from the target to the training data.
Step 33, calculating the Cholesky decomposition matrix $L_1$ of the matrix $K + \sigma_n^2 I$, where $I$ denotes the identity matrix.
Step 34, calculating the coefficient matrix $\alpha$ for the kernel function $k_*$:
$\alpha = L_1^T \backslash (L_1 \backslash y)$,
where the operator $A \backslash B$ denotes the solution $X$ of the linear equation $AX = B$.
Step 35, calculating the target log download count $f_*$:
$f_* = k_*^T \alpha$.
Examples
The algorithm used by the invention is written and implemented entirely in the Python language. The experimental configuration was an Intel(R) Core(TM) i5-4200M processor with a clock frequency of 2.5 GHz and 4 GB of memory, running Python 3.5.3 from the Anaconda 3 distribution.
Experimental data were prepared as follows: 1729 English novel texts from Project Gutenberg, each with more than 10,000 words and more than 100 downloads per month; and novel-related statistics obtained from the Project Gutenberg website, including the title and the download count of each novel.
Example 1
The experiment in this embodiment for generating the emotion curve of a novel text is as follows:
11. Input the training text corpus and the test text corpus and preprocess them to obtain the text word lists.
12. Generate the emotion curve of each text from the word lists obtained in step 11, and generate a comparison emotion curve by the previous method.
Example 2
The experiment in this embodiment for predicting the download count by comparing the emotion curves of novel texts is as follows:
11. Input the training text corpus and the test text corpus and preprocess them to obtain the text word lists.
12. Generate the emotion curve of each text from the word lists obtained in step 11.
13. Calculate the dynamic time warping distance matrix of the emotion curves.
14. Obtain the log download count of the test text from the distance matrix and the improved Gaussian process.
Example 3
The experiment in this embodiment for recommending related texts by comparing the emotion curves of novel texts is as follows:
11. Input the training text corpus and the test text corpus and preprocess them to obtain the text word lists.
12. Generate the emotion curve of each text from the word lists obtained in step 11.
13. Calculate the dynamic time warping distance matrix of the emotion curves.
14. Sort the related texts by the dynamic time warping distance matrix and recommend them in ascending order of distance.
The invention aims to improve an emotion curve generation method of a novel text and make relevant prediction recommendation, and needs to provide a method capable of accurately reflecting emotion change characteristics of an original text and improving positive correlation of prediction downloading quantity. In order to verify the effectiveness of the invention, the invention is compared with the traditional method for generating the emotion curve and a plurality of traditional models.
The emotion curve generated by the invention is shown in FIG. 2 and the emotion curve generated by the conventional method in FIG. 3; in both figures the vertical axis represents the emotion score of the novel text and the horizontal axis represents the position of the corresponding sampling window in the text (that is, the time point of the emotion score time series). Both figures show the emotion curve of the novel Alice's Adventures in Wonderland. Although the invention uses fewer sampling windows (a lower time resolution), it better captures the emotional variation of the text. In Alice's Adventures in Wonderland, after about 80% of the text the story changes sharply: a great deal of negative emotion is expressed during the trial before the King and Queen, and calm returns once Alice wakes from the dream. The invention represents these emotional changes well, whereas the conventional method shows only a regression of the text's emotion toward the mean. It is also worth noting that the improved accuracy of the emotion curve is consistent with, and causally related to, the improved correlation coefficient of the subsequent prediction model; drawing the emotion curve accurately serves to improve the objective results of download-count prediction.
Table 1 is a comparison of the prediction of the amount of downloaded target text given by the modified gaussian process:
TABLE 1
The outcome data of the invention is in the last row. Compared with the traditional plain text characteristic and the curve generation method given by the predecessor, the prediction result of the download amount has higher positive correlation.
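The comparison above is stated in terms of the correlation between predicted and actual download counts. The text does not say which correlation coefficient was used, so the sketch below assumes the Pearson coefficient purely for illustration; the function name and the sample numbers are likewise hypothetical and are not experimental results.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


# Made-up predicted and actual log download counts for four test texts.
predicted = [4.1, 5.0, 6.2, 5.5]
actual = [4.0, 5.3, 6.0, 5.9]
r = pearson(predicted, actual)
print(r)
```

A value closer to 1 indicates that the predictions rise and fall with the actual download counts.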
Table 2 shows an example of recommendation results according to the emotional curve similarity:
TABLE 2
As Table 2 shows, when recommendations are made from emotion curves alone, the invention returns another revised edition of the original novel as the closest recommendation, which illustrates that recommending by emotion curve is reasonable; Table 2 also lists the novels that are next closest in emotion curve, which illustrates the utility of the invention.
The invention provides a method for automatically generating a novel text emotion curve and predicting recommendation, and the emotion curve it generates reflects the emotional changes of the text more accurately. The download prediction method provided by the invention is innovative and differs from the prior art in that it focuses on the overall emotional change of the novel text; compared with traditional text feature methods, it achieves a higher positive correlation when predicting the actual download counts of independent new texts. The novels recommended by the invention are those whose emotion curves are closest to that of the target text; the recommendations are reasonable and distinctive and provide a new angle for recommendation tasks involving novel texts.
The present invention provides a method for automatically generating a novel text emotion curve and predicting recommendation, and there are many ways to implement it. The above description is only a preferred embodiment of the present invention. It should be noted that those skilled in the art can make a number of modifications and refinements without departing from the principle of the present invention, and these modifications and refinements should also be regarded as falling within the protection scope of the present invention. All components not specified in the present embodiment can be realized by the prior art.