CN112668318A - Work author identification method based on time sequence - Google Patents

Work author identification method based on time sequence

Info

Publication number
CN112668318A
CN112668318A (application CN202110273383.XA)
Authority
CN
China
Prior art keywords
text
data
time
author
works
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110273383.XA
Other languages
Chinese (zh)
Inventor
李泽朋
潘正颐
侯大为
马元巍
顾徐波
张焱
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Changzhou Weiyizhi Technology Co Ltd
Original Assignee
Changzhou Weiyizhi Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Changzhou Weiyizhi Technology Co Ltd filed Critical Changzhou Weiyizhi Technology Co Ltd
Priority to CN202110273383.XA priority Critical patent/CN112668318A/en
Publication of CN112668318A publication Critical patent/CN112668318A/en
Pending legal-status Critical Current

Abstract

The invention discloses a work author identification method based on time series. The method first converts text data into time series data according to Zipf's law; then performs time-domain feature extraction on the sample data converted into time series through Tsfresh, and reduces the dimensionality of the text feature data with Tsfresh's feature selection technique and principal component analysis; finally, it realizes Stacking model fusion with the XGBoost, LightGBM and SVM machine learning methods, predicts the author of a text from the extracted text features, and completes the author attribution judgment of the text. The method infers whether other works were written by an author on the basis of the author's existing works.

Description

Work author identification method based on time sequence
Technical Field
The invention relates to the technical field of computers, in particular to a work author identification method based on time series.
Background
Chinese patent CN103106192B (application number CN201310043297.5, filed 2013-02-02, published 2016-02-03) discloses a literary work author identification method and apparatus. In that method, an input literary work is segmented into words to obtain word-segmentation phrases and their target occurrence frequencies; the information entropy of the input work is calculated from these frequencies; the information entropy of sample works of a target author is obtained; and whether the author of the input work is the target author is identified by comparing the entropy of the author's sample works with that of the input work. However, that patent does not extract text features from the perspective of time series, so the temporal features of the text may be ignored.
Disclosure of Invention
The technical problem to be solved by the invention is as follows: to address the problems in the background art, a time-series-based work author identification method is provided that realizes authorship judgment of a text and can infer, from an author's existing works, whether other works were also written by that author.
The technical scheme adopted by the invention for solving the technical problems is as follows: a work author identification method based on time sequence comprises the following specific steps:
firstly, converting text data into time series data according to a Zipf law;
secondly, extracting time domain features of the sample data converted into the time sequence through the Tsfresh, reducing dimensions of text feature data according to a feature selection technology and a principal component analysis method of the Tsfresh, and performing feature selection and feature dimension reduction on the text from the angle of the time sequence in the process;
and finally, realizing Stacking model fusion through an XGboost machine learning method, a LightGBM machine learning method and an SVM machine learning method, realizing author prediction of the text according to the extracted text characteristics, and finishing author attribution judgment of the text.
Further specifically, in the above technical solution, in step one, the occurrence frequencies of the words in the text data are arranged in descending order according to Zipf's law, serial numbers are assigned in turn, and the text data is converted into time series data by replacing each word with its corresponding serial number.
Further specifically, in the above technical solution, in step one, the text data is a data set obtained from a website; the data set includes a plurality of works of a plurality of authors; the works are made into samples, corresponding labels are respectively assigned to the samples, and the samples are divided into two parts by random distribution, one part serving as a training set and the other as a testing set.
Further specifically, in the above technical solution, in step two, the feature selection technique selects features with interpretability and importance according to the corresponding labels.
Further specifically, in the above technical solution, in step two, the principal component analysis method selects the features that best represent the text.
Further specifically, in the above technical solution, in step three, XGBoost and LightGBM are used as primary learners and the SVM is used as a secondary learner; the two primary learners are each trained on the training set with cross validation; the outputs of the two primary learners are spliced to generate a secondary training set; the secondary learner is then trained on the secondary training set, and the trained model predicts on the test set to obtain the result.
The invention has the beneficial effects that: the time-series-based work author identification method extracts text features from the perspective of the time series, realizes authorship judgment of a text, and can infer from an author's existing works whether other works were also written by that author.
Drawings
In order to more clearly illustrate the embodiments or technical solutions in the prior art of the present invention, the drawings used in the description of the embodiments or prior art will be briefly described below, it is obvious that the drawings in the following description are only some embodiments described in the present application, and other drawings can be obtained by those skilled in the art without creative efforts.
FIG. 1 is a flow chart of text feature extraction and text feature dimension reduction;
FIG. 2 is a schematic diagram of Stacking model fusion.
Detailed Description
In order to make the technical problems, technical solutions and advantageous effects solved by the present invention more clearly apparent, the present invention is further described in detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
Referring to fig. 1 and fig. 2, the method for identifying the author of the work based on the time sequence of the present invention includes the following specific steps:
Firstly, the occurrence frequencies of the words in the text data are arranged in descending order according to Zipf's law, serial numbers are assigned in turn, and the words in the text data are replaced by their corresponding serial numbers to convert the text data into time series data. The serial numbers are assigned as follows: the word with the highest frequency of occurrence is given serial number 1, the word with the second highest frequency is given serial number 2, and so on.
It should be noted that in "the text data can be converted into time series data by replacing words in the text data with the corresponding serial numbers", the first half of the sentence describes assigning the serial numbers and the second half describes replacing the text with them. For example, for the sentence "I love you, but you don't love me", if the serial numbers are set as "you" = 1, "love" = 2, "but" = 3, "me" = 4, "don't" = 5, "," = 6, "I" = 7, then the resulting sequence is 7 2 1 6 3 1 5 2 4.
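The word-to-rank conversion above can be sketched as follows. This is a minimal illustration with a hypothetical function name, assuming an already-tokenized input and breaking frequency ties by order of first appearance (the description does not specify a tie-breaking rule):

```python
from collections import Counter

def text_to_rank_series(tokens):
    """Convert a token list into a Zipf-rank time series.

    Rank 1 goes to the most frequent token, rank 2 to the second
    most frequent, and so on. Ties are broken by first appearance,
    which is an assumption not specified in the patent.
    """
    counts = Counter(tokens)
    first_seen = {tok: tokens.index(tok) for tok in counts}
    ordered = sorted(counts, key=lambda t: (-counts[t], first_seen[t]))
    rank = {tok: i + 1 for i, tok in enumerate(ordered)}
    return [rank[tok] for tok in tokens]

# "a" occurs 3x (rank 1), "b" 2x (rank 2), "c" once (rank 3)
series = text_to_rank_series(["a", "b", "a", "c", "a", "b"])
```

Replacing each token by its rank turns any text into a one-dimensional integer series suitable for time-series feature extraction.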
The text data is a data set obtained from a website; the data set comprises a plurality of works of a plurality of authors; the works are made into samples, corresponding labels are respectively assigned to the samples, and the samples are divided into two parts by random distribution, one part serving as a training set and the other as a testing set. The labels are set as follows: among the several authors, works belonging to author one are labeled 0, works belonging to author two are labeled 1, and so on.
Zipf's law arranges the occurrence frequencies of the words in the text data in descending order, so the frequency of the word ranked r obeys the power-law relationship:

P(r) ∝ 1 / r^a   (1)

where P denotes the frequency of the word of rank r and a denotes a specific constant.
It indicates that in text data, only a very few words are frequently used, and the vast majority of words are rarely used.
And secondly, extracting time domain features of the sample data converted into the time sequence through the Tsfresh, reducing dimensions of the text feature data according to a feature selection technology and a principal component analysis method of the Tsfresh, and performing feature selection and feature dimension reduction on the text from the angle of the time sequence.
The feature selection technology is to select features having explanatory and important properties according to the corresponding tags. Principal component analysis is the selection of features that best represent the characteristics of the text.
Feature extraction: Tsfresh is a time series feature extraction tool based on scalable hypothesis testing. It contains a variety of feature extraction methods and robust feature selection algorithms, and can automatically extract thousands of features from a time series. These describe basic characteristics of the series, such as the number of peaks, the average or the maximum, as well as more complex ones, such as time-reversal symmetry statistics. After the text is converted into a time series, the series is one-dimensional temporal data in which some characteristics of the text are hard to express; features capable of expressing the characteristics of the text can therefore be extracted through Tsfresh and used to construct a machine learning model.
The peak value refers to the difference between the highest (or lowest) value of the time series and its average value within one period.
The average value is the sum of the values of all points in the time series divided by the total number of points:

mean = (1/S) · Σ_{i=1}^{S} t_i   (2)

where mean represents the average value, i indexes the i-th time point, t_i represents the value at that time point, and S represents the number of time points in the time series.
The maximum value is the largest point value in the time series:

max = max_{1 ≤ i ≤ S} t_i

where t_i represents the value at time point i and S represents the number of time points in the time series.
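The time-domain quantities just described (average, maximum, and a simple peak count) can be hand-rolled as below. This is only illustrative — in the patent's pipeline Tsfresh computes these and thousands of other features automatically; the function name and the strict-neighbour definition of a peak are assumptions:

```python
import numpy as np

def time_domain_features(t):
    """Compute mean, maximum and peak count of a 1-D series t."""
    t = np.asarray(t, dtype=float)
    S = len(t)
    mean = t.sum() / S            # formula (2): sum over points / count
    maximum = t.max()             # largest point value in the series
    # count interior points strictly above both neighbours
    peaks = int(np.sum((t[1:-1] > t[:-2]) & (t[1:-1] > t[2:])))
    return mean, maximum, peaks

features = time_domain_features([1, 3, 2, 5, 4])
```

Applied to a rank series produced from a text, such features summarize the series in a fixed-length vector usable by a classifier.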
Feature dimensionality reduction: a time series usually contains noise and other irrelevant information, so part of the extracted features contain interference; features relevant to the text can be selected by a dimension reduction method. Tsfresh evaluates the interpretability and importance of each feature with respect to the sample's label; the method rests on a mature hypothesis-testing theory and employs several test methods, so it can effectively select features of the text data. Principal Component Analysis (PCA) converts many variables into a few uncorrelated composite variables that reflect the whole data set more comprehensively. This is possible because the original variables in the data set are correlated to some degree, so their information can be integrated into fewer composite variables, called principal components; the principal components are mutually uncorrelated, i.e., the information they represent does not overlap.
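The PCA step described above can be sketched with a plain SVD. This is a minimal stand-in with a hypothetical function name, not the patent's full pipeline (which combines Tsfresh's hypothesis-test feature selection with PCA):

```python
import numpy as np

def pca_reduce(X, e):
    """Project samples X (n x m) onto the top-e principal axes."""
    Xc = X - X.mean(axis=0)       # center each feature
    # rows of Vt are orthonormal principal directions, ordered by
    # decreasing singular value, i.e. decreasing explained variance
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:e].T          # keep only the first e components

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))      # toy feature matrix
Z = pca_reduce(X, 2)
```

Because the directions are ordered by variance, keeping the first e columns retains most of the data's variance while discarding near-zero-variance dimensions.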
And finally, realizing Stacking model fusion through an XGboost machine learning method, a LightGBM machine learning method and an SVM machine learning method, realizing author prediction of the text according to the extracted text characteristics, and finishing author attribution judgment of the text.
XGBoost and LightGBM are used as primary learners and the SVM is used as a secondary learner. The two primary learners are each trained on the training set with cross validation; their outputs are spliced to generate a secondary training set; the secondary learner is then trained on the secondary training set, and the trained model predicts on the test set to obtain the result.
XGBoost is an optimized distributed gradient boosting library designed to be efficient, flexible and portable. It implements machine learning algorithms under the Gradient Boosting framework and provides parallel tree boosting (also known as GBDT or GBM), which can solve many data science problems quickly and accurately. XGBoost is a tree boosting method whose basic classification model is the decision tree.
For the training data set:

D = {(x_i, y_i)}, i = 1, …, n   (3)

where x_i ∈ R^m and y_i ∈ R; D denotes the data set, x denotes the features of the data set, y denotes the labels of the data set, n denotes the number of samples, i indexes the i-th sample, R denotes the real number space, and m denotes the feature dimension of a sample.
The final model of the whole XGBoost is:

ŷ_i = Σ_{k=1}^{K} f_k(x_i),  f_k ∈ F   (4)

where F = { f(x) = w_{q(x)} | q: R^m → {1, …, T}, w ∈ R^T } represents the space of decision trees; q represents a decision tree structure, i.e., the decision function that maps a sample to a leaf; w represents the leaf labels of the decision tree; R represents the real number space; m represents that the feature dimension of a sample is m; T represents the number of leaf labels of the decision tree. The label on a specific leaf is decided by q, so w can be viewed as a vector each of whose dimensions is the label of one leaf; f_k represents the k-th decision function.
The learning function of XGBoost is:

L(φ) = Σ_i l(ŷ_i, y_i) + Σ_k Ω(f_k)   (5)

The ultimate goal of learning is to minimize L(φ), where l(ŷ_i, y_i) is the loss function, Ω(f_k) is a regularization term to prevent overfitting, and the functions f_k are the objects of parameter adjustment.
Second, LightGBM is a model that is more powerful and faster than XGBoost, with a substantial performance improvement. Compared with traditional algorithms, its advantages are: higher training efficiency, higher accuracy, support for parallelized learning, and the ability to process large-scale data.
LightGBM is a histogram-based decision tree algorithm: continuous floating-point feature values are discretized into Z integers, and a histogram of width Z is constructed. When the data is traversed, statistics are accumulated in the histogram using the discretized values as indices; after one pass over the data, the histogram has accumulated the needed statistics, and the optimal split point is then found by traversing the discrete values of the histogram.
LightGBM adopts a leaf-wise growth strategy: each time it finds, among all current leaves, the leaf with the largest splitting gain (generally also the one with the largest data volume), splits it, and repeats; this yields better precision. A disadvantage of leaf-wise growth is that it may grow a deeper decision tree and cause overfitting, so LightGBM adds a maximum-depth limit on top of the leaf-wise strategy, preventing overfitting while ensuring high efficiency.
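The histogram discretization step can be sketched as follows. This is a toy illustration of binning continuous values into Z integers, not LightGBM's actual implementation; the quantile-based bin boundaries are an assumption:

```python
import numpy as np

def histogram_bin(values, Z):
    """Discretize continuous values into integer bins 0..Z-1 using
    equally spaced quantile boundaries (an assumed binning rule)."""
    values = np.asarray(values, dtype=float)
    # Z-1 interior edges split the value range into Z bins
    edges = np.quantile(values, np.linspace(0, 1, Z + 1)[1:-1])
    return np.searchsorted(edges, values, side="right")

bins = histogram_bin([0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8], 4)
```

Once features are integers in 0..Z-1, split-gain statistics can be accumulated into Z buckets in one pass, which is what makes histogram-based training fast.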
The basic principle of the Support Vector Machine (SVM) is to find an optimal classification surface such that the distance from the points of each class closest to the surface to the surface is the largest. Between two classes ω1 and ω2 (the idea can be extended to multiple classes), there are multiple classification planes that separate the two classes accurately.

These classification planes may be defined as:

w · x + b = 0

where w · x is the inner product of the weight vector w with the sample vector x, and b is a scalar. Among these classification planes, the support vector machine finds the optimal one, characterized by the property that the points of ω1 and ω2 nearest to the plane lie at the greatest possible distance from it — that is, the sum of the two distances is the largest, which is also called having the maximum spacing.
Fourthly, Stacking model fusion: on the basis of the machine learning models above, the invention uses a Stacking fusion model. The core idea of Stacking is to combine the results of the individual machine learners with another machine learning algorithm. In the Stacking method, the individual learners are called primary learners, the learner used for combination is called the secondary learner or meta-learner, and the data used to train the secondary learner is called the secondary training set; the secondary training set is obtained with the primary learners on the training set. The invention applies five-fold cross validation on the training set with this model to prevent overfitting.
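A minimal sketch of this Stacking scheme with five-fold cross validation, using scikit-learn's StackingClassifier on synthetic data. To keep the example self-contained, GradientBoostingClassifier and DecisionTreeClassifier stand in for XGBoost and LightGBM (an assumption); the SVM remains the secondary learner as in the text:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, StackingClassifier
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# synthetic stand-in for the extracted text-feature matrix
X, y = make_classification(n_samples=200, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

stack = StackingClassifier(
    estimators=[
        ("gbdt", GradientBoostingClassifier(random_state=0)),  # XGBoost stand-in
        ("tree", DecisionTreeClassifier(random_state=0)),      # LightGBM stand-in
    ],
    final_estimator=SVC(),   # secondary learner
    cv=5,                    # five-fold cross validation on the training set
)
stack.fit(X_train, y_train)
accuracy = stack.score(X_test, y_test)
```

StackingClassifier internally generates the secondary training set from the primary learners' cross-validated predictions, matching the splicing step described above.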
The data sets used were: Booth Tarkington (22 works), Charles Dickens (44), Edith Nesbit (10), Arthur Conan Doyle (51), Mark Twain (29), Sir Richard Francis Burton (11), and Émile Gaboriau (10) — 177 works by 7 writers in total, all of which are downloadable from the Gutenberg project site (www.gutngberg.com).
The 177 works are made into samples, are respectively marked with corresponding labels, and are divided into two parts according to a random distribution method, wherein 80% of the two parts are distributed into a training set, and 20% of the two parts are distributed into a testing set. According to the Zipf law, the frequency of the words in the text data is arranged in the order from big to small, the serial numbers are given in sequence, and the serial numbers are correspondingly substituted for the words in the text data, so that the text data can be converted into time sequence data.
The calculation formula of the word frequency is:

P = M / C   (7)

where M represents the number of occurrences of the word in the article and C represents the total number of words in the article.
Time-domain feature extraction is performed on the sample data converted into time series through Tsfresh; features with interpretability and importance are selected according to the corresponding labels, and the features that best represent the text are then selected again by principal component analysis. Feature selection ranks the features by contribution and keeps those with higher contribution; Tsfresh completes this automatically. The principle of principal component analysis is to find, in the original space, a group of mutually orthogonal coordinate axes in sequence: the first new axis is the direction of largest variance in the original data; the second is the direction orthogonal to the first with the largest variance; the third is the direction orthogonal to the first two with the largest variance; and so on, yielding n axes. With the new axes obtained this way, most of the variance is contained in the first e axes (where e is an integer with 0 < e < n), and the variance contained in the remaining axes is almost 0. The remaining axes can therefore be ignored: only the first e axes containing most of the variance are retained — that is, only the dimensions containing most of the variance are kept and the near-zero-variance dimensions are discarded — which realizes the feature selection processing of the data.
A Stacking fusion model is constructed with XGBoost and LightGBM as primary learners and the SVM as the secondary learner. The two primary learners are each trained on the training set with 5-fold cross validation; their outputs are spliced to generate a secondary training set; the secondary learner is then trained on the secondary training set, and the trained model predicts on the test set to obtain the result. The 5-fold cross validation process divides the original training set into 5 groups; each subset in turn serves as the validation set while the remaining 4 groups form a new training set.
The above description covers only preferred embodiments of the present invention, but the scope of the invention is not limited thereto; any equivalent substitutions or changes made by a person skilled in the art according to the technical solutions and inventive concept of the present invention shall fall within the scope of the invention.

Claims (6)

1. A work author identification method based on time sequence is characterized by comprising the following specific steps:
firstly, converting text data into time series data according to a Zipf law;
secondly, extracting time domain features of the sample data converted into the time sequence through the Tsfresh, reducing dimensions of text feature data according to a feature selection technology and a principal component analysis method of the Tsfresh, and performing feature selection and feature dimension reduction on the text from the angle of the time sequence in the process;
and finally, realizing Stacking model fusion through an XGboost machine learning method, a LightGBM machine learning method and an SVM machine learning method, realizing author prediction of the text according to the extracted text characteristics, and finishing author attribution judgment of the text.
2. The time-series based work author identification method of claim 1, wherein: in the first step, the frequency of the words in the text data is arranged in the order from large to small according to the Zipf law, the sequence numbers are given in sequence, and the sequence numbers are correspondingly substituted for the words in the text data, so that the text data can be converted into time sequence data.
3. The time-series-based work author identification method according to claim 1 or 2, wherein: in the step one, the text data is a data set obtained from a website, the data set comprises a plurality of works of a plurality of authors, the works are made into samples, corresponding labels are respectively marked on the samples, the samples are divided into two parts according to a random distribution method, one part of the works are distributed into a training set, and the other part of the works are distributed into a testing set.
4. The time-series based work author identification method of claim 3, wherein: in step two, the feature selection technology selects the features with interpretability and importance according to the corresponding labels.
5. The time-series based work author identification method of claim 3, wherein: in the second step, the principal component analysis method is to select the features which can represent the text features most.
6. The time-series based work author identification method of claim 3, wherein: in the third step, XGBoost and LightGBM are used as primary learners and the SVM is used as a secondary learner; the two primary learners are each trained on the training set with cross validation; the results output by the two primary learners are spliced to generate a secondary training set; the secondary learner is then trained on the secondary training set, and the trained model predicts on the test set to obtain the result.
CN202110273383.XA 2021-03-15 2021-03-15 Work author identification method based on time sequence Pending CN112668318A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110273383.XA CN112668318A (en) 2021-03-15 2021-03-15 Work author identification method based on time sequence


Publications (1)

Publication Number Publication Date
CN112668318A true CN112668318A (en) 2021-04-16

Family

ID=75399432

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110273383.XA Pending CN112668318A (en) 2021-03-15 2021-03-15 Work author identification method based on time sequence

Country Status (1)

Country Link
CN (1) CN112668318A (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109341020A (en) * 2018-09-27 2019-02-15 重庆智万家科技有限公司 A kind of intelligent temperature control adjusting method based on big data
CN111460148A (en) * 2020-03-27 2020-07-28 深圳价值在线信息科技股份有限公司 Text classification method and device, terminal equipment and storage medium
CN111582331A (en) * 2020-04-23 2020-08-25 浙江大学 Painting work author image identification method based on convolutional neural network
CN111931868A (en) * 2020-09-24 2020-11-13 常州微亿智造科技有限公司 Time series data abnormity detection method and device
CN112070154A (en) * 2020-09-07 2020-12-11 常州微亿智造科技有限公司 Time series data processing method and device



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20210416