CN112668318A - Work author identification method based on time sequence - Google Patents

Work author identification method based on time sequence

Info

Publication number
CN112668318A
CN112668318A (application CN202110273383.XA)
Authority
CN
China
Prior art keywords
text
data
time
author
works
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110273383.XA
Other languages
Chinese (zh)
Inventor
李泽朋
潘正颐
侯大为
马元巍
顾徐波
张焱
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Changzhou Weiyizhi Technology Co Ltd
Original Assignee
Changzhou Weiyizhi Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Changzhou Weiyizhi Technology Co Ltd filed Critical Changzhou Weiyizhi Technology Co Ltd
Priority to CN202110273383.XA priority Critical patent/CN112668318A/en
Publication of CN112668318A publication Critical patent/CN112668318A/en
Pending legal-status Critical Current

Abstract

The invention discloses a work author identification method based on time series. The method first converts text data into time series data according to Zipf's law; then performs time-domain feature extraction on the sample data converted into time series through Tsfresh, and reduces the dimensionality of the text feature data with Tsfresh's feature selection technique and principal component analysis; finally, it realizes Stacking model fusion with the XGBoost, LightGBM and SVM machine learning methods, predicts the author of a text from the extracted text features, and completes the author attribution judgment of the text. The method infers whether other works were written by an author on the basis of the author's existing works.

Description

Work author identification method based on time sequence
Technical Field
The invention relates to the technical field of computers, in particular to a work author identification method based on time series.
Background
Chinese patent CN103106192B (application number CN201310043297.5, filed 2013-02-02, published 2016-02-03) discloses a literary work author identification method and apparatus. In that method, an input literary work is segmented into words to obtain word-segmentation phrases and their target occurrence frequencies; the information entropy of the input work is calculated from these frequencies; the information entropy of sample works of a target author is obtained; and whether the author of the input work is the target author is identified by comparing the entropy of the author's sample works with that of the input work. However, that patent does not extract text features from the perspective of time series, so the temporal features of the text may be ignored.
Disclosure of Invention
The technical problem to be solved by the invention is as follows: to address the problems in the background art, a time-series-based work author identification method is provided that realizes authorship judgment of a text and can infer, from an author's existing works, whether other works were also written by that author.
The technical scheme adopted by the invention for solving the technical problems is as follows: a work author identification method based on time sequence comprises the following specific steps:
firstly, converting text data into time series data according to a Zipf law;
secondly, extracting time domain features of the sample data converted into the time sequence through the Tsfresh, reducing dimensions of text feature data according to a feature selection technology and a principal component analysis method of the Tsfresh, and performing feature selection and feature dimension reduction on the text from the angle of the time sequence in the process;
and finally, realizing Stacking model fusion through an XGboost machine learning method, a LightGBM machine learning method and an SVM machine learning method, realizing author prediction of the text according to the extracted text characteristics, and finishing author attribution judgment of the text.
Further specifically, in the above technical solution, in step one, the occurrence frequencies of the words in the text data are arranged in descending order according to Zipf's law, serial numbers are assigned in turn, and the text data is converted into time series data by replacing each word with its corresponding serial number.
Further specifically, in the above technical solution, in step one, the text data is a data set obtained from a website; the data set includes a plurality of works of a plurality of authors; the works are made into samples, corresponding labels are respectively assigned to the samples, and the samples are divided into two parts by random distribution, one part serving as a training set and the other as a testing set.
Further specifically, in the above technical solution, in step two, the feature selection technique selects features with interpretability and importance according to the corresponding labels.
Further specifically, in the above technical solution, in step two, the principal component analysis method selects the features that best represent the text.
Further specifically, in the above technical solution, in step three, XGBoost and LightGBM are used as primary learners and the SVM is used as a secondary learner; the two primary learners are each trained on the training set with cross validation; the outputs of the two primary learners are spliced to generate a secondary training set; the secondary learner is then trained on the secondary training set, and the trained model predicts on the test set to obtain the result.
The invention has the beneficial effects that: the time-series-based work author identification method extracts text features from the perspective of the time series, realizes authorship judgment of a text, and can infer from an author's existing works whether other works were also written by that author.
Drawings
In order to more clearly illustrate the embodiments or technical solutions in the prior art of the present invention, the drawings used in the description of the embodiments or prior art will be briefly described below, it is obvious that the drawings in the following description are only some embodiments described in the present application, and other drawings can be obtained by those skilled in the art without creative efforts.
FIG. 1 is a flow chart of text feature extraction and text feature dimension reduction;
FIG. 2 is a schematic diagram of Stacking model fusion.
Detailed Description
In order to make the technical problems, technical solutions and advantageous effects solved by the present invention more clearly apparent, the present invention is further described in detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
Referring to fig. 1 and fig. 2, the method for identifying the author of the work based on the time sequence of the present invention includes the following specific steps:
Firstly, the occurrence frequencies of the words in the text data are arranged in descending order according to Zipf's law, serial numbers are assigned in turn, and the words in the text data are replaced by their corresponding serial numbers to convert the text data into time series data. The serial numbers are assigned as follows: the word with the highest frequency of occurrence is given serial number 1, the word with the second highest frequency is given serial number 2, and so on.
It should be noted that in "the text data can be converted into time series data by replacing words in the text data with the corresponding serial numbers", the first half of the sentence describes assigning the serial numbers and the second half describes replacing the text with them. For example, for the sentence "I love you, but you don't love me", if the serial numbers are set as "you" = 1, "love" = 2, "but" = 3, "me" = 4, "don't" = 5, "," = 6, "I" = 7, then the resulting sequence is 7 2 1 6 3 1 5 2 4.
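The word-to-rank conversion above can be sketched as follows. This is a minimal illustration with a hypothetical function name, assuming an already-tokenized input and breaking frequency ties by order of first appearance (the description does not specify a tie-breaking rule):

```python
from collections import Counter

def text_to_rank_series(tokens):
    """Convert a token list into a Zipf-rank time series.

    Rank 1 goes to the most frequent token, rank 2 to the second
    most frequent, and so on. Ties are broken by first appearance,
    which is an assumption not specified in the patent.
    """
    counts = Counter(tokens)
    first_seen = {tok: tokens.index(tok) for tok in counts}
    ordered = sorted(counts, key=lambda t: (-counts[t], first_seen[t]))
    rank = {tok: i + 1 for i, tok in enumerate(ordered)}
    return [rank[tok] for tok in tokens]

# "a" occurs 3x (rank 1), "b" 2x (rank 2), "c" once (rank 3)
series = text_to_rank_series(["a", "b", "a", "c", "a", "b"])
```

Replacing each token by its rank turns any text into a one-dimensional integer series suitable for time-series feature extraction.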
The text data is a data set obtained from a website; the data set comprises a plurality of works of a plurality of authors; the works are made into samples, corresponding labels are respectively assigned to the samples, and the samples are divided into two parts by random distribution, one part serving as a training set and the other as a testing set. The labels are set as follows: among the several authors, works belonging to author one are labeled 0, works belonging to author two are labeled 1, and so on.
Zipf's law arranges the occurrence frequencies of the words in the text data in descending order, so the frequency of the word ranked r obeys the power-law relationship:

P(r) ∝ 1 / r^a   (1)

where P denotes the frequency of the word of rank r and a denotes a specific constant.
It indicates that in text data, only a very few words are frequently used, and the vast majority of words are rarely used.
And secondly, extracting time domain features of the sample data converted into the time sequence through the Tsfresh, reducing dimensions of the text feature data according to a feature selection technology and a principal component analysis method of the Tsfresh, and performing feature selection and feature dimension reduction on the text from the angle of the time sequence.
The feature selection technology is to select features having explanatory and important properties according to the corresponding tags. Principal component analysis is the selection of features that best represent the characteristics of the text.
Feature extraction: Tsfresh is a time series feature extraction tool based on scalable hypothesis testing. It contains a variety of feature extraction methods and robust feature selection algorithms, and can automatically extract thousands of features from a time series. These describe basic characteristics of the series, such as the number of peaks, the average or the maximum, as well as more complex ones, such as time-reversal symmetry statistics. After the text is converted into a time series, the series is one-dimensional temporal data in which some characteristics of the text are hard to express; features capable of expressing the characteristics of the text can therefore be extracted through Tsfresh and used to construct a machine learning model.
The peak value refers to the difference between the highest (or lowest) value of the time series and its average value within one period.
The average value is the sum of the values of all points in the time series divided by the total number of points:

mean = (1/S) · Σ_{i=1}^{S} t_i   (2)

where mean represents the average value, i indexes the i-th time point, t_i represents the value at that time point, and S represents the number of time points in the time series.
The maximum value is the largest point value in the time series:

max = max_{1 ≤ i ≤ S} t_i

where t_i represents the value at time point i and S represents the number of time points in the time series.
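The time-domain quantities just described (average, maximum, and a simple peak count) can be hand-rolled as below. This is only illustrative — in the patent's pipeline Tsfresh computes these and thousands of other features automatically; the function name and the strict-neighbour definition of a peak are assumptions:

```python
import numpy as np

def time_domain_features(t):
    """Compute mean, maximum and peak count of a 1-D series t."""
    t = np.asarray(t, dtype=float)
    S = len(t)
    mean = t.sum() / S            # formula (2): sum over points / count
    maximum = t.max()             # largest point value in the series
    # count interior points strictly above both neighbours
    peaks = int(np.sum((t[1:-1] > t[:-2]) & (t[1:-1] > t[2:])))
    return mean, maximum, peaks

features = time_domain_features([1, 3, 2, 5, 4])
```

Applied to a rank series produced from a text, such features summarize the series in a fixed-length vector usable by a classifier.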
Feature dimensionality reduction: a time series usually contains noise and other irrelevant information, so part of the extracted features contain interference; features relevant to the text can be selected by a dimension reduction method. Tsfresh evaluates the interpretability and importance of each feature with respect to the sample's label; the method rests on a mature hypothesis-testing theory and employs several test methods, so it can effectively select features of the text data. Principal Component Analysis (PCA) converts many variables into a few uncorrelated composite variables that reflect the whole data set more comprehensively. This is possible because the original variables in the data set are correlated to some degree, so their information can be integrated into fewer composite variables, called principal components; the principal components are mutually uncorrelated, i.e., the information they represent does not overlap.
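The PCA step described above can be sketched with a plain SVD. This is a minimal stand-in with a hypothetical function name, not the patent's full pipeline (which combines Tsfresh's hypothesis-test feature selection with PCA):

```python
import numpy as np

def pca_reduce(X, e):
    """Project samples X (n x m) onto the top-e principal axes."""
    Xc = X - X.mean(axis=0)       # center each feature
    # rows of Vt are orthonormal principal directions, ordered by
    # decreasing singular value, i.e. decreasing explained variance
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:e].T          # keep only the first e components

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))      # toy feature matrix
Z = pca_reduce(X, 2)
```

Because the directions are ordered by variance, keeping the first e columns retains most of the data's variance while discarding near-zero-variance dimensions.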
And finally, realizing Stacking model fusion through an XGboost machine learning method, a LightGBM machine learning method and an SVM machine learning method, realizing author prediction of the text according to the extracted text characteristics, and finishing author attribution judgment of the text.
XGBoost and LightGBM are used as primary learners and the SVM is used as a secondary learner. The two primary learners are each trained on the training set with cross validation; their outputs are spliced to generate a secondary training set; the secondary learner is then trained on the secondary training set, and the trained model predicts on the test set to obtain the result.
XGBoost is an optimized distributed gradient boosting library designed to be efficient, flexible and portable. It implements machine learning algorithms under the Gradient Boosting framework and provides parallel tree boosting (also known as GBDT or GBM), which can solve many data science problems quickly and accurately. XGBoost is a tree boosting method whose basic classification model is the decision tree.
For the training data set:

D = {(x_i, y_i)}, i = 1, …, n   (3)

where x_i ∈ R^m and y_i ∈ R; D denotes the data set, x denotes the features of the data set, y denotes the labels of the data set, n denotes the number of samples, i indexes the i-th sample, R denotes the real number space, and m denotes the feature dimension of a sample.
The final model of the whole XGBoost is:

ŷ_i = Σ_{k=1}^{K} f_k(x_i),  f_k ∈ F   (4)

where F = { f(x) = w_{q(x)} | q: R^m → {1, …, T}, w ∈ R^T } represents the space of decision trees; q represents a decision tree structure, i.e., the decision function that maps a sample to a leaf; w represents the leaf labels of the decision tree; R represents the real number space; m represents that the feature dimension of a sample is m; T represents the number of leaf labels of the decision tree. The label on a specific leaf is decided by q, so w can be viewed as a vector each of whose dimensions is the label of one leaf; f_k represents the k-th decision function.
The learning function of XGBoost is:

L(φ) = Σ_i l(ŷ_i, y_i) + Σ_k Ω(f_k)   (5)

The ultimate goal of learning is to minimize L(φ), where l(ŷ_i, y_i) is the loss function, Ω(f_k) is a regularization term to prevent overfitting, and the functions f_k are the objects of parameter adjustment.
Second, LightGBM is a model that is more powerful and faster than XGBoost, with a substantial performance improvement. Compared with traditional algorithms, its advantages are: higher training efficiency, higher accuracy, support for parallelized learning, and the ability to process large-scale data.
LightGBM is a histogram-based decision tree algorithm: continuous floating-point feature values are discretized into Z integers, and a histogram of width Z is constructed. When the data is traversed, statistics are accumulated in the histogram using the discretized values as indices; after one pass over the data, the histogram has accumulated the needed statistics, and the optimal split point is then found by traversing the discrete values of the histogram.
LightGBM adopts a leaf-wise growth strategy: each time it finds, among all current leaves, the leaf with the largest splitting gain (generally also the one with the largest data volume), splits it, and repeats; this yields better precision. A disadvantage of leaf-wise growth is that it may grow a deeper decision tree and cause overfitting, so LightGBM adds a maximum-depth limit on top of the leaf-wise strategy, preventing overfitting while ensuring high efficiency.
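The histogram discretization step can be sketched as follows. This is a toy illustration of binning continuous values into Z integers, not LightGBM's actual implementation; the quantile-based bin boundaries are an assumption:

```python
import numpy as np

def histogram_bin(values, Z):
    """Discretize continuous values into integer bins 0..Z-1 using
    equally spaced quantile boundaries (an assumed binning rule)."""
    values = np.asarray(values, dtype=float)
    # Z-1 interior edges split the value range into Z bins
    edges = np.quantile(values, np.linspace(0, 1, Z + 1)[1:-1])
    return np.searchsorted(edges, values, side="right")

bins = histogram_bin([0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8], 4)
```

Once features are integers in 0..Z-1, split-gain statistics can be accumulated into Z buckets in one pass, which is what makes histogram-based training fast.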
The basic principle of the Support Vector Machine (SVM) is to find an optimal classification surface such that the distance from the points of each class closest to the surface to the surface is the largest. Between two classes ω1 and ω2 (the idea can be extended to multiple classes), there are multiple classification planes that separate the two classes accurately.

These classification planes may be defined as:

w · x + b = 0

where w · x is the inner product of the weight vector w with the sample vector x, and b is a scalar. Among these classification planes, the support vector machine finds the optimal one, characterized by the property that the points of ω1 and ω2 nearest to the plane lie at the greatest possible distance from it — that is, the sum of the two distances is the largest, which is also called having the maximum spacing.
Fourthly, Stacking model fusion: on the basis of the machine learning models above, the invention uses a Stacking fusion model. The core idea of Stacking is to combine the results of the individual machine learners with another machine learning algorithm. In the Stacking method, the individual learners are called primary learners, the learner used for combination is called the secondary learner or meta-learner, and the data used to train the secondary learner is called the secondary training set; the secondary training set is obtained with the primary learners on the training set. The invention applies five-fold cross validation on the training set with this model to prevent overfitting.
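A minimal sketch of this Stacking scheme with five-fold cross validation, using scikit-learn's StackingClassifier on synthetic data. To keep the example self-contained, GradientBoostingClassifier and DecisionTreeClassifier stand in for XGBoost and LightGBM (an assumption); the SVM remains the secondary learner as in the text:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, StackingClassifier
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# synthetic stand-in for the extracted text-feature matrix
X, y = make_classification(n_samples=200, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

stack = StackingClassifier(
    estimators=[
        ("gbdt", GradientBoostingClassifier(random_state=0)),  # XGBoost stand-in
        ("tree", DecisionTreeClassifier(random_state=0)),      # LightGBM stand-in
    ],
    final_estimator=SVC(),   # secondary learner
    cv=5,                    # five-fold cross validation on the training set
)
stack.fit(X_train, y_train)
accuracy = stack.score(X_test, y_test)
```

StackingClassifier internally generates the secondary training set from the primary learners' cross-validated predictions, matching the splicing step described above.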
The data sets used were: Booth Tarkington (22 works), Charles Dickens (44), Edith Nesbit (10), Arthur Conan Doyle (51), Mark Twain (29), Sir Richard Francis Burton (11), and Émile Gaboriau (10) — 177 works by 7 writers in total, all of which are downloadable from the Gutenberg project site (www.gutngberg.com).
The 177 works are made into samples, are respectively marked with corresponding labels, and are divided into two parts according to a random distribution method, wherein 80% of the two parts are distributed into a training set, and 20% of the two parts are distributed into a testing set. According to the Zipf law, the frequency of the words in the text data is arranged in the order from big to small, the serial numbers are given in sequence, and the serial numbers are correspondingly substituted for the words in the text data, so that the text data can be converted into time sequence data.
The calculation formula of the word frequency is:

P = M / C   (7)

where M represents the number of occurrences of the word in the article and C represents the total number of words in the article.
Time-domain feature extraction is performed on the sample data converted into time series through Tsfresh; features with interpretability and importance are selected according to the corresponding labels, and the features that best represent the text are then selected again by principal component analysis. Feature selection ranks the features by contribution and keeps those with higher contribution; Tsfresh completes this automatically. The principle of principal component analysis is to find, in the original space, a group of mutually orthogonal coordinate axes in sequence: the first new axis is the direction of largest variance in the original data; the second is the direction orthogonal to the first with the largest variance; the third is the direction orthogonal to the first two with the largest variance; and so on, yielding n axes. With the new axes obtained this way, most of the variance is contained in the first e axes (where e is an integer with 0 < e < n), and the variance contained in the remaining axes is almost 0. The remaining axes can therefore be ignored: only the first e axes containing most of the variance are retained — that is, only the dimensions containing most of the variance are kept and the near-zero-variance dimensions are discarded — which realizes the feature selection processing of the data.
A Stacking fusion model is constructed with XGBoost and LightGBM as primary learners and the SVM as the secondary learner. The two primary learners are each trained on the training set with 5-fold cross validation; their outputs are spliced to generate a secondary training set; the secondary learner is then trained on the secondary training set, and the trained model predicts on the test set to obtain the result. The 5-fold cross validation process divides the original training set into 5 groups; each subset in turn serves as the validation set while the remaining 4 groups form a new training set.
The above description covers only preferred embodiments of the present invention, but the scope of the invention is not limited thereto; any equivalent substitutions or changes made by a person skilled in the art according to the technical solutions and inventive concept of the present invention shall fall within the scope of the invention.

Claims (6)

1. A work author identification method based on time sequence is characterized by comprising the following specific steps:
firstly, converting text data into time series data according to a Zipf law;
secondly, extracting time domain features of the sample data converted into the time sequence through the Tsfresh, reducing dimensions of text feature data according to a feature selection technology and a principal component analysis method of the Tsfresh, and performing feature selection and feature dimension reduction on the text from the angle of the time sequence in the process;
and finally, realizing Stacking model fusion through an XGboost machine learning method, a LightGBM machine learning method and an SVM machine learning method, realizing author prediction of the text according to the extracted text characteristics, and finishing author attribution judgment of the text.
2. The time-series based work author identification method of claim 1, wherein: in the first step, the frequency of the words in the text data is arranged in the order from large to small according to the Zipf law, the sequence numbers are given in sequence, and the sequence numbers are correspondingly substituted for the words in the text data, so that the text data can be converted into time sequence data.
3. The time-series-based work author identification method according to claim 1 or 2, wherein: in the step one, the text data is a data set obtained from a website, the data set comprises a plurality of works of a plurality of authors, the works are made into samples, corresponding labels are respectively marked on the samples, the samples are divided into two parts according to a random distribution method, one part of the works are distributed into a training set, and the other part of the works are distributed into a testing set.
4. The time-series based work author identification method of claim 3, wherein: in step two, the feature selection technology selects the features with interpretability and importance according to the corresponding labels.
5. The time-series based work author identification method of claim 3, wherein: in the second step, the principal component analysis method is to select the features which can represent the text features most.
6. The time-series based work author identification method of claim 3, wherein: in the third step, XGBoost and LightGBM are used as primary learners and the SVM is used as a secondary learner; the two primary learners are each trained on the training set with cross validation; the results output by the two primary learners are spliced to generate a secondary training set; the secondary learner is then trained on the secondary training set, and the trained model predicts on the test set to obtain the result.
CN202110273383.XA 2021-03-15 2021-03-15 Work author identification method based on time sequence Pending CN112668318A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110273383.XA CN112668318A (en) 2021-03-15 2021-03-15 Work author identification method based on time sequence


Publications (1)

Publication Number Publication Date
CN112668318A true CN112668318A (en) 2021-04-16

Family

ID=75399432

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110273383.XA Pending CN112668318A (en) 2021-03-15 2021-03-15 Work author identification method based on time sequence

Country Status (1)

Country Link
CN (1) CN112668318A (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109341020A (en) * 2018-09-27 2019-02-15 重庆智万家科技有限公司 A kind of intelligent temperature control adjusting method based on big data
CN111460148A (en) * 2020-03-27 2020-07-28 深圳价值在线信息科技股份有限公司 Text classification method and device, terminal equipment and storage medium
CN111582331A (en) * 2020-04-23 2020-08-25 浙江大学 Painting work author image identification method based on convolutional neural network
CN111931868A (en) * 2020-09-24 2020-11-13 常州微亿智造科技有限公司 Time series data abnormity detection method and device
CN112070154A (en) * 2020-09-07 2020-12-11 常州微亿智造科技有限公司 Time series data processing method and device



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20210416