CN111625578B

CN111625578B - Feature extraction method suitable for time series data in cultural science and technology fusion field

Info

Publication number: CN111625578B
Application number: CN202010453118.5A
Authority: CN
Inventors: 王妍; 田玲玲; 刘迪; 刘德伟; 谭爱平
Original assignee: Liaoning University
Current assignee: Liaoning University
Priority date: 2020-05-26
Filing date: 2020-05-26
Publication date: 2023-12-08
Anticipated expiration: 2040-05-26
Also published as: CN111625578A

Abstract

A feature extraction method suitable for time series data in the cultural science and technology fusion field comprises the following steps: 1) obtaining time series data from a target database and classifying the series by data type into text data and numerical data; 2) classifying the numerical data by time granularity into macroscopic and microscopic time series data; after the macroscopic data are standardized, calculating the similarity between a sample and industry standard data, and using the normalized similarity as the input to D-S evidence theory for evidence fusion to obtain class features; 3) given a set of optimal shapelets obtained from existing standard time series, calculating the distance between the microscopic data and each shapelet to obtain trend features; 4) for the text data, first using a bag-of-words model to obtain a high-frequency word set, then filtering that set a second time with an improved TF-IDF to obtain a hot-word set; 5) re-executing steps 1-4 on new data using a sliding window, and stopping when there is no new data. The method can process and analyze time series data quickly and helps enterprises make strategic decisions.

Description

Feature extraction method suitable for time series data in cultural science and technology fusion field

Technical Field

The invention relates to time series data mining. In view of the characteristics of time series data in the cultural science and technology fusion field, and of the value that the hidden features of such data have for industry prediction, it provides a time series data mining method based on time-granularity division.

Background

With the progress of science and technology and the adjustment of China's economic development strategy, developing emerging industries and enterprises that take culture as their core and science and technology as their means has become a trend encouraged across the country. On the one hand, predictions of an enterprise's development hot spots mainly rely on financial data or on modeling user behavior, and pay little attention to mining time series data from multiple angles; on the other hand, general time series data mining does not take the characteristics of cultural science and technology fusion data into account, and does not mine time series data according to time granularity. For an industry or enterprise, time series data carry a large amount of information that is particularly important for formulating development strategy, and mining that information as fully as possible is a widely pursued goal.

Time series data in the cultural science and technology fusion field have distinctive characteristics: series of different time granularity carry different important information, and sample data are scarce. Meanwhile, common time series feature mining algorithms mine information from a single angle only and cannot fully capture the value of the data. These limitations hinder enterprises that want to mine their time series data thoroughly to support decisions.

Disclosure of Invention

To solve the problems in the prior art, the invention provides a feature extraction method suitable for time series data in the cultural science and technology fusion field. First, time series data are obtained from a target database and classified by data type into text data and numerical data; the numerical data are then classified by time granularity into macroscopic and microscopic time series data. After the macroscopic data are standardized, the similarity between the sample data and industry standard data is calculated, and the normalized similarity is used as the input to D-S evidence theory for evidence fusion to obtain class features. Given a set of optimal shapelets obtained from standard time series, the distance between a microscopic data sample and each shapelet is calculated to obtain trend features. For text data, a bag-of-words model is first used to obtain a high-frequency word set, which is then filtered a second time with an improved TF-IDF to obtain hot words. If new data exist, steps 1-4 are re-executed using a sliding window; processing stops when there is no new data.

In order to achieve the above purpose, the technical scheme adopted by the invention is as follows:

A feature extraction method suitable for time series data in the cultural science and technology fusion field, characterized by comprising the following steps:

step 1), obtaining time sequence data from a target database, classifying the sequences by data types to obtain text data and numerical data;

step 2), classifying the numerical data according to time granularity to obtain macroscopic time sequence data and microscopic time sequence data; after the macroscopic data is standardized, calculating the similarity between the sample and industry standard data, and carrying out evidence fusion by taking the normalized similarity as the input of a D-S evidence theory to obtain class characteristics;

step 3), given a set of optimal shapelets obtained from standard time series, calculating the distance between a microscopic data sample and each shapelet to obtain trend features;

step 4), for text data, first using a bag-of-words model to obtain a high-frequency word set, and then filtering the word set a second time with an improved TF-IDF to obtain hot words;

step 5), if new data exist, re-executing the steps 1-4 by using the sliding window; and stopping when no new data exists.

In the step 1), the specific method is as follows:

1.1 Acquiring source data: obtaining time series data from an enterprise database or from the public database of the corresponding government body;

1.2 Classifying the data according to the data types, and dividing the data into numerical time series data and text time series data;

1.3 A new classified data source is formed for the new data using the sliding window principle.
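The classification in steps 1.1 and 1.2 can be sketched as follows. This is a minimal illustration, not the patented implementation: the record layout (a list of `(timestamp, value)` pairs) and the function name `split_by_type` are assumptions made for the example.

```python
from datetime import date

def split_by_type(records):
    """Split (timestamp, value) records into a numerical and a text series."""
    numeric, text = [], []
    for ts, value in records:
        # bool is a subclass of int in Python, so exclude it explicitly
        if isinstance(value, (int, float)) and not isinstance(value, bool):
            numeric.append((ts, float(value)))
        elif isinstance(value, str):
            text.append((ts, value))
    # keep both series in chronological order
    numeric.sort(key=lambda r: r[0])
    text.sort(key=lambda r: r[0])
    return numeric, text

# Hypothetical records mixing yearly figures and a meeting note
records = [
    (date(2020, 2, 1), "board meeting: expand digital exhibition business"),
    (date(2019, 12, 31), 1250.0),
    (date(2020, 1, 31), 98),
]
numeric, text = split_by_type(records)
```

Values that are neither numbers nor strings are silently dropped here; a production pipeline would more likely log or reject them.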

In the step 2), the specific method is as follows:

2.1 Classifying the numerical time series data according to the time granularity, and dividing the numerical time series data into macro data Tg with large time granularity and micro data Ts with small time granularity, such as year data and month data or month data and day data; wherein the macroscopic data includes time series data of multiple sources, such as annual income of enterprises, annual financing amount, annual market share and the like; the microscopic data only comprises one type of time series data, such as important business month sales or daily sales;

2.2 For macroscopic data, the data are standardized by z-normalization, and the Euclidean distance between the sample and the industry standard data classified according to the Boston matrix is calculated (formula not reproduced in this text). In the formula, m is the number of data points in a time series; $\delta_x$ and $\delta_y$ are the variances of X and Y; $\mu_x$ and $\mu_y$ are the means of X and Y; X is the standardized sample data and Y is the standard data; x is the value of series X at a given moment, and y is the value of series Y at a given moment;

2.3 Normalizing the distances and using the normalized distance weights as the input to the D-S evidence theory matrix; evidence fusion is then performed according to D-S evidence theory to obtain a new comprehensive support degree $\omega_i$, where i denotes the i-th category of the industry;

2.4 According to the new comprehensive support degree, the comprehensive weight of the category to which the sample belongs on the macro level is obtained, and the classification characteristics of the time sequence can be measured.
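Steps 2.2 to 2.4 can be sketched as below. Several points are assumptions, since the text does not fix them: distances are converted to masses by normalized inverse distance, each indicator series is treated as one body of evidence, and mass is placed only on single categories (in which case Dempster's rule reduces to a normalized product). All function names and the sample numbers are hypothetical.

```python
import numpy as np

def znorm(series):
    """Z-normalize a series; a constant series maps to all zeros."""
    x = np.asarray(series, dtype=float)
    sd = x.std()
    return (x - x.mean()) / sd if sd > 0 else np.zeros_like(x)

def znorm_dist(x, y):
    """Euclidean distance between two z-normalized, equal-length series."""
    return float(np.linalg.norm(znorm(x) - znorm(y)))

def masses_from_distances(distances, eps=1e-9):
    """ASSUMPTION: masses are normalized inverse distances (closer = more mass)."""
    w = 1.0 / (np.asarray(distances, dtype=float) + eps)
    return w / w.sum()

def dempster_singletons(mass_rows):
    """Dempster's rule when every body of evidence only supports single classes."""
    joint = np.prod(np.asarray(mass_rows, dtype=float), axis=0)
    total = joint.sum()  # equals 1 - K, where K is the conflict
    if total == 0.0:
        raise ValueError("evidence is totally conflicting")
    return joint / total

# Hypothetical data: two indicator series for a sample and for two class standards
sample = [[1, 2, 3, 4, 5], [5, 3, 4, 2, 1]]
standards = {
    "star": [[2, 4, 6, 8, 11], [9, 7, 8, 5, 3]],
    "dog": [[5, 4, 3, 2, 1], [1, 2, 2, 3, 5]],
}
classes = list(standards)
mass_rows = [
    masses_from_distances([znorm_dist(x, standards[c][j]) for c in classes])
    for j, x in enumerate(sample)
]
omega = dict(zip(classes, dempster_singletons(mass_rows)))  # support per class
```

Other distance-to-mass mappings (for example, a softmax over negative distances) would fit the same structure; only `masses_from_distances` would change.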

In the step 3), the specific method is as follows:

3.1 For microscopic time series data, assume that k shapelets have already been obtained from the standard data, written as $S = \langle s_1, s_2, \dots, s_k \rangle$, where $s_i$ has length $L_i$; the distances between the k shapelets and the sample time series are calculated in turn, where the distance is defined as

$dist_i = \min(dist(Sub(Ts)_{L_i}, s_i)), \quad i = 1, 2, \dots, k$

where $Sub(Ts)_{L_i}$ denotes a subsequence of length $L_i$ of the time series Ts;

3.2 Calculating the weight $u_i = L_i / (L_1 + L_2 + \dots + L_k)$;

3.3 Multiplying each weight by the reciprocal of the corresponding distance and summing the results; the trend feature of the microscopic data is then obtained from the meaning of the distance and the properties of the shapelets.
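Steps 3.1 to 3.3 can be sketched as follows. For brevity the sketch uses the plain Euclidean distance between each window and the shapelet rather than a z-normalized one, and it adds a small `eps` to avoid division by zero; both choices, and the function names, are assumptions of the example.

```python
import numpy as np

def min_subseq_dist(ts, shapelet):
    """dist_i: minimum distance between a shapelet and any same-length window of ts."""
    ts = np.asarray(ts, dtype=float)
    s = np.asarray(shapelet, dtype=float)
    L = len(s)
    if L > len(ts):
        raise ValueError("shapelet is longer than the time series")
    return min(
        float(np.linalg.norm(ts[i:i + L] - s)) for i in range(len(ts) - L + 1)
    )

def trend_feature(ts, shapelets, eps=1e-9):
    """Sum over shapelets of u_i / dist_i, with u_i = L_i / (L_1 + ... + L_k)."""
    lengths = np.array([len(s) for s in shapelets], dtype=float)
    u = lengths / lengths.sum()
    d = np.array([min_subseq_dist(ts, s) for s in shapelets])
    return float(np.sum(u / (d + eps)))  # eps guards against an exact match

# Hypothetical daily sales and two shapelets of length 3 and 2
ts = [0, 1, 2, 3, 2, 1]
shapelets = [[1, 2, 4], [3, 3]]
feature = trend_feature(ts, shapelets)
```

With this reciprocal form, a larger value means the sample follows the reference shapelets more closely, and longer shapelets contribute more through their weights.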

In the step 4), the specific method is as follows:

4.1 For a text time series, a time attribute is added to each document according to the time at which it was obtained, giving a data set of documents $D = \{(t_1, d_1), (t_2, d_2), \dots, (t_n, d_n)\}$, where $(t_1, d_1)$ denotes document $d_1$ with time attribute $t_1$;

4.2 Using a bag-of-words model to collect the high-frequency words of the text data: word frequencies are counted after each document is segmented into words; the maximum capacity of the bag of words is set to MAX_f; the minimum word frequency is set to min_df, which filters out words that occur too rarely; and the maximum word frequency is set to max_df, which filters out words with abnormally high frequency; the bag-of-words model thus forms a dictionary of high-frequency words;

4.3 Based on the dictionary, calculating $TF_i$, the word-frequency information of the words in the i-th document that are in the dictionary; according to the time attribute of the document, the $TF_i$ vector of each document is multiplied by the document time weight $\lambda_i$, which is larger for more recent documents, to obtain $TF_i'$, i.e. the corrected word-frequency vector of the i-th document after its time attribute is taken into account;

4.4 Calculating the inverse document frequency IDF by $IDF(a) = \ln((1 + n) / n_a)$, where a denotes a word, n denotes the total number of documents, and $n_a$ is the number of documents in which word a appears; the TF' vector of each document is combined with the IDF of the vocabulary to obtain $\delta_j$, the frequency of each dictionary word over the whole of D with the time attribute taken into account; a threshold min_dic is set for a second filtering that removes words whose $\delta_j$ is smaller than min_dic, finally yielding the most recent high-frequency words; here $\delta_j$ is the final frequency information of the j-th word.
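Steps 4.2 to 4.4 can be sketched with the standard library alone. Several details are assumptions, because the text leaves them open: documents arrive already segmented into tokens, `min_df` is a document count while `max_df` is a fraction of documents, and $\delta_j$ is formed by summing the time-weighted term frequency times IDF over all documents. The function names are hypothetical.

```python
import math
from collections import Counter

def build_dictionary(docs_tokens, max_f, min_df, max_df):
    """Bag-of-words dictionary: drop rare and overly common words, cap the size."""
    n = len(docs_tokens)
    df = Counter(w for toks in docs_tokens for w in set(toks))  # document frequency
    tf = Counter(w for toks in docs_tokens for w in toks)       # total occurrences
    kept = [w for w in tf if df[w] >= min_df and df[w] / n <= max_df]
    kept.sort(key=lambda w: (-tf[w], w))  # most frequent first, ties alphabetical
    return kept[:max_f], df

def hot_words(docs_tokens, time_weights, max_f=1000, min_df=1, max_df=1.0,
              min_dic=0.0):
    """Score each dictionary word by sum_i lambda_i * TF_i(word) * IDF(word)."""
    n = len(docs_tokens)
    dictionary, df = build_dictionary(docs_tokens, max_f, min_df, max_df)
    delta = dict.fromkeys(dictionary, 0.0)
    for toks, lam in zip(docs_tokens, time_weights):
        counts = Counter(toks)
        for w in dictionary:
            delta[w] += lam * counts[w] * math.log((1 + n) / df[w])
    # second filtering with the threshold min_dic
    return {w: s for w, s in delta.items() if s >= min_dic}

# Hypothetical tokenized documents, oldest first, with larger weights for newer ones
docs = [["ai", "museum", "the"], ["vr", "museum", "the"], ["vr", "vr", "the"]]
weights = [0.2, 0.3, 0.5]
result = hot_words(docs, weights, max_df=0.9, min_dic=0.3)
```

In this toy run, "the" is removed by `max_df` because it appears in every document, and "ai" falls below `min_dic` because it appears only in the oldest, lowest-weighted document.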

In the step 5), the specific method is as follows:

5.1 Storing the obtained features and hot spot vocabulary;

5.2 If there is newly acquired data, re-executing steps 1-4 using the sliding window principle; otherwise, stopping processing.
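The sliding window principle in step 5 can be sketched as a generator that emits a fresh window whenever enough new data have arrived. The window size, the step, and the name `sliding_windows` are assumptions of the example; the text does not specify them.

```python
from collections import deque

def sliding_windows(stream, size, step=1):
    """Yield the latest `size` items each time `step` new items have arrived."""
    window = deque(maxlen=size)  # old items fall out automatically
    since_last = 0
    for item in stream:
        window.append(item)
        since_last += 1
        if len(window) == size and since_last >= step:
            since_last = 0
            yield list(window)
    # the generator ends when the stream has no more data

# Hypothetical use: re-run feature extraction on each window
windows = list(sliding_windows(range(1, 8), size=3, step=2))
```

Each yielded window would be passed through steps 1-4 again, and the loop ends naturally once the stream is exhausted.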

The beneficial effects of the invention are as follows:

In view of the existing problems, the invention provides a feature extraction method suitable for time series data in the cultural science and technology fusion field. First, time series data are obtained from a target database and classified by data type into text data and numerical data; the numerical data are classified by time granularity into macroscopic and microscopic time series data. After the macroscopic data are standardized, the similarity between the sample data and industry standard data is calculated, and the normalized similarity is used as the input to D-S evidence theory for evidence fusion to obtain class features. Given a set of optimal shapelets obtained from standard time series, the distance between a microscopic data sample and each shapelet is calculated to obtain trend features. For text data, a bag-of-words model is first used to obtain a high-frequency word set, which is then filtered a second time with an improved TF-IDF to obtain hot words. If new data exist, steps 1-4 are re-executed using a sliding window; processing stops when there is no new data. The method can process and analyze time series data quickly and helps enterprises make strategic decisions.

Drawings

FIG. 1 is a flow chart of the method of the present invention.

Detailed Description

The feature extraction method suitable for the time series data in the cultural science and technology fusion field comprises the following steps:

the specific method comprises the following steps:

Step 2), classifying the numerical data according to time granularity to obtain macroscopic time sequence data and microscopic time sequence data; after macroscopic data is standardized, calculating the similarity between a sample and industry standard data, and carrying out evidence fusion by taking the normalized similarity as the input of a D-S evidence theory to obtain class characteristics;

the specific method comprises the following steps:

2.2 For macroscopic data, the data are standardized by z-normalization, and the Euclidean distance between the sample and the industry standard data classified according to the Boston matrix is calculated (formula not reproduced in this text). In the formula, m is the number of data points in a time series; $\delta_x$, $\delta_y$ and $\mu_x$, $\mu_y$ are the variances and means of X and Y, respectively; X refers to the standardized sample data and Y to the standard data; x refers to the value of series X at a given moment, and y to the value of series Y at a given moment;

2.3 Normalizing the distances and using the normalized distance weights as the input to the D-S evidence theory matrix; evidence fusion is then performed according to D-S evidence theory to obtain a new comprehensive support degree $\omega_i$, where i denotes the i-th category of the industry;

the specific method comprises the following steps:

3.1 For microscopic time series data, assume that k shapelets have already been obtained from the standard data, written as $S = \langle s_1, s_2, \dots, s_k \rangle$, where $s_i$ has length $L_i$; the distances between the k shapelets and the sample time series are calculated in turn, where the distance is defined as $dist_i = \min(dist(Sub(Ts)_{L_i}, s_i))$ $(i = 1, 2, \dots, k)$, where $Sub(Ts)_{L_i}$ denotes a subsequence of length $L_i$ of the time series Ts, and dist(X, Y) is calculated in the same way as in step 2.2;

3.2 Calculating the weight $u_i = L_i / (L_1 + L_2 + \dots + L_k)$;

the specific method comprises the following steps:

4.2 Using a bag-of-words model to collect the high-frequency words of the text data. Word frequencies are counted after each document is segmented into words. Following expert opinion, the maximum capacity of the bag of words is set to MAX_f; the minimum word frequency is set to min_df, which filters out words that occur too rarely; and the maximum word frequency is set to max_df, which filters out words with abnormally high frequency, such as conjunctions or common words like 'yes'. The bag-of-words model thus forms a dictionary of high-frequency words;

4.3 Based on the dictionary, calculating $TF_i$, the word-frequency information of the words in the i-th document that are in the dictionary; according to the time attribute of the document, the $TF_i$ vector of each document is multiplied by the document time weight $\lambda_i$, which is set under the guidance of an economics expert and is larger for more recent documents, to obtain $TF_i'$, i.e. the corrected word-frequency vector of the i-th document after its time attribute is taken into account;

Step 5), if new data exist, re-executing the steps 1-4 by using the sliding window; stopping when no new data exists;

the specific method comprises the following steps:

5.1 Storing the obtained features and hot spot vocabulary;

5.2 If there is newly obtained data, re-executing step 1-4 by utilizing the sliding window principle, otherwise stopping processing.

The time series data described in the invention are the various data that an enterprise accumulates itself and that are useful for its development. Examples include four types of yearly time series data (income data, financing data, liability data, and user-count change data), as well as monthly or daily sales data over a period such as five years.

Example 1:

Suppose that company B wants to extract from its historical data information that is useful for strategic decisions about its development. Its time series data mainly comprise four types of yearly data for the last five years (income data, financing data, liability data, and user-count change data), daily sales data over the same five years, and its own meeting records together with many news reports.

First, the numerical data are classified by time granularity: the macroscopic data Tg are the four types of yearly data, and the microscopic data Ts are the daily data. The macroscopic data are processed first. According to the Boston matrix used in economics, industries can be divided into four categories: stars, dogs, question marks, and cash cows; here the same theory is used to classify companies. The data are then z-normalized, and the distance $Dist(X_j, Y_{ij})$ is calculated, where $X_j$ is the j-th normalized sample time series (j = 1, 2, 3, 4) and $Y_{ij}$ is the j-th standard series of the i-th category, drawn from the same kind of source as the sample series.
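For reference, the distance used in this step can be written out explicitly. The block below is a sketch under two assumptions that the text does not confirm: that $\delta$ denotes the standard deviation (the quantity that z-normalization divides by), and that X and Y both have length m. The second equality is the standard identity relating this distance to the Pearson correlation $\rho$ of the two series.

```latex
\[
  \mathrm{Dist}(X, Y)
  = \sqrt{\sum_{t=1}^{m}
      \left( \frac{x_t - \mu_x}{\delta_x} - \frac{y_t - \mu_y}{\delta_y} \right)^{2}}
  = \sqrt{2m\,\bigl(1 - \rho(X, Y)\bigr)},
\]
\[
  \rho(X, Y) = \frac{\sum_{t=1}^{m} x_t y_t - m\,\mu_x \mu_y}{m\,\delta_x \delta_y}.
\]
```

Under this form the distance ranges from 0 (perfectly correlated series) to $2\sqrt{m}$ (perfectly anti-correlated series), which makes distances for different indicators comparable before normalization.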

These distances are normalized and used as the input to D-S evidence theory; evidence fusion then yields the comprehensive probability $\omega_i$ that the company belongs to each category (i denotes the i-th category in the Boston matrix), which gives the class feature of the company from a macroscopic perspective.
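The text does not spell out the fusion rule itself. As a point of reference, Dempster's rule of combination for two mass functions $m_1$ and $m_2$ is given below; treating each of the four indicator series as one body of evidence, and placing mass only on single Boston-matrix categories, are assumptions of this sketch.

```latex
\[
  (m_1 \oplus m_2)(A) = \frac{1}{1 - K} \sum_{B \cap C = A} m_1(B)\, m_2(C),
  \qquad A \neq \emptyset,
\]
\[
  K = \sum_{B \cap C = \emptyset} m_1(B)\, m_2(C).
\]
```

When all mass sits on single categories, the rule reduces to $\omega_i \propto \prod_j m_j(i)$, normalized so that the $\omega_i$ sum to 1.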

Next, the microscopic data are processed. Assume that a complete, highly discriminative set S of optimal shapelets has already been obtained from the last five years of daily sales data of a standard enterprise chosen for comparison, that S contains k shapelets, and that each shapelet $s_i$ has length $L_i$. The distances between the k shapelets and the sample time series are then calculated in turn: the Euclidean distance between each subsequence of length $L_i$ in Ts and $s_i$ is computed repeatedly, and the minimum is taken as $dist_i$;

Then the weight $u_i$ of $s_i$ with respect to Ts is calculated. Since a smaller distance means that the two sequences are more similar, each weight is multiplied by its distance and the k resulting terms are combined to obtain the trend feature; the smaller the result, the closer the sales of company B are to the sales of the company it is compared with.

Finally, for the meeting records and external news reports of company B over the last five years, the bag-of-words model and the improved TF-IDF are used to obtain the hot words of the most recent period; these words indicate what company B and outside observers have focused on in recent years. After this processing is finished, if new data are to be processed, they are handled in the same way using the sliding window principle; otherwise, processing stops and the mined information is applied to decision support.

Claims

1. A feature extraction method suitable for time series data in the cultural science and technology fusion field, characterized by comprising the following steps:

the specific method comprises the following steps:

2.1 Classifying the numerical time series data according to the time granularity, and dividing the numerical time series data into macro data Tg with large time granularity and micro data Ts with small time granularity; wherein, the macroscopic data comprises time series data of a plurality of sources, namely financial data and market data reflecting the business operation condition of enterprises from different angles, and the microscopic data only comprises one type of time series data, namely the most main index data for measuring the business operation capability of enterprises;

2.4 According to the new comprehensive support degree, obtaining the comprehensive weight of the category to which the sample belongs on the macro level, and measuring the classification characteristic of the time sequence;

2. The feature extraction method for time series data applicable to the field of cultural science and technology fusion according to claim 1, wherein the feature extraction method comprises the following steps:

in the step 1), the specific method is as follows:

3. The feature extraction method for time series data applicable to the field of cultural science and technology fusion according to claim 1, wherein the feature extraction method comprises the following steps:

in the step 3), the specific method is as follows:

$dist_i = \min(dist(Sub(Ts)_{L_i}, s_i)), \quad i = 1, 2, \dots, k$

3.2 Calculating the weight $u_i = L_i / (L_1 + L_2 + \dots + L_k)$;

4. The feature extraction method for time series data applicable to the field of cultural science and technology fusion according to claim 1, wherein the feature extraction method comprises the following steps:

in the step 4), the specific method is as follows:

4.3 Based on the dictionary, calculating $TF_i$, the word-frequency information of the words in the i-th document that are in the dictionary; according to the time attribute of the document, the $TF_i$ vector of each document is multiplied by the document time weight $\lambda_i$, which is larger for more recent documents, to obtain $TF_i'$, the corrected word-frequency vector of the i-th document after its time attribute is taken into account;

5. The feature extraction method for time series data applicable to the field of cultural science and technology fusion according to claim 1, wherein the feature extraction method comprises the following steps:

in the step 5), the specific method is as follows:

5.1 Storing the obtained features and hot spot vocabulary;