CN111625578B - Feature extraction method suitable for time series data in cultural science and technology fusion field

Info

Publication number
CN111625578B
Authority
CN
China
Prior art keywords
data
time
time series
document
frequency
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010453118.5A
Other languages
Chinese (zh)
Other versions
CN111625578A (en)
Inventor
王妍
田玲玲
刘迪
刘德伟
谭爱平
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Liaoning University
Original Assignee
Liaoning University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Liaoning University
Priority to CN202010453118.5A
Publication of CN111625578A
Application granted
Publication of CN111625578B
Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2458 Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
    • G06F16/2465 Query processing support for facilitating data mining operations in structured databases
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2458 Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
    • G06F16/2474 Sequence data queries, e.g. querying versioned data
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/216 Parsing using statistical methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • General Physics & Mathematics (AREA)
  • Probability & Statistics with Applications (AREA)
  • General Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Fuzzy Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Health & Medical Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

A feature extraction method suitable for time series data in the cultural science and technology fusion field comprises the following steps: 1) obtaining time series data from a target database and classifying the series by data type into text data and numerical data; 2) classifying the numerical data by time granularity into macroscopic and microscopic time series data; after the macroscopic data is standardized, calculating the similarity between the sample and industry standard data, and performing evidence fusion with the normalized similarity as the input of D-S (Dempster-Shafer) evidence theory to obtain category features; 3) given the optimal shapelet set of an existing standard time series, calculating the distance between the microscopic data and each shapelet to obtain trend features; 4) for the text data, first using a bag-of-words model to obtain a high-frequency word set, then filtering that set a second time with an improved TF-IDF to obtain a hot-word set; 5) re-executing steps 1-4 on new data using a sliding window, and stopping when there is no new data. The method processes and analyzes time series data quickly and helps enterprises make strategic decisions.

Description

Feature extraction method suitable for time series data in cultural science and technology fusion field
Technical Field
The invention relates to time series data in the field of cultural science and technology fusion. In view of the characteristics of such data and the role its hidden features play in industry prediction, the invention provides a time series data mining method based on time granularity division.
Background
With the progress of science and technology and the adjustment of China's economic development strategy, developing emerging industries and enterprises that take culture as the core and technology as the means has become a trend promoted throughout the country. On the one hand, predictions of an enterprise's development hot spots mainly rely on financial data or on modeling user behavior, and pay little attention to mining time series data from multiple angles. On the other hand, general time series data mining does not account for the characteristics of cultural science and technology fusion data and does not mine time series data according to time granularity. For an industry or enterprise, time series data carries a large amount of information that is particularly important when formulating development strategy, and many researchers aim to mine that information as fully as possible.
Time series data in the cultural science and technology fusion field has its own characteristics: series of different time granularity carry different important information, and sample data is scarce. Meanwhile, common time series feature mining algorithms mine information from a single angle only and cannot fully extract the value of the data. These limitations hinder enterprises that want to mine their time series data thoroughly to support decisions.
Disclosure of Invention
To solve the problems in the prior art, the invention provides a feature extraction method suitable for time series data in the field of cultural science and technology fusion. First, time series data is obtained from a target database and the series are classified by data type into text data and numerical data. The numerical data is classified by time granularity into macroscopic and microscopic time series data. After the macroscopic data is standardized, the similarity between the sample data and industry standard data is calculated, and evidence fusion is performed with the normalized similarity as the input of D-S evidence theory to obtain category features. Given the optimal shapelet set of a standard time series, the distance between the microscopic data sample and each shapelet is calculated to obtain trend features. For the text data, a bag-of-words model is first used to obtain a high-frequency word set, which is then filtered a second time with an improved TF-IDF to obtain hot words. If new data exists, steps 1-4 are re-executed using a sliding window; processing stops when there is no new data.
In order to achieve the above purpose, the technical scheme adopted by the invention is as follows:
The feature extraction method suitable for time series data in the cultural science and technology fusion field comprises the following steps:
step 1), obtaining time series data from a target database and classifying the series by data type into text data and numerical data;
step 2), classifying the numerical data by time granularity into macroscopic and microscopic time series data; after the macroscopic data is standardized, calculating the similarity between the sample and industry standard data, and performing evidence fusion with the normalized similarity as the input of D-S evidence theory to obtain category features;
step 3), given the optimal shapelet set of a standard time series, calculating the distance between the microscopic data sample and each shapelet to obtain trend features;
step 4), for the text data, first using a bag-of-words model to obtain a high-frequency word set, then filtering that set a second time with an improved TF-IDF to obtain hot words;
step 5), if new data exists, re-executing steps 1-4 using a sliding window, and stopping when there is no new data.
In the step 1), the specific method is as follows:
1.1) Acquire source data: obtain time series data from an enterprise database or from the public database of the relevant government body;
1.2) Classify the data by data type into numerical time series data and text time series data;
1.3) For new data, form a new classified data source using the sliding-window principle.
In the step 2), the specific method is as follows:
2.1) Classify the numerical time series data by time granularity into macroscopic data Tg with large time granularity and microscopic data Ts with small time granularity, for example yearly and monthly data, or monthly and daily data; the macroscopic data includes time series from multiple sources, such as annual income, annual financing amount, and annual market share of the enterprise; the microscopic data includes only one type of time series, such as monthly or daily sales of an important business;
2.2) For the macroscopic data, normalize the data by z-normalization and calculate the Euclidean distance between the sample and the industry standard data classified according to the Boston matrix [formula not reproduced in the source text],
where $m$ is the number of data points in a time series, $\delta_x$ is the variance of $X$, $\delta_y$ is the variance of $Y$, $\mu_x$ is the mean of $X$, $\mu_y$ is the mean of $Y$, $X$ is the standardized sample data, $Y$ is the standard data, $x$ is the value of time series $X$ at a given moment, and $y$ is the value of time series $Y$ at a given moment;
2.3) Normalize the distances and take the normalized distance weights as the input of the D-S evidence theory matrix; then perform evidence fusion according to D-S evidence theory to obtain a new comprehensive support degree $\omega_i$, where $i$ denotes the i-th category of the industry;
2.4) From the new comprehensive support degree, obtain the comprehensive weight of the category to which the sample belongs at the macro level; this measures the classification feature of the time series.
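The formula for step 2.2 does not survive in this text. As a reading aid only: one widely used closed form for the Euclidean distance between two z-normalized series of length $m$ uses exactly the quantities listed above. It requires the standard deviations of $X$ and $Y$ (written $\sigma_x$, $\sigma_y$ below), whereas the text describes $\delta$ as a variance, so treating this as the patent's own formula is an assumption.

```latex
\mathrm{dist}(X, Y) = \sqrt{2m\,\bigl(1 - C(X, Y)\bigr)},
\qquad
C(X, Y) = \frac{\sum_{i=1}^{m} x_i y_i - m\,\mu_x \mu_y}{m\,\sigma_x \sigma_y}
```

Here $C(X, Y)$ is the Pearson correlation of the two series, so series with identical shape give a distance of 0 and perfectly anti-correlated series give the maximum distance $2\sqrt{m}$.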
In the step 3), the specific method is as follows:
3.1) For the microscopic time series data, assume that $k$ shapelets of the standard data have already been obtained, written $S = \langle s_1, s_2, \ldots, s_k \rangle$, where $s_i$ has length $L_i$. Compute in a loop the distance between each of the $k$ shapelets and the sample time series, defined as
$\mathrm{dist}_i = \min\bigl(\mathrm{dist}(\mathrm{Sub}(Ts)_{L_i}, s_i)\bigr), \quad i = 1, 2, \ldots, k$
where $\mathrm{Sub}(Ts)_{L_i}$ is a subsequence of the time series Ts of length $L_i$;
3.2) Calculate the weights $u_i = L_i / (L_1 + L_2 + \cdots + L_k)$;
3.3) Multiply each weight by the reciprocal of the corresponding distance and sum the products; given the meaning of the distance and the properties of shapelets, the result is the trend feature of the microscopic data.
In the step 4), the specific method is as follows:
4.1) For the text time series, attach a time attribute to each document according to when it was obtained, giving a data set of documents $D = \{(t_1, d_1), (t_2, d_2), \ldots, (t_n, d_n)\}$, where $(t_1, d_1)$ denotes document $d_1$ with time attribute $t_1$;
4.2) Use a bag-of-words model to count the high-frequency vocabulary of the text data: segment each document into words and count word frequencies; set the maximum capacity of the bag of words to MAX_f; set the minimum word frequency to min_df to filter out words that appear in almost no documents and occur very few times; set the maximum word frequency to max_df to filter out words with abnormally high frequency; the bag-of-words model thus forms a high-frequency vocabulary dictionary;
4.3) Based on the dictionary, calculate $TF_i$, the word-frequency vector of the i-th document over the dictionary words. According to the document's time attribute, multiply $TF_i$ by the document time weight $\lambda_i$, where more recent documents receive larger weights, to obtain $TF_i'$, the corrected word-frequency vector of the i-th document after its time attribute is taken into account;
4.4) Calculate the inverse document frequency with $\mathrm{IDF}(a) = \ln\bigl((1+n)/n_a\bigr)$, where $a$ is a word, $n$ is the total number of documents, and $n_a$ is the number of documents in which word $a$ appears. Combine the $TF'$ vector of each document with the IDF of the vocabulary to obtain $\delta_j$, the frequency of each dictionary word over the whole of $D$ with the time attribute taken into account. Set a threshold min_dic and filter out words whose $\delta_j$ is smaller than min_dic, finally obtaining the most recent high-frequency vocabulary; $\delta_j$ is the final frequency information of the j-th word.
In the step 5), the specific method is as follows:
5.1) Store the obtained features and hot-spot vocabulary;
5.2) If there is newly acquired data, re-execute steps 1-4 using the sliding-window principle; otherwise stop processing.
The beneficial effects of the invention are as follows:
To address the existing problems, the invention provides a feature extraction method suitable for time series data in the field of cultural science and technology fusion. First, time series data is obtained from a target database and the series are classified by data type into text data and numerical data. The numerical data is classified by time granularity into macroscopic and microscopic time series data. After the macroscopic data is standardized, the similarity between the sample data and industry standard data is calculated, and evidence fusion is performed with the normalized similarity as the input of D-S evidence theory to obtain category features. Given the optimal shapelet set of a standard time series, the distance between the microscopic data sample and each shapelet is calculated to obtain trend features. For the text data, a bag-of-words model is first used to obtain a high-frequency word set, which is then filtered a second time with an improved TF-IDF to obtain hot words. If new data exists, steps 1-4 are re-executed using a sliding window; processing stops when there is no new data. The method processes and analyzes time series data quickly and helps enterprises make strategic decisions.
Drawings
FIG. 1 is a flow chart of the method of the present invention.
Detailed Description
The feature extraction method suitable for time series data in the cultural science and technology fusion field comprises the following steps:
Step 1), obtaining time series data from a target database and classifying the series by data type into text data and numerical data;
the specific method comprises the following steps:
1.1) Acquire source data: obtain time series data from an enterprise database or from the public database of the relevant government body;
1.2) Classify the data by data type into numerical time series data and text time series data;
1.3) For new data, form a new classified data source using the sliding-window principle.
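Step 1.2 can be sketched as a simple split on the type of each observation. This is a minimal illustration only: the `Record` class, its field names, and the sample values are hypothetical and not taken from the patent, which does not say how records are stored.

```python
from dataclasses import dataclass
from datetime import date
from typing import Union


@dataclass
class Record:
    """One timestamped observation (hypothetical layout)."""
    timestamp: date
    value: Union[float, str]  # a numeric measurement or a document's text


def split_by_type(records):
    """Split records into numerical and text time series (step 1.2)."""
    numeric = [
        r for r in records
        if isinstance(r.value, (int, float)) and not isinstance(r.value, bool)
    ]
    text = [r for r in records if isinstance(r.value, str)]
    return numeric, text


# Hypothetical sample data
records = [
    Record(date(2019, 1, 1), 120.5),
    Record(date(2019, 1, 1), "meeting notes on the new product line"),
    Record(date(2019, 1, 2), 98.0),
]
numeric, text = split_by_type(records)
```

In practice the split would more likely follow the column types of the source database; the type check here only stands in for that decision.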
Step 2), classifying the numerical data by time granularity into macroscopic and microscopic time series data; after the macroscopic data is standardized, calculating the similarity between the sample and industry standard data, and performing evidence fusion with the normalized similarity as the input of D-S evidence theory to obtain category features;
the specific method comprises the following steps:
2.1) Classify the numerical time series data by time granularity into macroscopic data Tg with large time granularity and microscopic data Ts with small time granularity, for example yearly and monthly data, or monthly and daily data; the macroscopic data includes time series from multiple sources, such as annual income, annual financing amount, and annual market share of the enterprise; the microscopic data includes only one type of time series, such as monthly or daily sales of an important business;
2.2) For the macroscopic data, normalize the data by z-normalization and calculate the Euclidean distance between the sample and the industry standard data classified according to the Boston matrix [formula not reproduced in the source text], where $m$ is the number of data points in a time series; $\delta_x$, $\delta_y$ and $\mu_x$, $\mu_y$ are the variances and means of $X$ and $Y$, respectively; $X$ is the standardized sample data, $Y$ is the standard data, $x$ is the value of time series $X$ at a given moment, and $y$ is the value of time series $Y$ at a given moment;
2.3) Normalize the distances and take the normalized distance weights as the input of the D-S evidence theory matrix; then perform evidence fusion according to D-S evidence theory to obtain a new comprehensive support degree $\omega_i$, where $i$ denotes the i-th category of the industry;
2.4) From the new comprehensive support degree, obtain the comprehensive weight of the category to which the sample belongs at the macro level; this measures the classification feature of the time series.
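Steps 2.2 and 2.3 can be sketched as follows, under stated assumptions. The distance is computed directly as the Euclidean norm of the difference of the two z-normalized series. The way distances are turned into normalized weights (inverse distance, rescaled to sum to 1) is an assumption, since the text only says the distances are normalized. All series values and function names are hypothetical.

```python
import numpy as np


def z_normalize(series):
    """Rescale a series to zero mean and unit (population) standard deviation."""
    x = np.asarray(series, dtype=float)
    std = x.std()
    if std == 0:
        return np.zeros_like(x)  # a constant series has no shape to compare
    return (x - x.mean()) / std


def znorm_distance(sample, standard):
    """Euclidean distance between two z-normalized series of equal length."""
    return float(np.linalg.norm(z_normalize(sample) - z_normalize(standard)))


def distance_to_weights(distances, eps=1e-9):
    """Assumed normalization: inverse distances rescaled to sum to 1,
    so that a closer standard series receives a larger weight."""
    inv = 1.0 / (np.asarray(distances, dtype=float) + eps)
    return inv / inv.sum()


# Hypothetical yearly income of the sample company, and one hypothetical
# standard series per Boston-matrix category.
sample = [10, 12, 15, 19, 24]
standards = {
    "star": [8, 11, 13, 18, 22],
    "cash cow": [20, 21, 20, 21, 20],
    "question mark": [5, 9, 6, 11, 8],
    "dog": [12, 11, 9, 8, 6],
}
distances = [znorm_distance(sample, s) for s in standards.values()]
weights = distance_to_weights(distances)
```

Each macroscopic source (income, financing, and so on) would yield one such weight vector, and those vectors are what the evidence fusion in step 2.3 combines.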
Step 3), given the optimal shapelet set of a standard time series, calculating the distance between the microscopic data sample and each shapelet to obtain trend features;
the specific method comprises the following steps:
3.1) For the microscopic time series data, assume that $k$ shapelets of the standard data have already been obtained, written $S = \langle s_1, s_2, \ldots, s_k \rangle$, where $s_i$ has length $L_i$. Compute in a loop the distance between each of the $k$ shapelets and the sample time series, defined as $\mathrm{dist}_i = \min\bigl(\mathrm{dist}(\mathrm{Sub}(Ts)_{L_i}, s_i)\bigr)$, $i = 1, 2, \ldots, k$, where $\mathrm{Sub}(Ts)_{L_i}$ is a subsequence of the time series Ts of length $L_i$, and $\mathrm{dist}(X, Y)$ is calculated in the same way as in step 2.2);
3.2) Calculate the weights $u_i = L_i / (L_1 + L_2 + \cdots + L_k)$;
3.3) Multiply each weight by the reciprocal of the corresponding distance and sum the products; given the meaning of the distance and the properties of shapelets, the result is the trend feature of the microscopic data.
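Steps 3.1 to 3.3 can be sketched as below. The sketch assumes that every shapelet is no longer than the series and that the distance is the z-normalized Euclidean distance of step 2.2. The small `eps` term that guards against division by zero is an addition not found in the text, and the sales figures and shapelets are hypothetical.

```python
import numpy as np


def z_normalize(x):
    """Rescale to zero mean and unit standard deviation (zeros if constant)."""
    x = np.asarray(x, dtype=float)
    std = x.std()
    return np.zeros_like(x) if std == 0 else (x - x.mean()) / std


def shapelet_distance(ts, shapelet):
    """dist_i: minimum distance between the shapelet and any subsequence
    of ts that has the same length as the shapelet (step 3.1)."""
    ts = np.asarray(ts, dtype=float)
    length = len(shapelet)
    s = z_normalize(shapelet)
    return min(
        float(np.linalg.norm(z_normalize(ts[j:j + length]) - s))
        for j in range(len(ts) - length + 1)
    )


def trend_feature(ts, shapelets, eps=1e-9):
    """Sum over shapelets of u_i * (1 / dist_i) (steps 3.2 and 3.3)."""
    lengths = np.array([len(s) for s in shapelets], dtype=float)
    weights = lengths / lengths.sum()  # u_i = L_i / (L_1 + ... + L_k)
    dists = np.array([shapelet_distance(ts, s) for s in shapelets])
    return float(np.sum(weights / (dists + eps)))  # eps is an added safeguard


# Hypothetical daily sales and two hypothetical shapelets
daily_sales = [5, 6, 8, 7, 9, 12, 11, 10]
shapelets = [[1, 2, 4], [4, 3, 2, 1]]
feature = trend_feature(daily_sales, shapelets)
```

With this definition a larger value means the sample series contains subsequences that match the standard shapelets more closely, and longer shapelets count for more.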
Step 4), for the text data, first using a bag-of-words model to obtain a high-frequency word set, then filtering that set a second time with an improved TF-IDF to obtain hot words;
the specific method comprises the following steps:
4.1) For the text time series, attach a time attribute to each document according to when it was obtained, giving a data set of documents $D = \{(t_1, d_1), (t_2, d_2), \ldots, (t_n, d_n)\}$, where $(t_1, d_1)$ denotes document $d_1$ with time attribute $t_1$;
4.2) Use a bag-of-words model to collect the high-frequency words of the text data. Segment each document into words and count word frequencies; following expert opinion, set the maximum capacity of the bag of words to MAX_f; set the minimum word frequency to min_df to filter out words that appear in almost no documents and occur very few times; set the maximum word frequency to max_df to filter out words with abnormally high frequency, such as conjunctions or words like "yes". The bag-of-words model thus forms a high-frequency vocabulary dictionary;
4.3) Based on the dictionary, calculate $TF_i$, the word-frequency information of the words of the i-th document that are in the dictionary. According to the document's time attribute, multiply the $TF_i$ vector of each document by the document time weight $\lambda_i$, which is set under the guidance of an economics expert and is larger for more recent documents, to obtain $TF_i'$, the corrected word-frequency vector of the i-th document after its time attribute is taken into account;
4.4) Calculate the inverse document frequency with $\mathrm{IDF}(a) = \ln\bigl((1+n)/n_a\bigr)$, where $a$ is a word, $n$ is the total number of documents, and $n_a$ is the number of documents in which word $a$ appears. Combine the $TF'$ vector of each document with the IDF of the vocabulary to obtain $\delta_j$, the frequency of each dictionary word over the whole of $D$ with the time attribute taken into account. Set a threshold min_dic and filter out words whose $\delta_j$ is smaller than min_dic, finally obtaining the most recent high-frequency vocabulary; $\delta_j$ is the final frequency information of the j-th word.
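Steps 4.2 to 4.4 can be sketched in plain Python as below. Several points are assumptions made for illustration: documents are tokenized on whitespace (Chinese text would need a word segmenter), min_df and max_df are read as bounds on the number of documents containing a word, and "combining" $TF'$ with IDF is read as $\delta_j = \bigl(\sum_i \lambda_i\, TF_{ij}\bigr) \cdot \mathrm{IDF}(j)$. The function name, documents, and weights are hypothetical.

```python
import math
from collections import Counter


def hot_words(docs, time_weights, min_df=1, max_df=None,
              max_features=1000, min_dic=0.0):
    """Return {word: delta_j} for dictionary words with delta_j >= min_dic."""
    tokenized = [doc.lower().split() for doc in docs]  # assumed tokenizer
    n = len(tokenized)
    max_df = n if max_df is None else max_df

    # Document frequency n_a and total occurrence count of every word
    doc_freq = Counter(w for toks in tokenized for w in set(toks))
    total = Counter(w for toks in tokenized for w in toks)

    # Step 4.2: drop rare and overly common words, then cap the dictionary
    # size (max_features plays the role of MAX_f), keeping frequent words.
    candidates = [w for w in doc_freq if min_df <= doc_freq[w] <= max_df]
    dictionary = sorted(candidates, key=lambda w: (-total[w], w))[:max_features]

    scores = {}
    for word in dictionary:
        # Step 4.3: time-weighted term frequency, summed over documents
        weighted_tf = sum(lam * toks.count(word)
                          for lam, toks in zip(time_weights, tokenized))
        # Step 4.4: IDF(a) = ln((1 + n) / n_a)
        idf = math.log((1 + n) / doc_freq[word])
        scores[word] = weighted_tf * idf

    return {w: s for w, s in scores.items() if s >= min_dic}


# Hypothetical documents, oldest first, with larger weights for newer ones
docs = [
    "museum vr exhibit opens",
    "vr exhibit ticket sales rise",
    "museum launches vr lab",
]
time_weights = [0.2, 0.3, 0.5]
result = hot_words(docs, time_weights, max_df=2)
recent = hot_words(docs, time_weights, max_df=2, min_dic=0.4)
```

Here "vr" is dropped by max_df because it appears in all three documents, and raising min_dic removes words such as "exhibit" that appear only in the older, lower-weighted documents.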
Step 5), if new data exists, re-executing steps 1-4 using a sliding window, and stopping when there is no new data;
the specific method comprises the following steps:
5.1) Store the obtained features and hot-spot vocabulary;
5.2) If there is newly obtained data, re-execute steps 1-4 using the sliding-window principle; otherwise stop processing.
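The control flow of step 5 can be sketched with a fixed-size window that drops the oldest observations as new ones arrive. The text does not specify the window size or how the window advances, so both are assumptions here, and `extract_features` is a hypothetical stand-in for steps 1-4.

```python
from collections import deque


def process_stream(batches, window_size, extract_features):
    """Re-run feature extraction on a sliding window for each new batch."""
    window = deque(maxlen=window_size)  # oldest items fall out automatically
    stored = []
    for batch in batches:               # the loop ends when no new data is left
        window.extend(batch)
        stored.append(extract_features(list(window)))  # step 5.1: store results
    return stored


# Hypothetical usage: the "feature" is just the mean of the current window
results = process_stream(
    batches=[[1, 2, 3], [4, 5], [6]],
    window_size=4,
    extract_features=lambda w: sum(w) / len(w),
)
```

The three stored values correspond to the windows [1, 2, 3], [2, 3, 4, 5], and [3, 4, 5, 6].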
The time series data described in the present invention means the various data accumulated by an enterprise that are useful for its development. Examples include four types of yearly time series data (income, financing, liability, and user-count change data), as well as monthly sales data over several years and daily sales data over five years.
Example 1:
Suppose that company B wants to obtain, from its historical data, information useful for strategic decisions about its development. The time series data mainly comprises four types of yearly series for the last five years (income data, financing data, liability data, and user-count change data), daily sales data over five years, and the company's own meeting records together with many news reports.
First, the numerical data is classified by time granularity: the macroscopic data Tg is the four types of yearly data, and the microscopic data Ts is the daily data. The macroscopic data is processed first. According to the Boston matrix from economics, which is used here to classify companies, businesses fall into four categories: stars, dogs, question marks, and cash cows. The data is then z-normalized and the distance $\mathrm{Dist}(X_j, Y_{ij})$ is calculated, where $X_j$ is the j-th normalized sample time series ($j = 1, 2, 3, 4$) and $Y_{ij}$ is the j-th standard series of the i-th category, drawn from the same kind of source as the sample series.
The distances are normalized and used as the input of D-S evidence theory. Evidence fusion then yields the comprehensive probability $\omega_i$ that the company belongs to each category ($i$ denotes the i-th category of the Boston matrix), which gives the company's category feature from a macroscopic perspective.
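The fusion step in this example can be sketched with Dempster's rule of combination. The sketch assumes every source assigns mass only to the four single categories, with each mass vector summing to 1; in that special case the rule reduces to multiplying the masses class by class and renormalizing by the non-conflicting mass. The mass values are hypothetical and the function names are illustrative.

```python
import numpy as np


def dempster_combine(m1, m2):
    """Dempster's rule for two mass vectors over the same singleton classes.
    Both vectors are assumed to sum to 1."""
    m1 = np.asarray(m1, dtype=float)
    m2 = np.asarray(m2, dtype=float)
    agreement = m1 * m2               # both sources support the same class
    conflict = 1.0 - agreement.sum()  # K: mass on contradictory pairs
    if np.isclose(conflict, 1.0):
        raise ValueError("sources are in total conflict; the rule is undefined")
    return agreement / (1.0 - conflict)


def fuse(evidence):
    """Combine any number of mass vectors, one source at a time."""
    fused = np.asarray(evidence[0], dtype=float)
    for masses in evidence[1:]:
        fused = dempster_combine(fused, masses)
    return fused


classes = ["star", "cash cow", "question mark", "dog"]
# Hypothetical normalized weights from the four yearly series of company B
evidence = [
    [0.50, 0.20, 0.20, 0.10],  # income data
    [0.40, 0.30, 0.20, 0.10],  # financing data
    [0.30, 0.30, 0.30, 0.10],  # liability data
    [0.60, 0.10, 0.20, 0.10],  # user-count change data
]
omega = fuse(evidence)
```

Because the four sources mostly agree on "star", the fused support for that category is much higher than in any single source, which is the sharpening effect evidence fusion is used for here.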
Next, the microscopic data is processed. Assume that a complete set $S$ of highly discriminative best shapelets has already been obtained, for comparison, from the last five years of daily sales data of a standard enterprise; it contains $k$ shapelets, and each shapelet $s_i$ has length $L_i$. The distances between the $k$ shapelets and the sample time series are then computed in a loop: the Euclidean distance between each subsequence of length $L_i$ in Ts and $s_i$ is calculated repeatedly, which finally determines $\mathrm{dist}_i$.
Then the weight $u_i$ of $s_i$ with respect to Ts is calculated. Since a smaller distance means the two sequences are more similar, each weight is multiplied by its distance and the $k$ resulting terms are combined to obtain the trend feature; the smaller the result, the closer the sales of company B are to those of the company it is compared with.
Finally, for company B's meeting records and external news reports from the last five years, a bag-of-words model and the improved TF-IDF are used to obtain the hot words of the most recent period. These words indicate what company B and the outside world have been paying attention to in recent years. After this processing is finished, if new data is to be processed, it is handled in the same way using the sliding-window principle; otherwise processing stops and the mined information is applied to decision support.

Claims (5)

1. A feature extraction method suitable for time series data in the cultural science and technology fusion field, characterized in that the method comprises the following steps:
step 1), obtaining time series data from a target database and classifying the series by data type into text data and numerical data;
step 2), classifying the numerical data by time granularity into macroscopic and microscopic time series data; after the macroscopic data is standardized, calculating the similarity between the sample and industry standard data, and performing evidence fusion with the normalized similarity as the input of D-S evidence theory to obtain category features;
the specific method comprises the following steps:
2.1) Classify the numerical time series data by time granularity into macroscopic data Tg with large time granularity and microscopic data Ts with small time granularity; the macroscopic data comprises time series from multiple sources, namely financial data and market data that reflect the enterprise's business operation from different angles, and the microscopic data comprises only one type of time series, namely the principal indicator data measuring the enterprise's business operation capability;
2.2) For the macroscopic data, normalize the data by z-normalization and calculate the Euclidean distance between the sample and the industry standard data classified according to the Boston matrix [formula not reproduced in the source text],
where $m$ is the number of data points in a time series, $\delta_x$ is the variance of $X$, $\delta_y$ is the variance of $Y$, $\mu_x$ is the mean of $X$, $\mu_y$ is the mean of $Y$, $X$ is the standardized sample data, $Y$ is the standard data, $x$ is the value of time series $X$ at a given moment, and $y$ is the value of time series $Y$ at a given moment;
2.3) Normalize the distances and take the normalized distance weights as the input of the D-S evidence theory matrix; then perform evidence fusion according to D-S evidence theory to obtain a new comprehensive support degree $\omega_i$, where $i$ denotes the i-th category of the industry;
2.4) From the new comprehensive support degree, obtain the comprehensive weight of the category to which the sample belongs at the macro level, which measures the classification feature of the time series;
step 3), given the optimal shapelet set of a standard time series, calculating the distance between the microscopic data sample and each shapelet to obtain trend features;
step 4), for the text data, first using a bag-of-words model to obtain a high-frequency word set, then filtering that set a second time with an improved TF-IDF to obtain hot words;
step 5), if new data exists, re-executing steps 1-4 using a sliding window, and stopping when there is no new data.
2. The feature extraction method suitable for time series data in the cultural science and technology fusion field according to claim 1, characterized in that:
in the step 1), the specific method is as follows:
1.1) Acquire source data: obtain time series data from an enterprise database or from the public database of the relevant government body;
1.2) Classify the data by data type into numerical time series data and text time series data;
1.3) For new data, form a new classified data source using the sliding-window principle.
3. The feature extraction method suitable for time series data in the cultural science and technology fusion field according to claim 1, characterized in that:
in the step 3), the specific method is as follows:
3.1) For the microscopic time series data, assume that $k$ shapelets of the standard data have already been obtained, written $S = \langle s_1, s_2, \ldots, s_k \rangle$, where $s_i$ has length $L_i$. Compute in a loop the distance between each of the $k$ shapelets and the sample time series, defined as
$\mathrm{dist}_i = \min\bigl(\mathrm{dist}(\mathrm{Sub}(Ts)_{L_i}, s_i)\bigr), \quad i = 1, 2, \ldots, k$
where $\mathrm{Sub}(Ts)_{L_i}$ is a subsequence of the time series Ts of length $L_i$;
3.2) Calculate the weights $u_i = L_i / (L_1 + L_2 + \cdots + L_k)$;
3.3) Multiply each weight by the reciprocal of the corresponding distance and sum the products; given the meaning of the distance and the properties of shapelets, the result is the trend feature of the microscopic data.
4. The feature extraction method for time series data applicable to the field of cultural science and technology fusion according to claim 1, wherein the feature extraction method comprises the following steps:
in the step 4), the specific method is as follows:
4.1 For text time series, each document is added with a time attribute according to the obtained time, namely, a data set D= { (t) containing a plurality of documents 1 ,d 1 ),(t 2 ,d 2 )...(t n ,d n ) And (t) 1 ,d 1 ) Refer to document d 1 Time attribute t of (2) 1
4.2 Using a bag of words model to count the high frequency vocabulary for the text data: counting word frequency after word segmentation of each document, setting the maximum capacity of a word bag as MAX_f, setting the minimum frequency of words as min_df, filtering words which do not appear in any document and have few occurrence times, and setting the maximum frequency of words as max_df, and filtering words with abnormal occurrence frequency; forming a high-frequency vocabulary dictionary through a word bag model;
4.3 Based on the dictionary, calculating $TF_i$, the TF vector formed from the i-th document; according to the time attribute of the document, multiplying $TF_i$ by the document time weight $\lambda_i$, where the more recent the time, the larger the weight, to obtain $TF_i'$, the corrected word frequency vector of the i-th document that takes the time attribute of the document into account;
4.4 Calculating the inverse document frequency IDF, the calculation formula being $IDF(a) = \ln((1 + n) / n_a)$, where $a$ denotes a word, $n$ denotes the total number of documents, and $n_a$ denotes the number of documents in which word $a$ appears; combining the $TF'$ vector of each document with the IDF of the vocabulary to obtain $\delta_j$, the frequency of each dictionary word over the whole data set $D$ with the time attribute taken into account; setting a threshold min_dic to filter again, removing words whose $\delta_j$ is smaller than min_dic, and finally obtaining the recent high-frequency vocabulary; where $\delta_j$ denotes the final frequency information of the j-th word.
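Steps 4.2 to 4.4 can be sketched in Python as follows. The claim leaves several details open, so the sketch makes labeled assumptions: min_df and max_df are read as document counts, TF is the raw word count, the time weight $\lambda_i$ is an exponential decay in document age (the claim only requires that newer documents weigh more), and $\delta_j$ is the sum of the time-weighted counts over all documents multiplied by the IDF. All function names and the `decay` parameter are hypothetical.

```python
import math
from collections import Counter

def build_dictionary(token_docs, max_f, min_df, max_df):
    """Step 4.2: keep words whose document count lies in [min_df, max_df],
    then cap the dictionary at max_f words, most frequent first."""
    doc_freq = Counter(w for doc in token_docs for w in set(doc))
    total = Counter(w for doc in token_docs for w in doc)
    kept = [w for w in doc_freq if min_df <= doc_freq[w] <= max_df]
    kept.sort(key=lambda w: (-total[w], w))
    return kept[:max_f]

def time_weight(t, t_latest, decay=0.1):
    """Assumed form of lambda_i: exponential decay with document age."""
    return math.exp(-decay * (t_latest - t))

def recent_high_frequency_words(dataset, max_f, min_df, max_df, min_dic, decay=0.1):
    """dataset is a list of (t_i, tokens_i) pairs with numeric times.
    Returns {word: delta_j} for words with delta_j >= min_dic."""
    token_docs = [tokens for _, tokens in dataset]
    dictionary = build_dictionary(token_docs, max_f, min_df, max_df)
    n = len(dataset)
    t_latest = max(t for t, _ in dataset)
    delta = dict.fromkeys(dictionary, 0.0)
    for t, tokens in dataset:
        counts = Counter(tokens)
        lam = time_weight(t, t_latest, decay)
        for w in dictionary:
            delta[w] += lam * counts[w]        # step 4.3: TF'_i = lambda_i * TF_i
    for w in dictionary:
        n_a = sum(1 for tokens in token_docs if w in tokens)
        delta[w] *= math.log((1 + n) / n_a)    # step 4.4: IDF(a) = ln((1 + n) / n_a)
    return {w: d for w, d in delta.items() if d >= min_dic}
```

With `decay=0` every document receives weight 1 and the result reduces to a TF-IDF total without the time correction, which makes the effect of the time weight easy to compare.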
5. The feature extraction method for time series data applicable to the field of cultural science and technology fusion according to claim 1, wherein the specific method of step 5) is as follows:
5.1 Storing the obtained features and hot-spot vocabulary;
5.2 If there is newly obtained data, re-executing steps 1) to 4) using the sliding window principle; otherwise, stopping processing.
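The control flow of step 5) can be sketched in Python as follows. This is a minimal illustration, not the patented implementation: the window is assumed to hold a fixed number of the most recent data batches, and `extract` is a hypothetical stand-in for steps 1) to 4).

```python
from collections import deque

def process_stream(batches, window_size, extract):
    """Re-run feature extraction over the latest window_size batches each
    time a new batch arrives, and store every result (step 5.1)."""
    window = deque(maxlen=window_size)   # oldest batch drops out automatically
    stored = []
    for batch in batches:                # step 5.2: loop ends when no new data remains
        window.append(batch)
        stored.append(extract(list(window)))
    return stored
```

For example, `process_stream([1, 2, 3, 4], 2, sum)` applies `sum` to the windows `[1]`, `[1, 2]`, `[2, 3]` and `[3, 4]` in turn.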
CN202010453118.5A 2020-05-26 2020-05-26 Feature extraction method suitable for time series data in cultural science and technology fusion field Active CN111625578B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010453118.5A CN111625578B (en) 2020-05-26 2020-05-26 Feature extraction method suitable for time series data in cultural science and technology fusion field

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010453118.5A CN111625578B (en) 2020-05-26 2020-05-26 Feature extraction method suitable for time series data in cultural science and technology fusion field

Publications (2)

Publication Number Publication Date
CN111625578A CN111625578A (en) 2020-09-04
CN111625578B CN111625578B (en) 2023-12-08

Family

ID=72259271

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010453118.5A Active CN111625578B (en) 2020-05-26 2020-05-26 Feature extraction method suitable for time series data in cultural science and technology fusion field

Country Status (1)

Country Link
CN (1) CN111625578B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112818215A (en) * 2021-01-12 2021-05-18 平安科技(深圳)有限公司 Product data processing method, device, equipment and storage medium
CN112632231A (en) * 2021-01-20 2021-04-09 江苏思远集成电路与智能技术研究院有限公司 Feature extraction method suitable for time sequence data in cultural science and technology fusion field
CN116304114B (en) * 2023-05-11 2023-08-04 青岛市黄岛区中心医院 Intelligent data processing method and system based on surgical nursing

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108763213A (en) * 2018-05-25 2018-11-06 西南电子技术研究所(中国电子科技集团公司第十研究所) Theme feature text key word extracting method
CN109101477A (en) * 2018-06-04 2018-12-28 东南大学 A kind of enterprise's domain classification and enterprise's keyword screening technique
CN110019421A (en) * 2018-07-27 2019-07-16 山东大学 A kind of time series data classification method based on data characteristics segment
WO2019214133A1 (en) * 2018-05-08 2019-11-14 华南理工大学 Method for automatically categorizing large-scale customer complaint data

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP5284990B2 (en) * 2010-01-08 2013-09-11 インターナショナル・ビジネス・マシーンズ・コーポレーション Processing method for time series analysis of keywords, processing system and computer program

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2019214133A1 (en) * 2018-05-08 2019-11-14 华南理工大学 Method for automatically categorizing large-scale customer complaint data
CN108763213A (en) * 2018-05-25 2018-11-06 西南电子技术研究所(中国电子科技集团公司第十研究所) Theme feature text key word extracting method
CN109101477A (en) * 2018-06-04 2018-12-28 东南大学 A kind of enterprise's domain classification and enterprise's keyword screening technique
CN110019421A (en) * 2018-07-27 2019-07-16 山东大学 A kind of time series data classification method based on data characteristics segment

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
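Target intention recognition method based on improved high-dimensional data similarity; Cao Siyuan; Liu Yi'an; Xue Song; Transducer and Microsystem Technologies (No. 05); pp. 25-28 *
A survey of early classification of time series; Ma Chaohong; Weng Xiaoqing; Microcomputer & Its Applications (No. 16); full text *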

Also Published As

Publication number Publication date
CN111625578A (en) 2020-09-04

Similar Documents

Publication Publication Date Title
CN107577785B (en) Hierarchical multi-label classification method suitable for legal identification
Wei et al. Discovering bank risk factors from financial statements based on a new semi‐supervised text mining algorithm
CN111625578B (en) Feature extraction method suitable for time series data in cultural science and technology fusion field
Liu et al. Combining enterprise knowledge graph and news sentiment analysis for stock price prediction
Park et al. Explainability of machine learning models for bankruptcy prediction
CN104851025A (en) Case-reasoning-based personalized recommendation method for E-commerce website commodity
Liang et al. A stock time series forecasting approach incorporating candlestick patterns and sequence similarity
CN110310012B (en) Data analysis method, device, equipment and computer readable storage medium
CN109783633B (en) Data analysis service flow model recommendation method
CN113256409A (en) Bank retail customer attrition prediction method based on machine learning
CN109063983B (en) Natural disaster damage real-time evaluation method based on social media data
CN117971808B (en) Intelligent construction method for enterprise data standard hierarchical relationship
Ransing et al. Screening and Ranking Resumes using Stacked Model
CN113569048A (en) Method and system for automatically dividing affiliated industries based on enterprise operation range
Akça et al. Predicting acceptance of the bank loan offers by using support vector machines
Sudha Semi supervised multi text classifications for telugu documents
CN107423759B (en) Comprehensive evaluation method, device and application of low-dimensional successive projection pursuit clustering model
Plaue Data Science: An Introduction to Statistics and Machine Learning
CN115545437A (en) Financial enterprise operation risk early warning method based on multi-source heterogeneous data fusion
CN111274404B (en) Small sample entity multi-field classification method based on man-machine cooperation
Di Vincenzo et al. A text analysis for Operational Risk loss descriptions
Zhang et al. Credit Scoring model based on kernel density estimation and support vector machine for group feature selection
CN110598192A (en) Text feature reduction method based on neighborhood rough set
CN116932487B (en) Quantized data analysis method and system based on data paragraph division
CN110609961A (en) Collaborative filtering recommendation method based on word embedding

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant