CN113988087A - Technical subject multi-index calculation and trend prediction method and device - Google Patents

Info

Publication number
CN113988087A
Authority
CN
China
Prior art keywords
technical
subject
technical subject
degree
theme
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111250505.XA
Other languages
Chinese (zh)
Inventor
李玥
仇瑜
唐杰
刘德兵
褚晓泉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Zhipu Huazhang Technology Co ltd
Original Assignee
Beijing Zhipu Huazhang Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Zhipu Huazhang Technology Co ltd filed Critical Beijing Zhipu Huazhang Technology Co ltd
Priority to CN202111250505.XA priority Critical patent/CN113988087A/en
Publication of CN113988087A publication Critical patent/CN113988087A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/40 Processing or translation of natural language
    • G06F40/42 Data-driven translation
    • G06F40/44 Statistical methods, e.g. probability models
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/284 Lexical analysis, e.g. tokenisation or collocates
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00 Administration; Management
    • G06Q10/04 Forecasting or optimisation specially adapted for administrative or management purposes, e.g. linear programming or "cutting stock problem"

Abstract

The invention provides a method and a device for technical subject multi-index calculation and trend prediction. The method comprises the following steps: acquiring text data of a plurality of papers and preprocessing the key fields of that data, the key fields comprising the paper title, abstract and keywords; calculating multi-feature weights from the preprocessed key fields, and extracting technical topics from the weighted multi-feature fields with an LDA topic model; designing, based on the extracted technical subjects, a plurality of index measures of how advanced a technical subject is, and computing the values of the technical subject indexes, which comprise strength, stability, emerging degree and frontier degree; and predicting the development trend of the technical subjects with a Logistic model based on the index values, and comparing the technical subjects with one another. The method extracts multiple features of the text, divides subjects clearly, and forms a general method for technical subject index calculation and trend prediction.

Description

Technical subject multi-index calculation and trend prediction method and device
Technical Field
The invention relates to the field of text topic recognition and trend prediction, and in particular to a method and device for technical subject multi-index calculation and trend prediction.
Background
Technical subject recognition and trend prediction give insight into academic development trends across fields and underpin technical trend analysis tools such as scientific hotspot analysis and scientific trend analysis. They support scientific decision-making at every level, provide technical support for seizing the global high ground in science and technology and for reaching the goal of becoming a world science and technology power, and are significant for promoting scientific and technological innovation, breaking international monopolies and filling domestic gaps. Academic papers, as one of the main carriers of scientific achievement, contain a large number of research topics. Analyzing the evolution and trend of technical subjects in the papers of a specific research field makes it possible to grasp the context of technological development at the macro level and to clarify the evolution trend of the technology and the development stage of key technologies, and this analysis is becoming an important factor influencing government decision-making and industrial development. How to extract and identify the frontier hot spots of a subject field from the research literature quickly and accurately is therefore of real significance for current research work.
For technical subject identification, existing research methods mainly include co-word analysis, word frequency analysis, co-citation analysis, content analysis, social network analysis and topic models. A topic model is an unsupervised learning method that effectively captures the latent topics of documents and is now widely used in text analysis. The Latent Dirichlet Allocation (LDA) topic model models the semantic information of large-scale corpora well, overcomes label limitations and semantic ambiguity, and alleviates problems such as the high dimensionality and sparsity of the data, so it has attracted the attention of researchers working on topic identification in research papers.
For technical subject trend prediction, the Logistic model is a multivariate statistical analysis method based on independent and dependent variables, and researchers widely use it for predictive analysis of technical subjects.
Earlier research has made some progress in technical subject identification and prediction and provides an important reference for national science and technology strategy and for researchers choosing frontier problems. However, problems remain. Current topic models for research papers mostly rely on the abstract alone to extract the technical subjects of a specific field, which tends to make the extracted technical subjects inaccurate and mutually correlated.
Disclosure of Invention
The present invention is directed to solving, at least to some extent, one of the technical problems in the related art.
The technical subject multi-index calculation and trend prediction method solves the technical problem that current topic models for research papers mostly extract the technical subjects of a specific field from the abstract alone, so that the extracted technical subjects are inaccurate and mutually correlated.
To this end, a first objective of the present invention is to provide a method for technical subject multi-index calculation and trend prediction, including:
acquiring text data of a plurality of papers, and preprocessing key fields of that text data, wherein the key fields comprise: paper title, abstract and keywords;
performing multi-feature weight calculation on the preprocessed key fields, and extracting technical topics from the weighted multi-feature fields using an LDA topic model;
designing, based on the extracted technical subjects, a plurality of index measurement methods for measuring how advanced a technical subject is, and computing index values of the technical subject indexes, wherein the technical subject indexes comprise: strength, stability, emerging degree and frontier degree;
and predicting the development trend of the technical subjects using a Logistic model based on the index values, and performing comparative analysis among the technical subjects.
In addition, the technical subject multi-index calculation and trend prediction method according to the above embodiment of the present invention may further have the following additional technical features:
Further, the preprocessing operation comprises: converting the acquired paper text data to plain text, removing punctuation marks, removing numbers, performing word segmentation, and filtering stop words.
Optionally, the method updates the hyper-parameters of the LDA topic model by Gibbs sampling, and determines the optimal number of technical topics using perplexity.
Further, the multi-feature weight calculation measures the weights of the multi-feature fields by the degree of similarity between them, computed as the cosine similarity between the multi-feature fields; the relative importance of the multi-feature fields is calculated as follows:
[Formula images BDA0003322427140000021, BDA0003322427140000022 and BDA0003322427140000023 not reproduced in the source text]
where the weight of a multi-feature field is proportional to the degree of correlation of that field; n_m is the number of terms contained in document m; μ_mi is the average number of occurrences of all terms in field T_mi of document m; μ_mj is calculated in the same way; and i, j = 1, 2, 3.
Further, extracting technical topics from the weighted multi-feature fields using the LDA topic model includes:
selecting a document d_m according to a prior probability P(d_m);
sampling from a Dirichlet distribution with parameter α to generate the topic distribution θ_m of document d_m;
sampling from the topic distribution θ_m to generate the topic z_{m,n} of the n-th word of document d_m, selecting the topic with the maximum probability value;
sampling from a Dirichlet distribution with parameter β to generate the term distribution φ_{z_{m,n}} corresponding to topic z_{m,n};
sampling from the term distribution φ_{z_{m,n}} to generate the term w_{m,n}, selecting the term with the maximum probability value;
iterating the above process until the text set is generated.
Further, computing the index values of the technical subject indexes comprises:
calculating the technical subject intensity by the following formula:
[Formula image BDA0003322427140000033 not reproduced in the source text]
where [symbol image BDA0003322427140000034] represents the intensity of the k-th technical subject in year t; [symbol image BDA0003322427140000035] represents the number of documents of the k-th technical subject in year t, with 1 ≤ k ≤ K; and M_t represents the number of documents in year t;
calculating the technical subject stability by the following formula:
[Formula image BDA0003322427140000036 not reproduced in the source text]
where [symbol image BDA0003322427140000037] represents the stability of the k-th technical subject in year t; [symbol image BDA0003322427140000038] represents the standard deviation of the number of documents of the k-th technical subject over all time periods before year t; and [symbol image BDA0003322427140000039] represents the mean number of documents of the k-th technical subject over all time periods before year t;
calculating the technical subject emerging degree by the following formula:
[Formula image BDA00033224271400000310 not reproduced in the source text]
where [symbol image BDA00033224271400000311] represents the emerging degree of the k-th technical subject in year t; [symbol image BDA00033224271400000312] represents the sum of the publication years of the documents of the k-th technical subject over all time periods before year t; and [symbol image BDA00033224271400000313] represents the number of documents contained in all time periods before year t;
calculating the technical subject frontier degree as the sum of the contribution degree and the influence degree, where:
contribution degree = core paper share + citing paper share, with core paper share = number of core papers / total number of frontier core papers, and citing paper share = number of citing papers / total number of frontier citing papers;
influence degree = core paper citation frequency share + citing paper citation frequency share, with core paper citation frequency share = citation frequency of the core papers / citation frequency of the frontier core papers, and citing paper citation frequency share = citation frequency of the citing papers / citation frequency of the frontier citing papers.
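The index definitions above can be illustrated with a minimal Python sketch. The frontier degree follows the share definitions given in the text. The intensity and emerging-degree functions are assumptions, because the formula images are missing: intensity is taken to be the topic's share of the documents published in year t, and emerging degree the mean publication year of the topic's earlier documents. Stability is left out because the text does not say how the standard deviation and the mean are combined. All function names and the dictionary layout are hypothetical.

```python
def intensity(topic_docs_in_year, all_docs_in_year):
    """Assumed form: share of the year's documents that belong to the topic."""
    return topic_docs_in_year / all_docs_in_year


def emerging_degree(publication_years):
    """Assumed form: mean publication year of the topic's documents before year t."""
    return sum(publication_years) / len(publication_years)


def frontier_degree(core, citing, frontier_core, frontier_citing):
    """Contribution degree plus influence degree, using the share definitions in the text.

    Each argument is a dict with hypothetical keys "papers" (paper count)
    and "citations" (citation frequency).
    """
    contribution = (core["papers"] / frontier_core["papers"]
                    + citing["papers"] / frontier_citing["papers"])
    influence = (core["citations"] / frontier_core["citations"]
                 + citing["citations"] / frontier_citing["citations"])
    return contribution + influence


# Illustrative numbers only, not data from the patent.
topic_intensity = intensity(30, 120)
topic_emergence = emerging_degree([2018, 2019, 2019, 2020])
topic_frontier = frontier_degree(
    core={"papers": 5, "citations": 100},
    citing={"papers": 20, "citations": 40},
    frontier_core={"papers": 50, "citations": 1000},
    frontier_citing={"papers": 200, "citations": 400},
)
print(topic_intensity, topic_emergence, topic_frontier)
```

Computing the indexes per subject and per year in this way yields the time series that the trend prediction step takes as input.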
Further, predicting the development trend of the technical subjects using the Logistic model comprises:
the Logistic model tracks the nonlinear trend of a time series, and its slope fits the changing rate of development of a technology across the germination, growth, maturity and decline stages, as follows:
[Formula image BDA0003322427140000041 not reproduced in the source text]
where y_t and t represent the technical subject index variable (for example one of the four measurement indexes: strength, stability, emerging degree and frontier degree) and the time variable, respectively; b is the maximum saturation value the curve can reach, [image BDA0003322427140000042 not reproduced]; a represents the slope of the curve, [image BDA0003322427140000043 not reproduced]; and I represents the time node at which the concavity of the curve changes (the inflection point), with I > 0.
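A short Python sketch may make the role of the three parameters concrete. It assumes the common three-parameter logistic form y_t = b / (1 + exp(-a (t - I))), which matches the descriptions of b, a and I but is not confirmed by the source, since the formula image is missing. The fitting helper is hypothetical: it treats b as known and recovers a and I from the linearization ln(b / y - 1) = -a t + a I by ordinary least squares.

```python
import math


def logistic(t, a, b, inflection):
    """Assumed logistic curve: b / (1 + exp(-a * (t - inflection)))."""
    return b / (1.0 + math.exp(-a * (t - inflection)))


def fit_logistic(ts, ys, b):
    """Estimate a and the inflection point for a given saturation value b.

    Requires 0 < y < b for every observation, because the fit uses
    ln(b / y - 1) = -a * t + a * inflection.
    """
    zs = [math.log(b / y - 1.0) for y in ys]
    n = len(ts)
    mean_t = sum(ts) / n
    mean_z = sum(zs) / n
    slope = (sum((t - mean_t) * (z - mean_z) for t, z in zip(ts, zs))
             / sum((t - mean_t) ** 2 for t in ts))
    intercept = mean_z - slope * mean_t
    a = -slope
    return a, intercept / a


# Synthetic index values generated from known parameters, then recovered.
ts = list(range(1, 11))
ys = [logistic(t, a=0.8, b=1.0, inflection=5.0) for t in ts]
a_hat, i_hat = fit_logistic(ts, ys, b=1.0)
print(a_hat, i_hat)
```

At t = I the curve reaches half of its saturation value b, which is why I marks the transition from accelerating to decelerating growth.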
Further, updating the hyper-parameters of the LDA topic model using Gibbs sampling includes:
estimating the document-topic probability distribution and the topic-term probability distribution by Gibbs sampling, wherein the parameter estimates of the two distributions are calculated as follows:
\theta_{m,k} = \frac{n_m^{(k)} + \alpha_k}{\sum_{k=1}^{K} \left( n_m^{(k)} + \alpha_k \right)}
\varphi_{k,t} = \frac{n_k^{(t)} + \beta_t}{\sum_{t=1}^{V} \left( n_k^{(t)} + \beta_t \right)}
where n_m^{(k)} is the number of times topic k occurs in the m-th text in the document-topic distribution; α_k is the Dirichlet prior of the document-topic distribution; n_k^{(t)} is the number of times term t is assigned to the k-th topic in the topic-term distribution; and β_t is the Dirichlet prior of the topic-term distribution.
Further, determining the optimal number of technical topics using perplexity comprises:
applying perplexity to the LDA model to generate a perplexity curve, wherein the number of topics corresponding to the lowest point or inflection point of the perplexity curve is the determined optimal number, and the calculation formula is as follows:
[Formula image BDA0003322427140000048 not reproduced in the source text]
To achieve the above object, a second aspect of the present invention provides a technical subject multi-index calculation and trend prediction device, including:
a preprocessing module, configured to acquire text data of a plurality of papers and preprocess key fields of that text data, wherein the key fields comprise: paper title, abstract and keywords;
an extraction module, configured to perform multi-feature weight calculation on the preprocessed key fields and to extract technical topics from the weighted multi-feature fields using an LDA topic model;
a statistical module, configured to design, based on the extracted technical subjects, a plurality of index measurement methods for measuring how advanced a technical subject is, and to compute the index values of the technical subject indexes, wherein the technical subject indexes comprise: strength, stability, emerging degree and frontier degree;
and a prediction module, configured to predict the development trend of the technical subjects using a Logistic model based on the index values and to perform comparative analysis among the technical subjects.
The technical subject multi-index calculation and trend prediction device solves the technical problem that current topic models for research papers mostly extract the technical subjects of a specific field from the abstract alone, so that the extracted technical subjects are inaccurate and mutually correlated.
The invention has the following beneficial effects: applied to technical subject identification and trend prediction in a specific research field, the method reveals research hot spots and development trends more comprehensively and accurately, offers a new perspective for theoretical research on technical subject identification and trend prediction, and provides methodological reference and decision support for the strategic planning of researchers, enterprises and others.
Additional aspects and advantages of the invention will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of the invention.
Drawings
The foregoing and/or additional aspects and advantages of the present invention will become apparent and readily appreciated from the following description of the embodiments, taken in conjunction with the accompanying drawings of which:
fig. 1 is a flowchart of a method for calculating multiple indexes and predicting a trend of a technical subject according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of an LDA graph model according to an embodiment of the present invention;
fig. 3 is a schematic structural diagram of a technical subject multi-index calculation and trend prediction apparatus according to an embodiment of the present invention.
Detailed Description
Reference will now be made in detail to embodiments of the present invention, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to the same or similar elements or elements having the same or similar function throughout. The embodiments described below with reference to the drawings are exemplary, are intended to explain the invention, and are not to be construed as limiting it.
The following describes a technical subject multi-index calculation and trend prediction method and apparatus according to an embodiment of the present invention with reference to the drawings, and first, a technical subject multi-index calculation and trend prediction method according to an embodiment of the present invention will be described with reference to the drawings.
Fig. 1 is a flowchart of a method for calculating multiple indexes and predicting a trend according to an embodiment of the present invention.
As shown in fig. 1, the method for calculating multiple indexes and predicting a trend of a technical subject includes the following steps:
Step S1, acquiring text data of a plurality of papers, and preprocessing key fields of that text data; wherein the key fields include: paper title, abstract and keywords.
Specifically, the invention builds a corpus from academic paper data in the field of artificial intelligence as the research object, including experimental data such as paper titles, abstracts, keywords, authors, institutions, publication times and citation counts. The aim is to explore the research frontiers of the artificial intelligence field and their future development, and to provide a reference for the analysis of other research frontiers.
The title, abstract and keywords of a paper characterize the technical subject information of the specific field the paper relates to, and the technical information carried by these semi-structured documents is the focus of the research. Before text data are input into the LDA topic model, the title, abstract and keywords of each paper need to be preprocessed separately to form the basic data for topic identification. Data preprocessing mainly comprises converting the acquired text data to plain text, removing punctuation marks, removing numbers, word segmentation and stop word filtering, which provides good experimental data for subsequent analysis.
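The preprocessing steps listed above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the patent's implementation: the stop-word list is a tiny hypothetical sample, segmentation is a plain whitespace split (Chinese text would need a dedicated word segmenter), and the field names `title`, `abstract` and `keywords` are chosen for the example.

```python
import re

# Hypothetical stop-word list; a real pipeline would load a fuller one.
STOP_WORDS = {"a", "an", "and", "the", "of", "for", "in", "on", "to", "with", "is", "are"}


def preprocess(text):
    """Lowercase, drop punctuation and digits, split on whitespace, filter stop words."""
    text = text.lower()
    text = re.sub(r"[^\w\s]", " ", text)   # remove punctuation marks
    text = re.sub(r"\d+", " ", text)       # remove numbers
    tokens = text.split()                  # naive whitespace segmentation
    return [tok for tok in tokens if tok not in STOP_WORDS]


def preprocess_paper(paper):
    """Preprocess the three key fields of one paper separately."""
    return {field: preprocess(paper[field]) for field in ("title", "abstract", "keywords")}


paper = {
    "title": "A Survey of Topic Models, 2021 Edition",
    "abstract": "Topic models are used for text mining in 3 domains.",
    "keywords": "LDA; topic model; text mining",
}
fields = preprocess_paper(paper)
print(fields["title"])
```

Keeping the three fields separate at this stage matters because the next step assigns each field its own weight.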
And step S2, performing multi-feature weight calculation according to the key fields after the preprocessing operation, and performing technical subject extraction on the weighted multi-feature fields by adopting an LDA subject model.
Specifically, assume that a preprocessed text corpus D containing M texts is to be generated. The set D = {d_1, d_2, …, d_M} of the M texts is covered by 3 feature fields T_1, T_2, T_3 and can be expressed as an M × 3 feature field matrix (Table 1), with d_m = (T_m1, T_m2, T_m3), 1 ≤ m ≤ M, d_m ∈ D. T_1, T_2 and T_3 are the title, abstract and keyword fields of the papers, respectively, and each contains a plurality of terms.
[Table image BDA0003322427140000061 not reproduced in the source text]
Table 1. Text set feature matrix
Because the feature fields of a paper differ in how important they are to its technical subjects and in how well they represent them, the embodiment of the invention constructs text feature weights for the corpus. This effectively reduces the difficulty of topic identification, increases the interpretability of the topics, makes the division of topics clearer, and improves the generalization ability of topic identification. The weight of a feature field reflects that feature's contribution to, and ability to discriminate, the text content, and the calculation of the feature field weights has a large influence on the quality of topic extraction. The corpus finally fed to the LDA model is D = (aT_1, bT_2, cT_3), where a, b and c are the weights of the paper title, abstract and keywords, respectively, and the feature field weights sum to 1. The stronger the correlation between feature fields, the better the features express the content. The weight of feature fields with strong expressive ability therefore needs to be increased and the weight of those with weak expressive ability decreased. The weights of the feature fields are measured by the degree of similarity between them, and the cosine similarity between features is calculated following the TF-IDF idea; the relative importance of the feature fields is calculated as follows:
[Formula images BDA0003322427140000062, BDA0003322427140000071 and BDA0003322427140000072 not reproduced in the source text]
where the weight of a feature field is proportional to the degree of correlation of that field; n_m is the number of terms contained in document m; μ_mi is the average number of occurrences of all terms in field T_mi of document m; μ_mj is calculated in the same way; and i, j = 1, 2, 3.
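The idea of deriving field weights from inter-field similarity can be sketched as follows. Because the patent's own formulas are not reproduced, the weighting rule here is an assumption: each field's score is the sum of its cosine similarities to the other two fields, and the scores are normalized so that the weights a, b and c sum to 1. Raw term counts stand in for TF-IDF values, and the function names are hypothetical.

```python
import math
from collections import Counter


def cosine(u, v):
    """Cosine similarity between two sparse term-count vectors (Counters)."""
    dot = sum(u[t] * v[t] for t in u if t in v)
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def field_weights(fields):
    """Assumed rule: a field's weight is proportional to its summed similarity to the other fields."""
    vectors = [Counter(tokens) for tokens in fields]
    scores = []
    for i, u in enumerate(vectors):
        scores.append(sum(cosine(u, v) for j, v in enumerate(vectors) if j != i))
    total = sum(scores)
    if total == 0:
        return [1.0 / len(fields)] * len(fields)  # fall back to equal weights
    return [s / total for s in scores]


# Illustrative token lists for the title, abstract and keyword fields.
title = ["topic", "model", "trend"]
abstract = ["topic", "model", "paper", "trend", "prediction", "topic"]
keywords = ["topic", "trend"]
a, b, c = field_weights([title, abstract, keywords])
print(a, b, c)
```

A field that shares more vocabulary with the others receives a larger weight, which reflects the statement that strongly correlated fields express the content better.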
Further, in step S2, an LDA topic model is used to perform technical topic extraction on the weighted multi-feature fields.
The LDA model is an unsupervised machine learning method that can identify latent topic information in a large document set or corpus and is widely used in text mining and related fields. LDA is a three-level hierarchical Bayesian model: each document is assumed to be a mixture of several topics, and each topic is a probability distribution over words. Topic discovery and text generation are inverse processes. Topic discovery finds, from the text, the topics described by its terms and the associations among different topics, and the same text can have several topics; text generation selects from the vocabulary the terms that match the topics and assembles them into a text. The two run in opposite directions, and the LDA model solves the inverse problem of text generation by means of the prior parameters.
The core problem in modeling document topics with LDA is estimating the probability distributions of the hidden variables, that is, obtaining the latent topic distribution of each target document and the term distribution of each latent topic, which supports further analysis of the development trend of technical subjects. The content of the feature-weighted text set D relates to K topics (1 ≤ k ≤ K); the terms of every text come from the same vocabulary of V elements (the V distinct words of the bag of words), and each document m contains N_m terms (1 ≤ n ≤ N_m; words may repeat within a text). A document is generated as follows:
(1) Select a document d_m according to a prior probability P(d_m).
(2) Sample from a Dirichlet distribution with parameter α to generate the topic distribution θ_m of document d_m; that is, the topic distribution θ_m is generated from a Dirichlet distribution with hyper-parameter α. The text-topic distribution has dimensions M × K, as shown in Table 2.
[Table image BDA0003322427140000073 not reproduced in the source text]
Table 2. Text-topic distribution
(3) Sample from the multinomial topic distribution θ_m to generate the topic z_{m,n} of the n-th word of document d_m; the topic with the highest probability value under the distribution is selected.
(4) Sample from a Dirichlet distribution with parameter β to generate the term distribution φ_{z_{m,n}} corresponding to topic z_{m,n}; that is, the term distribution φ_{z_{m,n}} is generated from a Dirichlet distribution with hyper-parameter β. The topic-term distribution has dimensions K × N_m, as shown in Table 3.
[Table image BDA0003322427140000081 not reproduced in the source text]
Table 3. Topic-term distribution
(5) Sample from the multinomial term distribution φ_{z_{m,n}} to generate the final term w_{m,n}; the term with the highest probability value under the distribution is selected.
(6) Iterate the above process until the text set is generated.
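The generation steps can be illustrated with a small standard-library Python sketch. It is an assumption-laden illustration: Dirichlet draws are produced by normalizing Gamma variates, symmetric priors are used, and topics and terms are drawn at random from their distributions, which is the usual reading of the LDA generative process, rather than always taking the most probable entry. The function names, vocabulary and corpus sizes are hypothetical.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw from Dirichlet(alpha) by normalizing independent Gamma(alpha_i, 1) draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    if total == 0.0:  # guard against underflow when every alpha_i is tiny
        return [1.0 / len(alpha)] * len(alpha)
    return [d / total for d in draws]


def sample_index(probs, rng):
    """Draw one index from a categorical distribution."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


def generate_corpus(num_docs, doc_len, vocab, num_topics, alpha=0.5, beta=0.01, seed=0):
    """Generate documents by following steps (1) to (6) with symmetric priors."""
    rng = random.Random(seed)
    # One term distribution phi_k per topic, drawn from Dirichlet(beta).
    phi = [sample_dirichlet([beta] * len(vocab), rng) for _ in range(num_topics)]
    corpus = []
    for _ in range(num_docs):
        theta = sample_dirichlet([alpha] * num_topics, rng)  # topic distribution theta_m
        words = []
        for _ in range(doc_len):
            z = sample_index(theta, rng)     # topic z_{m,n} of the n-th word
            w = sample_index(phi[z], rng)    # term w_{m,n} drawn from phi_z
            words.append(vocab[w])
        corpus.append(words)
    return corpus


vocab = ["model", "topic", "network", "image", "speech", "graph"]
corpus = generate_corpus(num_docs=3, doc_len=5, vocab=vocab, num_topics=2)
print(corpus)
```

Topic modeling runs this process in reverse: given only the words, it infers the θ and φ that most plausibly produced them.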
The LDA topic model treats a document as a collection of keywords and ignores grammar and word order when constructing the topic model; the Bayesian network diagram for generating documents with the model is shown in Fig. 2.
In Fig. 2, the rectangles indicate the scope over which the enclosed variables are repeated, indexed by m, n and k; the double circle denotes a known quantity that can be observed experimentally. The hyper-parameter α is the parameter of the Dirichlet prior on each document's topic distribution, and β is the parameter of the Dirichlet prior on each topic's term distribution; α and β smooth the multinomial parameters drawn from the Dirichlet distributions. The values of the hyper-parameters affect which terms are obtained under each topic and ultimately the accuracy of the algorithm; empirically, the optimal values are α = 0.5 and β = 0.01. The random variable θ represents the topic distribution vector of the target document, with the parameter to be estimated [symbol image BDA0003322427140000083 not reproduced]. The random variable φ represents the term distribution vector of the target topic, with the parameter to be estimated [symbol image BDA0003322427140000084 not reproduced]. The latent variable z_{m,n} represents the topic assigned to the n-th word of target document m and reflects the latent relation between documents and terms; w_{m,n} represents the n-th word of the m-th document.
LDA generation of the text set consists of two main steps: the first solves the posterior probability of θ from the document-topic Dirichlet prior α, and the second solves the posterior probability of φ from the topic-term Dirichlet prior β. From the large amount of known document-term information P(w_n | d_m), the document-topic distribution P(z_k | d_m) and the topic-term distribution P(w_n | z_k) can be trained, as in the following formula:
P(w_n \mid d_m) = \sum_{k=1}^{K} P(w_n \mid z_k) \, P(z_k \mid d_m)
The generation probability of each word in a document is therefore:
P(d_m, w_n) = P(d_m) \sum_{k=1}^{K} P(w_n \mid z_k) \, P(z_k \mid d_m)
For a given corpus, LDA topic modeling estimates the parameters of P(w_n | z_k) and P(z_k | d_m). P(d_m), the probability of document d_m, is the product of the occurrence probabilities of all terms of d_m in the dictionary and can be calculated in advance. P(w_n | z_k) and P(z_k | d_m) involve latent variables and cannot be computed directly; the usual approach is parameter estimation by Gibbs sampling. Gibbs sampling is a Markov Chain Monte Carlo (MCMC) algorithm that draws a sequence of samples approximating a specified multidimensional probability distribution, and it can eliminate the influence of the prior parameters on the result. The document-topic and topic-term probability distributions are estimated by Gibbs sampling, with the parameter estimates of the two distributions calculated as follows:
\theta_{m,k} = \frac{n_m^{(k)} + \alpha_k}{\sum_{k=1}^{K} \left( n_m^{(k)} + \alpha_k \right)}
\varphi_{k,t} = \frac{n_k^{(t)} + \beta_t}{\sum_{t=1}^{V} \left( n_k^{(t)} + \beta_t \right)}
where n_m^{(k)} is the number of times topic k occurs in the m-th text in the document-topic distribution; α_k is the Dirichlet prior of the document-topic distribution; n_k^{(t)} is the number of times term t is assigned to the k-th topic in the topic-term distribution; and β_t is the Dirichlet prior of the topic-term distribution.
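Given the count and prior definitions above, the two estimates can be sketched as smoothed count ratios, the form commonly used with collapsed Gibbs sampling for LDA. The sketch assumes that form, takes the assignment counts as already collected by a sampler that is not shown, and uses hypothetical function names and illustrative counts.

```python
def smoothed_rows(counts, prior):
    """Normalize each row of counts after adding the Dirichlet prior to every entry."""
    rows = []
    for row in counts:
        smoothed = [n + p for n, p in zip(row, prior)]
        total = sum(smoothed)
        rows.append([s / total for s in smoothed])
    return rows


def estimate_theta(doc_topic_counts, alpha):
    """theta[m][k] = (n_m^(k) + alpha_k) / sum over k of (n_m^(k) + alpha_k)."""
    return smoothed_rows(doc_topic_counts, alpha)


def estimate_phi(topic_term_counts, beta):
    """phi[k][t] = (n_k^(t) + beta_t) / sum over t of (n_k^(t) + beta_t)."""
    return smoothed_rows(topic_term_counts, beta)


# Illustrative counts: 2 documents, 2 topics, 3 vocabulary terms.
doc_topic_counts = [[3, 1], [0, 4]]
topic_term_counts = [[2, 1, 0], [0, 1, 4]]
theta = estimate_theta(doc_topic_counts, alpha=[0.5, 0.5])
phi = estimate_phi(topic_term_counts, beta=[0.01, 0.01, 0.01])
print(theta)
print(phi)
```

The priors keep every probability strictly positive even for a topic or term with a zero count, which is the smoothing role of α and β.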
P(dm,wn) The core of the mathematical principle of the LDA model is that the information found by the text subject is concentrated in the two probability distributions. However, the result calculated by the formula has uncertainty, because the formula contains a priori parameters, and therefore continuous iteration is needed to obtain stable text-topic distribution and topic-term distribution, and the two distribution updating rules are obtained through Gibbs sampling, so that the LDA model is determined.
Further, the LDA model input text is obtained by preprocessing the text data; feature weighting is applied to the key fields, an LDA topic model is used for modeling, and the LDA model parameters are updated through Gibbs sampling. Although the LDA model has been constructed, the model cannot directly determine the number of technical topics in the specific field covered by the papers, and the distribution of the extracted topics is strongly affected by the number of topics. If the selected topic granularity is too coarse, topics without clearly distinguishable semantics are produced and some topic details are not effectively captured; if the number of topics is too large, the topic information is easily over-fragmented. How to determine the number of technical topics in a principled way is therefore the key question.
The embodiment of the invention uses perplexity to determine the optimal number of topics, which avoids subjective, manual selection and represents the latent topics actually contained in the analysis object objectively and consistently. Perplexity measures how well a probability distribution or probability model predicts a sample. It generally decreases as the number of latent topics increases, and the smaller the perplexity, the better the generative ability of the topic model; conversely, the larger the perplexity, the weaker that ability. The number of topics corresponding to the lowest point or inflection point of the perplexity curve is taken as the optimal number of topics. The formula as applied to the LDA model is as follows:
Figure BDA0003322427140000095
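As an illustration of how such a curve might be computed, the sketch below uses the standard definition of perplexity, exp(-Σ log p(w) / N), with p(w|d_m) = Σ_k θ_{m,k} φ_{k,w}. Whether the formula image above matches this exact form is an assumption, and the function names and the perplexity values per topic count are hypothetical. Only the lowest-point rule is implemented; detecting an inflection point is left out.

```python
import math


def perplexity(docs, theta, phi):
    """Standard perplexity of a topic model on a set of documents.

    docs:  list of documents, each a list of term indices
    theta: theta[m][k], document-topic probabilities
    phi:   phi[k][t], topic-term probabilities
    """
    log_likelihood = 0.0
    token_count = 0
    for m, doc in enumerate(docs):
        for t in doc:
            # Probability of term t in document m, mixed over all topics.
            p = sum(theta[m][k] * phi[k][t] for k in range(len(phi)))
            log_likelihood += math.log(p)
            token_count += 1
    return math.exp(-log_likelihood / token_count)


def pick_topic_count(perplexity_by_k):
    """Return the topic count with the lowest perplexity."""
    return min(perplexity_by_k, key=perplexity_by_k.get)


# One document, one topic, two equally likely terms.
example = perplexity(docs=[[0, 1]], theta=[[1.0]], phi=[[0.5, 0.5]])

# Hypothetical perplexity values measured for three candidate topic counts.
best_k = pick_topic_count({5: 120.0, 10: 95.5, 15: 101.2})
```

In practice the dictionary passed to `pick_topic_count` would be filled by training one model per candidate topic count and evaluating each on the same documents.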
Step S3: based on the extracted technical subjects, design a plurality of index measurement methods for measuring the advancement of a technical subject, and compute the index values of the technical subject indexes; wherein the technical subject indexes comprise: strength, stability, emerging degree and frontier degree.
Specifically, on the basis of the technical subjects identified from the literature, calculation formulas are designed for four indexes, namely the strength, stability, emerging degree and frontier degree of a technical subject, and the indexes are computed for each year so that the evolution trends of the technical subjects can be compared more intuitively.
(1) Strength of technical subject
The strength of a technical subject describes how hot the subject is: the more documents relate to a subject at a given time, the higher its strength and the more intensively it is being researched. The calculation formula is as follows:
Figure BDA0003322427140000101
where
Figure BDA0003322427140000102
represents the strength of the k-th technical subject in the t-th year;
Figure BDA0003322427140000103
represents the number of documents contained in the t-th year under the k-th technical subject, 1 ≤ k ≤ K; M_t represents the number of documents contained in the t-th year.
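A minimal Python sketch of the strength index follows. It assumes, from the definitions above, that the formula image is the ratio of the documents under subject k in year t to all documents in year t (M_t); the function name, subject labels and counts are hypothetical.

```python
def subject_strength(docs_per_subject_year, docs_per_year):
    """Strength per subject and year, assumed to be (docs under subject) / M_t.

    docs_per_subject_year: {subject: {year: document count}}
    docs_per_year:         {year: total document count M_t}
    """
    return {
        subject: {year: n / docs_per_year[year] for year, n in by_year.items()}
        for subject, by_year in docs_per_subject_year.items()
    }


# Hypothetical counts for two subjects over two years.
counts = {
    "subject_a": {2020: 5, 2021: 12},
    "subject_b": {2020: 15, 2021: 8},
}
totals = {2020: 20, 2021: 20}

strength = subject_strength(counts, totals)
```

Here `strength["subject_a"][2020]` is 5 / 20, so the yearly values of different subjects are directly comparable as shares of that year's literature.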
(2) Stability of technical subject
The stability of a technical subject describes how much the subject develops and fluctuates over a period of time. If the subject develops slowly or fluctuates little, its stability is relatively good; if it develops rapidly or fluctuates greatly, its stability is poor. The specific definition is as follows:
Figure BDA0003322427140000104
where
Figure BDA0003322427140000105
represents the stability of the k-th technical subject in the t-th year;
Figure BDA0003322427140000106
represents the standard deviation of the number of documents contained in all time periods before the t-th year under the k-th technical subject;
Figure BDA0003322427140000107
represents the average number of documents contained in all time periods before the t-th year under the k-th technical subject.
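The text defines the two ingredients of the stability index (a standard deviation and a mean of yearly document counts) but the formula itself is only available as an image. The sketch below assumes one plausible combination, the ratio of standard deviation to mean, where a smaller value means less fluctuation. That ratio, the use of the population standard deviation, the exclusion of year t itself, and all names and counts are assumptions made for illustration.

```python
from statistics import mean, pstdev


def fluctuation_ratio(counts_by_year, year):
    """Standard deviation / mean of yearly counts before `year` (assumed form).

    Returns None when there is no earlier period or the mean is zero.
    """
    history = [n for y, n in counts_by_year.items() if y < year]
    if not history:
        return None
    mu = mean(history)
    if mu == 0:
        return None
    return pstdev(history) / mu


# Hypothetical yearly document counts for two subjects.
steady = fluctuation_ratio({2018: 10, 2019: 10, 2020: 10}, 2021)
volatile = fluctuation_ratio({2018: 2, 2019: 4}, 2020)
```

Under this assumption a subject with constant yearly output gets a ratio of 0, and the ratio grows as the yearly counts spread out relative to their mean.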
(3) Emerging degree of technical subject
The emerging degree of a technical subject describes whether the subject contains new research content. The more recent the publication years of the documents in a technical subject, the higher its emerging degree, and the more recently the subject has been researched and applied.
Figure BDA0003322427140000108
where
Figure BDA0003322427140000109
represents the emerging degree of the k-th technical subject in the t-th year;
Figure BDA00033224271400001010
represents the sum of the publication years of the documents in all time periods before the t-th year under the k-th technical subject;
Figure BDA00033224271400001011
represents the number of documents contained in all time periods before the t-th year.
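The two quantities defined above (a sum of publication years and a document count) suggest an average publication year. The sketch below computes exactly that; treating the emerging degree as this plain average, excluding year t itself, and the function name and example years are assumptions, since the formula is only available as an image.

```python
def mean_publication_year(publication_years, year):
    """Average publication year of a subject's documents published before `year`.

    Assumed form: (sum of publication years) / (number of documents).
    Returns None when the subject has no earlier documents.
    """
    earlier = [y for y in publication_years if y < year]
    if not earlier:
        return None
    return sum(earlier) / len(earlier)


# Hypothetical publication years of the documents under one subject.
emerging = mean_publication_year([2018, 2019, 2019, 2020], 2021)
```

A subject whose documents cluster in recent years gets a larger value than one dominated by older documents, which matches the qualitative description of the index.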
(4) Frontier degree of technical subject
The frontier degree of a technical subject = contribution degree + influence degree.
The contribution degree of a technical subject is the relative share of the papers it contributes to the research front. It comprises the share of the core papers contained in the technical subject among all core papers of the front, and the share of its citing papers among all citing papers of the front. Specifically: contribution degree = core paper share + citing paper share; core paper share = number of core papers / total number of core papers in the front; citing paper share = number of citing papers / total number of citing papers in the front.
The influence degree of a technical subject is the relative share of the citation frequency of the papers it contributes to the research front. It comprises the share of the citation frequency of the core papers contained in the technical subject in the citation frequency of all core papers of the front, and the share of the citation frequency of its citing papers in the citation frequency of all citing papers of the front. Specifically: influence degree = core paper citation share + citing paper citation share; core paper citation share = citation frequency of core papers / citation frequency of core papers in the front; citing paper citation share = citation frequency of citing papers / citation frequency of citing papers in the front.
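The frontier degree is fully specified in words, so it can be written out directly. In the sketch below the dictionary keys, the function names and the example figures are hypothetical; returning 0 for a share whose denominator is zero is also an assumption, as the text does not cover that case.

```python
def share(part, whole):
    """Relative share; defined as 0.0 when the front total is zero (assumption)."""
    return part / whole if whole else 0.0


def frontier_degree(subject, front):
    """Frontier degree = contribution degree + influence degree.

    Both arguments are dicts with paper counts ("core_papers", "citing_papers")
    and citation frequencies ("core_citations", "citing_citations").
    """
    contribution = (
        share(subject["core_papers"], front["core_papers"])
        + share(subject["citing_papers"], front["citing_papers"])
    )
    influence = (
        share(subject["core_citations"], front["core_citations"])
        + share(subject["citing_citations"], front["citing_citations"])
    )
    return contribution + influence


# Hypothetical figures for one subject and for the whole research front.
subject = {"core_papers": 2, "citing_papers": 30,
           "core_citations": 50, "citing_citations": 10}
front = {"core_papers": 10, "citing_papers": 100,
         "core_citations": 200, "citing_citations": 100}

degree = frontier_degree(subject, front)
```

With these figures the four shares are 0.2, 0.3, 0.25 and 0.1, so the value can range from 0 (no presence in the front) to 4 (the subject accounts for the entire front).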
Step S4: predict the development trend of the technical subjects by using a Logistic model, and carry out comparative analysis among the technical subjects.
Specifically, the development trend of a technical subject and the laws of its evolution are predicted by observing how the four indexes (strength, stability, emerging degree and frontier degree) change along the time axis, so that the future development of the technology can be grasped at a macro level to a certain extent and guidance can be provided for current technical innovation. To show the future development of a technical subject, its development trend is predicted with a Logistic model. The Logistic model is an S-shaped curve that can track the nonlinear trend of a time series, and the slopes of its different stages fit relatively accurately the changes in a technology's development rate during the germination, growth, maturity and decline stages. Owing to its simple form and good performance, the Logistic model is widely used in technology trend prediction research. The specific calculation formula is as follows:
Figure BDA0003322427140000111
where y_t and t represent the technical subject index variable and the time variable, respectively, the index variable being any of the four measurement indexes (strength, stability, emerging degree and frontier degree) of the technical subject; b is the maximum saturation value that the curve can reach,
Figure BDA0003322427140000112
a represents the slope of the curve,
Figure BDA0003322427140000113
I represents the time node at which the curve changes between concave and convex (the inflection point), I > 0.
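To make the roles of b, a and I concrete, the sketch below assumes the common three-parameter form y_t = b / (1 + e^(-a(t - I))); whether the formula image has exactly this form is an assumption. The patent does not say how the parameters are fitted, so the fit shown here is only one simple option: with a saturation value b taken as known, ln(b / y_t - 1) = -a·t + a·I is linear in t and can be solved by ordinary least squares. The function names and the sample data are hypothetical.

```python
import math


def logistic(t, b, a, inflection):
    """Assumed Logistic curve: saturation b, slope a, inflection time I."""
    return b / (1.0 + math.exp(-a * (t - inflection)))


def fit_logistic_given_b(ts, ys, b):
    """Estimate (a, I) by linear regression of ln(b/y - 1) on t.

    Requires 0 < y < b for every observation and at least two distinct
    times with a non-flat trend (otherwise the slope is zero).
    """
    zs = [math.log(b / y - 1.0) for y in ys]
    n = len(ts)
    t_mean = sum(ts) / n
    z_mean = sum(zs) / n
    slope = (
        sum((t - t_mean) * (z - z_mean) for t, z in zip(ts, zs))
        / sum((t - t_mean) ** 2 for t in ts)
    )
    intercept = z_mean - slope * t_mean
    a = -slope                  # z = -a*t + a*I, so the slope is -a
    inflection = intercept / a  # and the intercept is a*I
    return a, inflection


# Synthetic index values generated from a known curve, then recovered.
times = list(range(1, 10))
values = [logistic(t, b=1.0, a=0.8, inflection=5.0) for t in times]
a_hat, i_hat = fit_logistic_given_b(times, values, b=1.0)

# Extrapolate the fitted curve to a future time step.
forecast = logistic(12, 1.0, a_hat, i_hat)
```

On real index series the saturation value is usually unknown as well, in which case a nonlinear least-squares routine over all three parameters would be the more usual choice.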
Through the above steps, technical subjects are extracted with the improved LDA topic model, and curves are fitted to the technical subjects with the Logistic model on the basis of the designed index measurement formulas, forming a general method for multi-index calculation and trend prediction of technical subjects that can compute subject-related indexes uniformly and accurately over large amounts of paper text data. The method is used for technical subject identification and trend prediction in a specific research field, so as to reveal research hotspots and development trends more comprehensively and accurately, provide a new perspective for theoretical research on technical subject identification and trend prediction, and provide methodological reference and decision support for the strategic planning of researchers, enterprises and the like.
Subject identification and trend prediction based on the LDA topic model and the Logistic model can be improved in many respects. Whatever the specific improvement, as long as it allows the technical subject indexes in paper text data to be calculated accurately and the technical subject trend to be predicted, the problems in the prior art can be solved and the corresponding effects obtained.
In order to implement the foregoing embodiments, the present embodiment further provides a technical subject multi-index calculation and trend prediction device 10. As shown in Fig. 3, the device 10 includes: a preprocessing module 100, an extraction module 200, a statistics module 300 and a prediction module 400.
The preprocessing module 100 is configured to acquire a plurality of paper text data and preprocess key fields of the plurality of paper text data; wherein the key fields include: paper title, abstract and keywords;
the extraction module 200 is configured to perform multi-feature weight calculation according to the preprocessed key fields, and to extract technical subjects from the weighted multi-feature fields by using an LDA topic model;
the statistics module 300 is configured to design, based on the extracted technical subjects, a plurality of index measurement methods for measuring the advancement of a technical subject, and to compute the index values of the technical subject indexes; wherein the technical subject indexes include: strength, stability, emerging degree and frontier degree;
the prediction module 400 is configured to predict the development trend of the technical subjects by using a Logistic model based on the index values, and to carry out comparative analysis among the technical subjects.
It should be noted that the foregoing explanation of the embodiment of the method for calculating multiple indexes and predicting a trend of a technical subject is also applicable to the device for calculating multiple indexes and predicting a trend of a technical subject of the embodiment, and is not repeated herein.
Furthermore, the terms "first" and "second" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include at least one such feature. In the description of the present invention, "a plurality" means at least two, e.g., two, three, etc., unless specifically limited otherwise.
In the description herein, references to the description of the term "one embodiment," "some embodiments," "an example," "a specific example," or "some examples," etc., mean that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. In this specification, the schematic representations of the terms used above are not necessarily intended to refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. Furthermore, various embodiments or examples and features of different embodiments or examples described in this specification can be combined and combined by one skilled in the art without contradiction.
Although embodiments of the present invention have been shown and described above, it is understood that the above embodiments are exemplary and should not be construed as limiting the present invention, and that variations, modifications, substitutions and alterations can be made to the above embodiments by those of ordinary skill in the art within the scope of the present invention.

Claims (10)

1. A technical subject multi-index calculation and trend prediction method is characterized by comprising the following steps:
acquiring a plurality of paper text data, and preprocessing key fields of the plurality of paper text data; wherein the key fields comprise: paper title, abstract and keywords;
performing multi-feature weight calculation according to the preprocessed key fields, and extracting technical subjects from the weighted multi-feature fields by using an LDA topic model;
designing, based on the extracted technical subjects, a plurality of index measurement methods for measuring the advancement of a technical subject, and computing index values of the technical subject indexes; wherein the technical subject indexes comprise: strength, stability, emerging degree and frontier degree;
and predicting the development trend of the technical subjects by using a Logistic model based on the index values, and carrying out comparative analysis among the technical subjects.
2. The method of claim 1, wherein the preprocessing comprises:
performing multiple types of preprocessing on the acquired paper text data, including plain-text format conversion, punctuation removal, digit removal, word segmentation and stop-word filtering.
3. The method of claim 1, further comprising:
updating the hyper-parameters of the LDA topic model by Gibbs sampling, and determining the optimal number of technical topics by perplexity.
4. The method of claim 1, wherein the multi-feature weight calculation comprises:
measuring the weight of the multi-feature fields by adopting the similarity degree between the multi-feature fields, and calculating the cosine similarity between the multi-feature fields; wherein the relative importance degree calculation formula of the multi-feature field is as follows:
Figure FDA0003322427130000011
Figure FDA0003322427130000012
Figure FDA0003322427130000021
wherein, DuteThe weight of the token field is proportional to the relevance of the multi-token field, nmThe number of terms contained in the document m is shown; n ismiFor a document m field TmiTraversing the average value of the occurrence times of all terms; mu.smjThe calculation formula is the same, i, j is 1,2, 3.
5. The technical subject multi-index calculation and trend prediction method of claim 1, wherein the technical subject extraction of the weighted multi-feature field by using the LDA subject model comprises:
selecting a document d_m according to a prior probability P(d_m);
sampling from a Dirichlet distribution α to generate the topic distribution θ_m of the document d_m;
sampling from the topic distribution θ_m to generate the topic z_{m,n} of the n-th word of the document d_m, selecting the topic with the maximum probability value;
sampling from a Dirichlet distribution β to generate the term distribution corresponding to the topic z_{m,n}
Figure FDA0003322427130000022
sampling from the term distribution
Figure FDA0003322427130000023
to generate the term w_{m,n}, selecting the term with the maximum probability value;
iterating the above process until the text set is generated.
6. The method of claim 1, wherein the counting the index values of the technical subject indexes comprises:
calculating the technical subject intensity by the following calculation formula:
Figure FDA0003322427130000024
wherein
Figure FDA0003322427130000025
represents the strength of the k-th technical subject in the t-th year;
Figure FDA0003322427130000026
represents the number of documents contained in the t-th year under the k-th technical subject, 1 ≤ k ≤ K; M_t represents the number of documents contained in the t-th year;
calculating the stability of the technical subject by the following calculation formula:
Figure FDA0003322427130000027
wherein
Figure FDA0003322427130000028
represents the stability of the k-th technical subject in the t-th year;
Figure FDA0003322427130000029
represents the standard deviation of the number of documents contained in all time periods before the t-th year under the k-th technical subject;
Figure FDA00033224271300000210
represents the average number of documents contained in all time periods before the t-th year under the k-th technical subject;
calculating the technical subject emerging degree, wherein the calculation formula is as follows:
Figure FDA00033224271300000211
wherein
Figure FDA0003322427130000031
represents the emerging degree of the k-th technical subject in the t-th year;
Figure FDA0003322427130000032
represents the sum of the publication years of the documents in all time periods before the t-th year under the k-th technical subject;
Figure FDA0003322427130000033
represents the number of documents contained in all time periods before the t-th year;
calculating the frontier degree of the technical subject, where frontier degree = contribution degree + influence degree, calculated as follows:
contribution degree = core paper share + citing paper share; core paper share = number of core papers / total number of core papers in the front; citing paper share = number of citing papers / total number of citing papers in the front;
influence degree = core paper citation share + citing paper citation share; core paper citation share = citation frequency of core papers / citation frequency of core papers in the front; citing paper citation share = citation frequency of citing papers / citation frequency of citing papers in the front.
7. The technical subject multi-index calculation and trend prediction method according to claim 1, wherein predicting the technical subject development trend by using a Logistic model comprises:
the Logistic model tracks the nonlinear trend of a time series, and the slopes of its different stages fit the changes in the development rate of the technology in the germination, growth, maturity and decline periods, with the following formula:
Figure FDA0003322427130000034
wherein y_t and t represent the technical subject index variable and the time variable, respectively, the index variable being any of the four measurement indexes (strength, stability, emerging degree and frontier degree) of the technical subject; b is the maximum saturation value that the curve can reach,
Figure FDA0003322427130000035
a represents the slope of the curve,
Figure FDA0003322427130000036
I represents the time node at which the curve changes between concave and convex (the inflection point), I > 0.
8. The technical subject multi-index calculation and trend prediction method of claim 3, wherein the updating the hyper-parameters of the LDA subject model by Gibbs sampling comprises:
estimating the document-topic probability distribution and the topic-term probability distribution by Gibbs sampling, wherein the parameter estimation formulas of the two distributions are as follows:
Figure FDA0003322427130000037
Figure FDA0003322427130000038
wherein
Figure FDA0003322427130000039
is the number of times topic k occurs in the m-th text in the document-topic distribution; α_k is the Dirichlet prior of the document-topic distribution;
Figure FDA00033224271300000310
is the number of occurrences of term t under the k-th topic in the topic-term distribution; β_t is the Dirichlet prior of the topic-term distribution.
9. The technical subject multi-index calculation and trend prediction method of claim 3, wherein the determining the optimal number of technical topics by perplexity comprises:
applying perplexity to the LDA model to generate a perplexity curve, wherein the number of topics corresponding to the lowest point or inflection point of the perplexity curve is the determined optimal number, and the calculation formula is as follows:
Figure FDA0003322427130000041
10. A technical subject multi-index calculation and trend prediction device, characterized by comprising:
a preprocessing module, configured to acquire a plurality of paper text data and preprocess key fields of the plurality of paper text data; wherein the key fields comprise: paper title, abstract and keywords;
an extraction module, configured to perform multi-feature weight calculation according to the preprocessed key fields, and to extract technical subjects from the weighted multi-feature fields by using an LDA topic model;
a statistics module, configured to design, based on the extracted technical subjects, a plurality of index measurement methods for measuring the advancement of a technical subject, and to compute the index values of the technical subject indexes; wherein the technical subject indexes comprise: strength, stability, emerging degree and frontier degree;
and a prediction module, configured to predict the development trend of the technical subjects by using a Logistic model based on the index values, and to carry out comparative analysis among the technical subjects.
CN202111250505.XA 2021-10-26 2021-10-26 Technical subject multi-index calculation and trend prediction method and device Pending CN113988087A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111250505.XA CN113988087A (en) 2021-10-26 2021-10-26 Technical subject multi-index calculation and trend prediction method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111250505.XA CN113988087A (en) 2021-10-26 2021-10-26 Technical subject multi-index calculation and trend prediction method and device

Publications (1)

Publication Number Publication Date
CN113988087A true CN113988087A (en) 2022-01-28

Family

ID=79741881

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111250505.XA Pending CN113988087A (en) 2021-10-26 2021-10-26 Technical subject multi-index calculation and trend prediction method and device

Country Status (1)

Country Link
CN (1) CN113988087A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116644338A (en) * 2023-06-01 2023-08-25 北京智谱华章科技有限公司 Literature topic classification method, device, equipment and medium based on mixed similarity
CN116644338B (en) * 2023-06-01 2024-01-30 北京智谱华章科技有限公司 Literature topic classification method, device, equipment and medium based on mixed similarity

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB03 Change of inventor or designer information

Inventor after: Li Yue

Inventor after: Chou Yu

Inventor after: Liu Debing

Inventor after: Chu Xiaoquan

Inventor before: Li Yue

Inventor before: Chou Yu

Inventor before: Tang Jie

Inventor before: Liu Debing

Inventor before: Chu Xiaoquan
