CN113988087A

CN113988087A - Technical subject multi-index calculation and trend prediction method and device

Info

Publication number: CN113988087A
Application number: CN202111250505.XA
Authority: CN
Inventors: 李玥; 仇瑜; 唐杰; 刘德兵; 褚晓泉
Original assignee: Beijing Zhipu Huazhang Technology Co ltd
Current assignee: Beijing Zhipu Huazhang Technology Co ltd
Priority date: 2021-10-26
Filing date: 2021-10-26
Publication date: 2022-01-28

Abstract

The invention provides a method and a device for technical subject multi-index calculation and trend prediction. The method comprises the following steps: acquiring a plurality of paper text data and preprocessing key fields of the paper text data, wherein the key fields comprise the paper title, abstract and keywords; performing multi-feature weight calculation on the preprocessed key fields, and extracting technical subjects from the weighted multi-feature fields with an LDA topic model; based on the extracted technical subjects, designing a plurality of index measurement methods for measuring how advanced a technical subject is, and computing the index values of the technical subject indexes, wherein the indexes comprise strength, stability, emerging degree and frontier degree; and predicting the development trend of the technical subjects with a Logistic model based on the index values, and comparing the technical subjects with one another. The method extracts multiple features of the text, divides subjects clearly, and forms a general method for technical subject index calculation and trend prediction.

Description

Technical subject multi-index calculation and trend prediction method and device

Technical Field

The invention relates to the field of text theme recognition and trend prediction, in particular to a technical theme multi-index calculation and trend prediction method and device.

Background

Technical subject identification and trend prediction give insight into academic development trends in various fields and form the basis of technology trend analysis tools such as research hotspot analysis and research trend analysis. They support scientific decision-making at all levels, provide technical support for seizing the commanding heights of global science and technology and for the goal of becoming a world power in science and technology, and are of great significance for promoting scientific and technological innovation, breaking international monopolies and filling domestic gaps. Academic papers, as one of the main carriers of scientific achievements, contain a large number of research topics. Analyzing the evolution and trends of technical subjects in the papers of a specific research field makes it possible to grasp the course of technological development at the macro level and to clarify the evolution trend and development stage of key technologies; such analysis is becoming an important factor influencing government decision-making and industrial development. How to extract and identify the frontier hot spots of a field quickly and accurately from the scientific literature is therefore of significant research value for current scientific research.

In technical topic identification, existing research methods mainly include co-word analysis, word frequency analysis, co-citation analysis, content analysis, social network analysis and topic models. A topic model is an unsupervised learning method that effectively captures the latent topics of documents and is now widely applied in text analysis. The Latent Dirichlet Allocation (LDA) topic model captures the semantic information of large-scale corpora well, overcomes label limitations and semantic ambiguity, and alleviates problems such as the high dimensionality and sparsity of the data; it has therefore attracted the attention of researchers working on topic identification in research papers.

In the aspect of technical subject trend prediction, a Logistic model is a multivariate statistical analysis method based on independent variables and dependent variables, and researchers widely use the Logistic model to perform technical subject prediction analysis.

Earlier research has made some progress in technical subject identification and prediction and provides an important reference for national science and technology strategy and for researchers choosing frontier problems. However, current research still has problems. Current topic models for research papers mostly use the abstract alone to extract the technical subjects of a specific field, which easily makes the extracted technical subjects inaccurate and mutually correlated.

Disclosure of Invention

The present invention is directed to solving, at least to some extent, one of the technical problems in the related art.

The technical subject multi-index calculation and trend prediction method solves the technical problem that current topic models for research papers mostly use the abstract alone to extract the technical subjects of a specific field, so that the extracted technical subjects are inaccurate and mutually correlated.

Therefore, a first objective of the present invention is to provide a method for calculating multiple indexes and predicting a trend of a technical subject, including:

acquiring a plurality of paper text data, and preprocessing key fields of the plurality of paper text data; wherein the key fields comprise: paper title, abstract and keywords;

performing multi-feature weight calculation according to the key fields after the preprocessing operation, and performing technical subject extraction on the weighted multi-feature fields by adopting an LDA topic model;

designing a plurality of index measurement methods for measuring the advancement of the technical theme based on the extracted technical theme, and counting index values of technical theme indexes; wherein the technical subject index comprises: strength, stability, emerging degree, frontier degree;

and predicting the development trend of the technical subjects by adopting a Logistic model based on the index values, and performing comparative analysis on the technical subjects.

In addition, the technical subject multi-index calculation and trend prediction method according to the above embodiment of the present invention may further have the following additional technical features:

further, the preprocessing operation comprises: converting the acquired paper text data from its various formats into plain text, removing punctuation marks, removing digits, performing word segmentation, and filtering stop words.

Optionally, the method updates the hyper-parameters of the LDA topic model by Gibbs sampling, and determines the optimal number of technical topics by using the perplexity.

Further, the multi-feature weight calculation measures the weight of the multi-feature fields by adopting the similarity degree between the multi-feature fields, and calculates the cosine similarity between the multi-feature fields; wherein the relative importance degree calculation formula of the multi-feature field is as follows:

wherein the weight of a multi-feature field is proportional to the correlation degree of that field; n_m is the number of terms contained in document m; n_mi is the average number of occurrences of the terms in field T_mi of document m, taken over all terms; μ_mj is calculated in the same way; and i, j = 1, 2, 3.

Further, the technical theme extraction of the weighted multi-feature fields by using the LDA theme model includes:

selecting a document d_m according to a prior probability P(d_m);

sampling from a Dirichlet distribution with parameter α to generate the topic distribution θ_m of the document d_m;

sampling from the topic distribution θ_m to generate the topic z_m,n of the n-th word of the document d_m, selecting the topic with the maximum probability value;

sampling from a Dirichlet distribution with parameter β to generate the term distribution φ corresponding to the topic z_m,n;

sampling from the term distribution φ to generate the term w_m,n, selecting the term with the maximum probability value;

iterating the above process until the text set is generated.

Further, the index value of the statistical technical subject index comprises:

calculating the technical subject intensity by the following calculation formula:

wherein the terms of the formula are, in order: the intensity of the k-th technical subject in year t; the number of documents contained in year t under the k-th technical subject, with 1 ≤ k ≤ K; and M^t, the number of documents contained in year t;

calculating the stability of the technical subject by the following calculation formula:

wherein the terms of the formula are, in order: the stability of the k-th technical subject in year t; the standard deviation of the number of documents contained in all time periods before year t under the k-th technical subject; and the average of the number of documents contained in all time periods before year t under the k-th technical subject;

calculating the technical subject emerging degree, wherein the calculation formula is as follows:

wherein the terms of the formula are, in order: the emerging degree of the k-th technical subject in year t; the sum of the publication years of the documents in all time periods before year t under the k-th technical subject; and the number of documents contained in all time periods before year t;

the technical subject frontier degree = contribution degree + influence degree, calculated as follows:

contribution degree = core paper share + citing paper share; core paper share = number of core papers / total number of core papers in the frontier; citing paper share = number of citing papers / total number of citing papers in the frontier;

influence degree = citation frequency share of core papers + citation frequency share of citing papers; citation frequency share of core papers = citation frequency of the core papers / citation frequency of all core papers in the frontier; citation frequency share of citing papers = citation frequency of the citing papers / citation frequency of all citing papers in the frontier.

Further, the predicting the development trend of the technical subject by using the Logistic model comprises the following steps:

the Logistic model tracks the nonlinear variation trend of a time series, and its slope fits the changes in the development rate of a technology across the germination, growth, maturity and decline stages; the calculation formula is as follows:

wherein y_t and t represent the technical subject index variable and the time variable respectively, the index variable being one of the four measurement indexes of the technical subject (strength, stability, emerging degree and frontier degree); b is the maximum saturation value that the curve can reach; a represents the slope of the curve; and I represents the time node at which the curve changes between concave and convex (the inflection point), with I > 0.
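As an illustration only, the parameter roles described above (saturation value b, slope a, inflection time I) are consistent with the common three-parameter logistic curve y_t = b / (1 + exp(-a (t - I))). This exact functional form is an assumption, since the formula itself is not reproduced in the text, and the parameter values below are invented for the example. In practice the parameters would be fitted to the yearly index values; fitting is not shown.

```python
import math

def logistic(t, b, a, inflection):
    """Assumed three-parameter logistic curve: b / (1 + exp(-a * (t - inflection)))."""
    return b / (1.0 + math.exp(-a * (t - inflection)))

# Illustrative (made-up) parameters: saturation 0.8, slope 0.9, inflection at t = 6.
b, a, inflection = 0.8, 0.9, 6.0
curve = [logistic(t, b, a, inflection) for t in range(0, 13)]
```

Under this assumed form the curve reaches half of the saturation value at t = I and approaches b as t grows.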

Further, the updating the hyper-parameters of the LDA topic model by using Gibbs sampling includes:

and estimating the document-topic probability distribution and the topic-term probability distribution by Gibbs sampling, the parameter estimation formulas of the two distributions being as follows:

wherein the terms of the formulas are: for the document-topic distribution, the number of times topic k occurs in the m-th text, with α_k the document-topic Dirichlet prior; and for the topic-term distribution, the number of times term t is assigned to the k-th topic, with β_t the topic-term Dirichlet prior.

Further, the determining the optimal number of technical topics by using the perplexity comprises:

applying the perplexity to the LDA model to generate a perplexity curve, wherein the number of topics corresponding to the lowest point or inflection point of the perplexity curve is the determined optimal number, and the calculation formula is as follows:

to achieve the above object, a second aspect of the present invention provides a technical subject multi-index calculating and trend predicting device, including:

the preprocessing module is used for acquiring a plurality of paper text data and preprocessing key fields of the plurality of paper text data; wherein the key fields comprise: paper title, abstract and keywords;

the extraction module is used for carrying out multi-feature weight calculation according to the key fields after the preprocessing operation and carrying out technical subject extraction on the weighted multi-feature fields by adopting an LDA topic model;

the statistical module is used for designing a plurality of index measurement methods for measuring the advancement of the technical theme based on the extracted technical theme and counting the index values of the technical theme indexes; wherein the technical subject index comprises: strength, stability, emerging degree, frontier degree;

and the prediction module is used for predicting the development trend of the technical subjects by adopting a Logistic model based on the index values and carrying out comparative analysis among the technical subjects.

The technical subject multi-index calculation and trend prediction device solves the technical problem that current topic modeling devices for research papers mostly use the abstract alone to extract the technical subjects of a specific field, so that the extracted technical subjects are inaccurate and mutually correlated.

The beneficial effects of the invention are as follows: the method identifies technical subjects and predicts trends in a specific research field, reveals research hot spots and development trends more comprehensively and accurately, provides a new perspective for theoretical research on technical subject identification and trend prediction, and provides methodological reference and decision support for the strategic planning of researchers, enterprises and others.

Additional aspects and advantages of the invention will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of the invention.

Drawings

The foregoing and/or additional aspects and advantages of the present invention will become apparent and readily appreciated from the following description of the embodiments, taken in conjunction with the accompanying drawings of which:

fig. 1 is a flowchart of a method for calculating multiple indexes and predicting a trend of a technical subject according to an embodiment of the present invention;

FIG. 2 is a schematic diagram of an LDA graph model according to an embodiment of the present invention;

fig. 3 is a schematic structural diagram of a technical subject multi-index calculation and trend prediction apparatus according to an embodiment of the present invention.

Detailed Description

Reference will now be made in detail to embodiments of the present invention, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to the same or similar elements or elements having the same or similar function throughout. The embodiments described below with reference to the drawings are exemplary, are intended to explain the invention, and are not to be construed as limiting the invention.

The following describes a technical subject multi-index calculation and trend prediction method and apparatus according to an embodiment of the present invention with reference to the drawings, and first, a technical subject multi-index calculation and trend prediction method according to an embodiment of the present invention will be described with reference to the drawings.

Fig. 1 is a flowchart of a method for calculating multiple indexes and predicting a trend according to an embodiment of the present invention.

As shown in fig. 1, the method for calculating multiple indexes and predicting a trend of a technical subject includes the following steps:

step S1, acquiring a plurality of paper text data, and preprocessing key fields of the plurality of paper text data; wherein, the key field includes: paper title, abstract and keywords.

Specifically, the invention builds a corpus from academic paper data in the field of artificial intelligence, including the title, abstract, keywords, authors, institutions, publication time and citation count of each paper. The aim is to explore the research frontier of artificial intelligence and its future development, and to provide a reference for the analysis of other research frontiers.

The title, abstract and keywords of a paper characterize the technical subject information of the specific field the paper belongs to, and the technical information carried by these semi-structured fields is the focus of the research. Before the text data are input into the LDA topic model, the title, abstract and keywords of each paper are preprocessed separately to form the basic data for topic identification. The data preprocessing mainly comprises converting the acquired text data to plain text, removing punctuation marks, removing digits, performing word segmentation and filtering stop words, and provides clean experimental data for the subsequent analysis.
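The preprocessing steps listed above can be sketched as follows. This is a minimal illustration, not the embodiment's implementation: it assumes English text split on whitespace (Chinese text would need a dedicated word segmenter), and the stop-word list, function names and sample paper are hypothetical.

```python
import re
import string

# Hypothetical, tiny stop-word list for illustration; a real run would use a fuller list.
STOP_WORDS = {"a", "an", "and", "for", "in", "is", "of", "on", "the", "to", "with"}

def preprocess(text):
    """Lowercase, strip punctuation and digits, split on whitespace, drop stop words."""
    text = text.lower()
    # Replace every punctuation character with a space.
    text = text.translate(str.maketrans(string.punctuation, " " * len(string.punctuation)))
    # Remove digits.
    text = re.sub(r"\d+", " ", text)
    tokens = text.split()
    return [tok for tok in tokens if tok not in STOP_WORDS]

def preprocess_paper(paper):
    """Preprocess the three key fields of one paper separately."""
    return {field: preprocess(paper[field]) for field in ("title", "abstract", "keywords")}

# Made-up sample record.
paper = {
    "title": "Topic Models for Trend Prediction in 2021",
    "abstract": "We extract topics with LDA, and predict the trend.",
    "keywords": "LDA; topic model; trend prediction",
}
fields = preprocess_paper(paper)
```

Keeping the three fields separate at this stage matters because the fields are weighted individually before topic modeling.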

And step S2, performing multi-feature weight calculation according to the key fields after the preprocessing operation, and performing technical subject extraction on the weighted multi-feature fields by adopting an LDA subject model.

Specifically, assume a preprocessed text corpus D containing M texts, D = {d_1, d_2, …, d_M}. Each text is covered by 3 feature fields T_1, T_2, T_3, so the corpus can be expressed as an M × 3 feature field matrix (Table 1), with d_m = (T_m1, T_m2, T_m3), 1 ≤ m ≤ M, d_m ∈ D. T_1, T_2 and T_3 denote the title, abstract and keyword fields of the paper respectively, and each field contains a plurality of terms.

TABLE 1 text set feature matrix

Because the feature fields of a paper differ in how important and how representative their technical subject content is, the embodiment of the invention constructs feature weights for the texts of the corpus. This effectively reduces the difficulty of topic identification, increases the interpretability of the topics, makes the division between topics clearer, and improves the generalization ability of topic identification. The weight of a feature field reflects the contribution of that feature to the text content and its discriminating power, and the calculation of the feature field weights has a large influence on the quality of topic extraction. The corpus finally supplied to the LDA model is D = {aT_1, bT_2, cT_3}, where a, b and c are the weights of the paper title, abstract and keywords respectively, and the feature field weights sum to 1. The more strongly a feature field correlates with the other feature fields, the better its expressive ability. The weight of a feature field with strong expressive ability is therefore increased, and the weight of a feature field with poor expressive ability is decreased. The weights are measured by the degree of similarity between feature fields: the cosine similarity between the features is calculated following the TF-IDF idea, and the relative importance of a feature field is calculated as follows:

In the formula, the weight of a feature field is proportional to the relevance of that feature field; n_m is the number of terms contained in document m; n_mi is the average number of occurrences of the terms in field T_mi of document m, taken over all terms; μ_mj is calculated in the same way; and i, j = 1, 2, 3.
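Since the weighting formula itself is not reproduced here, the following is only a sketch of the general idea: each field is represented as a term-count vector, pairwise cosine similarities are computed, and each field's weight is its mean similarity to the other fields, normalized so the weights sum to 1. The use of raw counts instead of TF-IDF values, the averaging rule, and all names are assumptions made for illustration.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two term-count dictionaries."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def field_weights(fields):
    """Weight each field by its mean cosine similarity to the other fields (assumed rule)."""
    counts = {name: Counter(tokens) for name, tokens in fields.items()}
    names = list(counts)
    relatedness = {}
    for i in names:
        others = [cosine(counts[i], counts[j]) for j in names if j != i]
        relatedness[i] = sum(others) / len(others)
    total = sum(relatedness.values())
    if total == 0:
        # No term overlap at all: fall back to equal weights.
        return {name: 1.0 / len(names) for name in names}
    return {name: r / total for name, r in relatedness.items()}

# Made-up token lists for the three fields of one paper.
fields = {
    "title": ["topic", "model", "trend"],
    "abstract": ["topic", "model", "trend", "prediction", "logistic", "topic"],
    "keywords": ["topic", "model"],
}
weights = field_weights(fields)
```

The resulting values play the role of the weights a, b and c for the title, abstract and keywords.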

Further, in step S2, an LDA topic model is used to perform technical topic extraction on the weighted multi-feature fields.

It can be understood that the LDA model is an unsupervised machine learning method that can identify hidden topic information in a large-scale document set or corpus, and is currently widely applied in fields such as text mining. The LDA model is a three-level hierarchical Bayesian model. It assumes that each document is a mixture of several topics and that each topic is a probability distribution over words. Topic discovery and text generation are inverse processes: topic discovery finds, from the text, the topics described by its terms and the associations among different topics, and the same text can have several topics; text generation selects, for each topic, matching terms from the vocabulary to form the text. The two run in opposite directions, and the LDA model solves the inverse problem of text generation by means of the prior parameters.

The core problem in modeling document topics with the LDA model is to estimate the probability distributions of the hidden variables, that is, to obtain the hidden topic distribution of each target document and the term distribution of each hidden topic, which helps to further analyze the development trend of technical subjects. The content of the feature-weighted text set D relates to K topics (1 ≤ k ≤ K); the terms used by every text come from the same vocabulary of V elements (the V distinct words of the bag of words, treated as independent and identically distributed); and each document m contains N_m terms (1 ≤ n ≤ N_m), which may include repeated words. A document is generated as follows:

(1) Select a document d_m according to the prior probability P(d_m).

(2) Sample from the Dirichlet distribution with parameter α to generate the topic distribution θ_m of document d_m; that is, the topic distribution θ_m is generated from a Dirichlet distribution with hyperparameter α. The dimensionality of the text-topic distribution is M × K, as shown in Table 2.

TABLE 2 text-topic distribution

(3) Sample from the multinomial topic distribution θ_m to generate the topic z_m,n of the n-th word of document d_m; the topic with the highest probability value is selected according to the distribution.

(4) Sample from the Dirichlet distribution with parameter β to generate the term distribution φ corresponding to topic z_m,n; that is, the term distribution φ is generated from a Dirichlet distribution with hyperparameter β. The dimensionality of the topic-term distribution is K × N_m, as shown in Table 3.

TABLE 3 topic-term distributions

(5) Sample from the multinomial term distribution φ to generate the term w_m,n; the term with the highest probability value is selected according to the distribution.

(6) Iterate the above process until the text set is generated.
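The generative steps above can be sketched with the standard library alone. The sketch follows the usual LDA formulation, in which topics and terms are drawn at random from their distributions rather than chosen by highest probability; the function names, vocabulary and parameter values are illustrative assumptions.

```python
import random

def sample_dirichlet(alphas, rng):
    """Draw from a Dirichlet distribution by normalizing independent Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alphas]
    total = sum(draws)
    return [d / total for d in draws]

def sample_index(probs, rng):
    """Draw one index from a categorical distribution."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

def generate_corpus(num_docs, doc_len, num_topics, vocab, alpha, beta, seed=0):
    """Generate a toy corpus following the LDA generative process."""
    rng = random.Random(seed)
    # One term distribution per topic, drawn from Dirichlet(beta).
    phi = [sample_dirichlet([beta] * len(vocab), rng) for _ in range(num_topics)]
    corpus = []
    for _ in range(num_docs):
        # Topic distribution of this document, drawn from Dirichlet(alpha).
        theta = sample_dirichlet([alpha] * num_topics, rng)
        words = []
        for _ in range(doc_len):
            z = sample_index(theta, rng)   # topic of the n-th word
            w = sample_index(phi[z], rng)  # term drawn from that topic
            words.append(vocab[w])
        corpus.append(words)
    return corpus

# Illustrative vocabulary and hyperparameter values.
vocab = ["lda", "topic", "trend", "logistic", "index", "paper"]
corpus = generate_corpus(num_docs=3, doc_len=5, num_topics=2, vocab=vocab, alpha=0.5, beta=0.1)
```

Topic modeling runs this process in reverse: given only the words, it recovers plausible values for the topic and term distributions.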

The LDA topic model considers a document as a collection of keywords, and does not consider any grammatical or word occurrence order relationship in the process of constructing the topic model, and a bayesian network diagram of the document is generated by using the model as shown in fig. 2.

In FIG. 2, the rectangles represent the scope over which the enclosed variables are repeated, indexed by m, n and k; the double circles represent known quantities that can be observed experimentally. The hyperparameter α is the parameter of the Dirichlet prior on the topic distribution of each document, and β is the parameter of the Dirichlet prior on the term distribution of each topic; α and β smooth the multinomial parameters drawn from the Dirichlet distributions. The values of the hyperparameters affect which terms are obtained under each topic and ultimately the accuracy of the algorithm; empirically, the optimal values are α = 0.5 and β = 0.01. The random variable θ represents the topic distribution vector of the target document and is a parameter to be estimated. The random variable φ represents the term distribution vector of the target topic and is also a parameter to be estimated. The latent variable z_m,n represents the topic assigned to the n-th word of target document m and reflects the latent relation between documents and terms; w_m,n represents the n-th word in the m-th document.

The LDA generation of the text set is mainly divided into two steps. The first step solves the document-topic posterior probability from the document-topic Dirichlet prior probability; the second step solves the topic-term posterior probability from the topic-term Dirichlet prior probability. From a large amount of known document-term information P(w_n|d_m), the document-topic distribution P(z_k|d_m) and the topic-term distribution P(w_n|z_k) can be trained, according to the following formula:

therefore, the generation probability of each word in the obtained document is:
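In the standard LDA formulation, the generation probability of a word in a document is written as a mixture over the K topics. The expression below is that textbook form, given on the assumption that it matches the formula intended here; it uses only the quantities P(w_n|z_k) and P(z_k|d_m) already named in the text.

```latex
P(w_n \mid d_m) = \sum_{k=1}^{K} P(w_n \mid z_k)\, P(z_k \mid d_m)
```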

For a given corpus, the LDA topic modeling process estimates the parameters of P(w_n|z_k) and P(z_k|d_m). P(d_m) is the probability of document d_m, namely the product of the occurrence probabilities of all terms in d_m, and can be calculated in advance. P(w_n|z_k) and P(z_k|d_m) involve latent variables and cannot be obtained by direct calculation, so a common method is parameter estimation by Gibbs sampling. Gibbs sampling is a Markov Chain Monte Carlo (MCMC) algorithm that produces a sequence of samples approximately distributed according to a specified multidimensional probability distribution, and can eliminate the influence of the prior parameters on the result. The document-topic probability distribution and the topic-term probability distribution are estimated by Gibbs sampling, and the parameter estimation formulas of the two distributions are as follows:

in the formula (I), the compound is shown in the specification,

P(d_m, w_n) is the joint probability of document d_m and term w_n. The core of the mathematical principle of the LDA model is that the information needed for text topic discovery is concentrated in these two probability distributions. However, the result calculated by the formula is uncertain because the formula contains prior parameters, so repeated iteration is needed to obtain stable text-topic and topic-term distributions. The update rules of the two distributions are obtained through Gibbs sampling, which determines the LDA model.
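A compact collapsed Gibbs sampler illustrates how the two distributions can be estimated from counts. The final estimates use the widely used smoothed-count form, θ_m,k = (n_m,k + α) / Σ_k (n_m,k + α) and φ_k,t = (n_k,t + β) / Σ_t (n_k,t + β), with symmetric priors; this form is assumed to correspond to the estimation formulas referred to above. The function name, toy documents and iteration count are illustrative.

```python
import random

def gibbs_lda(docs, num_topics, vocab_size, alpha=0.5, beta=0.01, iters=50, seed=0):
    """Collapsed Gibbs sampling for LDA; docs are lists of integer term ids."""
    rng = random.Random(seed)
    n_mk = [[0] * num_topics for _ in docs]               # topic counts per document
    n_kt = [[0] * vocab_size for _ in range(num_topics)]  # term counts per topic
    n_k = [0] * num_topics                                # total terms per topic
    z = []
    # Random initial topic assignment for every word.
    for m, doc in enumerate(docs):
        z_m = []
        for t in doc:
            k = rng.randrange(num_topics)
            z_m.append(k)
            n_mk[m][k] += 1
            n_kt[k][t] += 1
            n_k[k] += 1
        z.append(z_m)
    for _ in range(iters):
        for m, doc in enumerate(docs):
            for n, t in enumerate(doc):
                k = z[m][n]
                # Remove the current assignment from the counts.
                n_mk[m][k] -= 1
                n_kt[k][t] -= 1
                n_k[k] -= 1
                # Full conditional for each candidate topic, up to a constant.
                weights = [
                    (n_mk[m][j] + alpha) * (n_kt[j][t] + beta) / (n_k[j] + vocab_size * beta)
                    for j in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights, k=1)[0]
                z[m][n] = k
                n_mk[m][k] += 1
                n_kt[k][t] += 1
                n_k[k] += 1
    # Point estimates: smoothed, normalized counts.
    theta = [[(c + alpha) / (len(doc) + num_topics * alpha) for c in n_mk[m]]
             for m, doc in enumerate(docs)]
    phi = [[(c + beta) / (n_k[k] + vocab_size * beta) for c in n_kt[k]]
           for k in range(num_topics)]
    return theta, phi

# Toy corpus: term ids 0-2 and 3-5 tend to co-occur.
docs = [[0, 1, 0, 1, 2], [3, 4, 3, 4, 5], [0, 1, 2, 0], [3, 5, 4, 4]]
theta, phi = gibbs_lda(docs, num_topics=2, vocab_size=6)
```

Each row of `theta` is a document's topic distribution and each row of `phi` is a topic's term distribution; both sum to 1 by construction.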

Further, the LDA model input text is obtained after the text data are preprocessed and the key fields are feature-weighted; the LDA topic model is then used for modeling, and the LDA model parameters are updated through Gibbs sampling. Although the LDA model is thereby constructed, the model cannot itself determine the number of technical subjects in the specific field of the papers, and the distribution of the extracted topics is strongly affected by the number of topics. If the number of topics is too small, the topic granularity is too coarse: topics without a clear semantic category are generated and some topic details cannot be effectively captured. If the number of topics is too large, the topic information is easily over-separated. Therefore, how to determine the number of technical subjects scientifically is the key of the research.

The embodiment of the invention uses perplexity to determine the optimal number of topics, which avoids subjective, manual selection and represents the implicit topics actually contained in the analysis object objectively and consistently. Perplexity measures how well a probability distribution or probability model predicts a sample. It generally decreases as the number of latent topics increases, and the smaller the perplexity value, the better the generating capability of the topic model. The number of topics corresponding to the lowest point or inflection point of the perplexity curve is the determined optimal number of topics, and the calculation formula applied to the LDA model is as follows:
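The usual definition of perplexity for a topic model is exp(-Σ_m log p(w_m) / Σ_m N_m), where p(w_m) is the likelihood of the words of document m and N_m is its length. The sketch below evaluates that definition from given document-topic and topic-term matrices; it is assumed to correspond to the formula referred to above, and the function name and toy values are illustrative.

```python
import math

def perplexity(docs, theta, phi):
    """exp(-sum_m log p(w_m) / sum_m N_m), with p(w | d_m) = sum_k theta[m][k] * phi[k][w]."""
    log_likelihood = 0.0
    num_words = 0
    for m, doc in enumerate(docs):
        for w in doc:
            p_w = sum(theta[m][k] * phi[k][w] for k in range(len(phi)))
            log_likelihood += math.log(p_w)
            num_words += 1
    return math.exp(-log_likelihood / num_words)

# Toy example: two documents over a vocabulary of four term ids.
docs = [[0, 1, 0], [2, 3, 3]]
theta = [[0.9, 0.1], [0.1, 0.9]]
phi = [[0.5, 0.5, 0.0, 0.0], [0.0, 0.0, 0.5, 0.5]]
value = perplexity(docs, theta, phi)
```

To choose the number of topics, the model would be trained for several candidate values of K and the perplexity plotted against K, taking the lowest point or inflection point of the curve.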

step S3, designing a plurality of index measurement methods for measuring the advancement of the technical theme based on the extracted technical theme, and counting the index values of the technical theme indexes; wherein, the technical subject index comprises: strength, stability, emerging degree, frontier degree;

specifically, on the basis of the identified document research technical theme, four index calculation formulas of the strength, the stability, the emerging degree and the frontier degree of the technical theme are designed, and the technical theme index of each year is counted, so that the evolution trend of the technical theme is compared more visually.

(1) Strength of technical subject

The strength of a technical subject describes how popular the subject is: the more documents related to a subject at a given time, the higher its strength and the more intensively it is being researched. The calculation formula is as follows:

In the formula, the terms are, in order: the strength of the k-th technical subject in year t; the number of documents contained in year t under the k-th technical subject, with 1 ≤ k ≤ K; and M^t, the number of documents contained in year t.
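The definitions above suggest that the strength of subject k in year t is the share of that year's documents that belong to the subject, i.e. the subject's document count in year t divided by M^t. The sketch below computes that ratio; the ratio form, the assumption that each document is assigned to a single dominant subject, and the sample records are all illustrative.

```python
from collections import Counter

def topic_strength(records):
    """records: list of (year, topic) pairs, one per document.

    Returns {(topic, year): share of that year's documents assigned to the topic}.
    """
    per_year = Counter(year for year, _ in records)
    per_topic_year = Counter((topic, year) for year, topic in records)
    return {key: count / per_year[key[1]] for key, count in per_topic_year.items()}

# Made-up assignments of documents to subjects "A" and "B".
records = [
    (2019, "A"), (2019, "A"), (2019, "B"),
    (2020, "A"), (2020, "B"), (2020, "B"), (2020, "B"),
]
strength = topic_strength(records)
```

With this reading, the strengths of all subjects in a given year sum to 1, which makes subjects directly comparable within a year.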

(2) Stability of technical subject

The stability of a technical theme describes the development and fluctuation of a technical theme over a period of time. If the development of the technical theme is slow or the fluctuation is small, the stability is relatively good; if the technical theme develops violently or fluctuates greatly, the stability is poor. The specific definition is as follows:

In the formula, the terms are, in order: the stability of the k-th technical subject in year t; and the average of the number of documents contained in all time periods before year t under the k-th technical subject.

(3) Emerging degree of technical subject

The emerging degree of a technical subject describes whether the subject contains new research content. The more recent the publication years of the documents in a technical subject, the higher its emerging degree, and the more recently the subject has been researched and applied.

In the formula, the term defined is the number of documents contained in all time periods before year t.
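The quantities used by the stability and emerging degree indexes can be computed as sketched below. The exact formulas are not reproduced in the text, so several choices here are assumptions: "before year t" is read as strictly earlier years, the standard deviation is the population form, the emerging degree is read as the sum of publication years divided by the number of documents (the mean publication year), and that denominator is taken as the subject's own documents. How the mean and standard deviation combine into a single stability value is left open.

```python
import statistics

def history_stats(counts_by_year, year):
    """Mean and population standard deviation of yearly document counts before `year`."""
    history = [c for y, c in sorted(counts_by_year.items()) if y < year]
    return statistics.mean(history), statistics.pstdev(history)

def emerging_degree(publication_years, year):
    """Mean publication year of the subject's documents published before `year` (assumed form)."""
    earlier = [y for y in publication_years if y < year]
    return sum(earlier) / len(earlier)

# Made-up yearly document counts and publication years for one subject.
counts = {2017: 10, 2018: 12, 2019: 14, 2020: 30}
mean_count, std_count = history_stats(counts, 2020)
years = [2017, 2018, 2018, 2019, 2019, 2019]
emerging = emerging_degree(years, 2020)
```

A subject whose documents cluster in recent years gets a larger mean publication year, which matches the description that newer documents imply a higher emerging degree.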

(4) Frontier of technical subject

The frontier of the technical theme is contribution + influence.

The contribution degree of a technical theme is the theme's relative share of the papers that make up the research front. It combines the share of the core papers contained in the theme among all core papers in the front with the share of its citing papers among all citing papers in the front. The specific calculation method is as follows: contribution degree = core paper share + citing paper share; core paper share = number of core papers / total number of core papers in the front; citing paper share = number of citing papers / total number of citing papers in the front.

The influence degree of a technical theme is the relative share of the citation frequency of the papers that the theme contributes to the research front. It combines the share of the citation frequency of the core papers contained in the theme in the citation frequency of all core papers in the front with the share of the citation frequency of its citing papers in the citation frequency of all citing papers in the front. The specific calculation method is as follows: influence degree = core paper citation frequency share + citing paper citation frequency share; core paper citation frequency share = citation frequency of core papers / citation frequency of all core papers in the front; citing paper citation frequency share = citation frequency of citing papers / citation frequency of all citing papers in the front.
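The word formulas for the frontier degree can be sketched directly in code. The dictionary keys below are hypothetical names for the two paper categories and their citation frequencies; `theme` holds the counts for one technical theme and `front` holds the totals for the whole research front.

```python
def share(part, whole):
    """Relative share part / whole; 0.0 when the front-wide total is zero."""
    return part / whole if whole else 0.0

def frontier_degree(theme, front):
    """Frontier degree = contribution degree + influence degree.

    `theme` and `front` are dicts with the hypothetical keys
    core_papers, citing_papers, core_citations, citing_citations.
    """
    # Contribution degree: shares of paper counts in the two categories.
    contribution = (share(theme["core_papers"], front["core_papers"])
                    + share(theme["citing_papers"], front["citing_papers"]))
    # Influence degree: shares of citation frequency in the two categories.
    influence = (share(theme["core_citations"], front["core_citations"])
                 + share(theme["citing_citations"], front["citing_citations"]))
    return contribution + influence
```

Because each of the four shares lies between 0 and 1, the frontier degree computed this way lies between 0 and 4.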

Step S4, predicting the development trend of the technical themes by adopting a Logistic model, and carrying out comparative analysis among the technical themes.

Specifically, the development trend of a technical theme and the law of its evolution are predicted by observing how the four indexes (strength, stability, emerging degree and frontier degree) change along the time axis, so that the future development of the technology can be grasped at the macro level to a certain extent and guidance is provided for current technical innovation. To display the future development of the technical themes, a Logistic model is adopted to predict their development trends. The Logistic model is an S-shaped curve that can track the nonlinear trend of a time series, and the slopes of its different stages can fit more accurately the changes in the development rate of a technology during its germination, growth, maturity and decline stages. Because of its simple form and good performance, the Logistic model is widely used in research on technical trend prediction. The specific calculation formula is as follows:

In the formula, y_t and t represent the technical theme index variable and the time variable, respectively, where the index variable is one of the four measurement indexes of the technical theme (strength, stability, emerging degree or frontier degree); b is the maximum saturation value that the curve can reach; a represents the slope of the curve; and I denotes the time node at which the curve changes between concave and convex (the inflection point), I > 0.
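A minimal sketch of such a curve, assuming the common three-parameter form y_t = b / (1 + e^(-a(t - I))), which matches the roles given to b, a and I above but is an assumption rather than a formula taken from the original. The fitting routine is also a simplification chosen for illustration: it treats b as known and estimates a and I by ordinary least squares on the linearised form ln(b / y_t - 1) = -a·t + a·I. All function and variable names are hypothetical; `inflection` stands for I.

```python
import math

def logistic(t, b, a, inflection):
    """Assumed logistic form: saturation b, slope a, inflection time `inflection`."""
    return b / (1.0 + math.exp(-a * (t - inflection)))

def fit_logistic_given_b(ts, ys, b):
    """Estimate (a, inflection) for a fixed saturation value b.

    Uses least squares on z = ln(b / y - 1) = -a * t + a * inflection,
    so every observation must satisfy 0 < y < b.
    """
    zs = [math.log(b / y - 1.0) for y in ys]
    n = len(ts)
    t_mean = sum(ts) / n
    z_mean = sum(zs) / n
    sxx = sum((t - t_mean) ** 2 for t in ts)
    sxz = sum((t - t_mean) * (z - z_mean) for t, z in zip(ts, zs))
    slope = sxz / sxx                      # equals -a
    intercept = z_mean - slope * t_mean    # equals a * inflection
    a = -slope
    return a, intercept / a

# Usage: recover the parameters from noise-free yearly index values.
years = list(range(1, 10))
values = [logistic(t, 1.0, 0.8, 5.0) for t in years]
a_hat, inflection_hat = fit_logistic_given_b(years, values, b=1.0)
```

At t = I the assumed curve reaches half of its saturation value b, which is why I marks the transition from accelerating to decelerating growth. In practice b is usually unknown as well, in which case a nonlinear least-squares routine would be the more natural choice.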

Through the above steps, technical themes are extracted with the improved LDA topic model, and, based on the designed technical theme index measurement formulas, the Logistic model is used to fit curves to the technical theme indexes. This forms a general method for technical theme multi-index calculation and trend prediction that can calculate the technical theme indexes in a large amount of paper text data uniformly and accurately. The method is used for technical theme identification and trend prediction in a specific research field, so as to reveal research hotspots and development trends more comprehensively and accurately, provide a new perspective for theoretical research on technical theme identification and trend prediction, and provide methodological reference and decision support for the strategic planning of scientific researchers, enterprises and the like.

There are many ways to improve technical theme identification and trend prediction based on the LDA topic model and the Logistic model. Whatever the specific improvement measures are, the problems in the prior art can be solved and the corresponding effect obtained as long as the measures allow the technical theme indexes in the paper text data to be calculated more accurately and the trend of the technical themes to be predicted.

In order to implement the foregoing embodiments, the present embodiment further provides a technical subject multi-index calculating and trend predicting apparatus 10. As shown in Fig. 3, the apparatus 10 includes: a preprocessing module 100, an extraction module 200, a statistics module 300, and a prediction module 400.

The preprocessing module 100 is configured to acquire a plurality of paper text data and perform a preprocessing operation on key fields of the plurality of paper text data; wherein the key fields include: paper title, abstract and keywords;

the extraction module 200 is configured to perform multi-feature weight calculation according to the key fields after the preprocessing operation, and perform technical topic extraction on the weighted multi-feature fields by using an LDA topic model;

the statistics module 300 is configured to design a plurality of index measurement methods for measuring the advancement of the technical theme based on the extracted technical theme, and to count index values of the technical theme indexes; wherein the technical theme indexes comprise: strength, stability, emerging degree and frontier degree;

and the prediction module 400 is used for predicting the development trend of the technical subjects by adopting a Logistic model based on the index values and performing comparative analysis among the technical subjects.

It should be noted that the foregoing explanation of the embodiment of the method for calculating multiple indexes and predicting a trend of a technical subject is also applicable to the device for calculating multiple indexes and predicting a trend of a technical subject of the embodiment, and is not repeated herein.

Furthermore, the terms "first" and "second" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include at least one such feature. In the description of the present invention, "a plurality" means at least two, e.g., two, three, etc., unless specifically limited otherwise.

In the description herein, references to the terms "one embodiment," "some embodiments," "an example," "a specific example," or "some examples," etc., mean that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. In this specification, the schematic representations of the terms used above do not necessarily refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. In addition, various embodiments or examples described in this specification, and features of different embodiments or examples, can be combined by one skilled in the art provided they do not contradict each other.

Although embodiments of the present invention have been shown and described above, it is understood that the above embodiments are exemplary and should not be construed as limiting the present invention, and that variations, modifications, substitutions and alterations can be made to the above embodiments by those of ordinary skill in the art within the scope of the present invention.

Claims

1. A technical subject multi-index calculation and trend prediction method is characterized by comprising the following steps:

2. The method of claim 1, wherein the preprocessing comprises:

performing multi-type plain-text format conversion, punctuation mark removal, number removal, word segmentation and stop word filtering on the acquired paper text data.

3. The method of claim 1, further comprising:

updating the hyper-parameters of the LDA topic model by adopting Gibbs sampling, and determining the optimal number of technical subjects by adopting the perplexity.

4. The method of claim 1, wherein the multi-feature weight calculation comprises:

measuring the weight of the multi-feature fields by adopting the similarity degree between the multi-feature fields, and calculating the cosine similarity between the multi-feature fields; wherein the relative importance degree calculation formula of the multi-feature field is as follows:

wherein the weight of a feature field is proportional to the relevance among the multi-feature fields; n_m is the number of terms contained in document m; μ_mi is the average number of occurrences, taken over all terms, in field T_mi of document m; μ_mj is calculated in the same way; and i, j = 1, 2, 3.

5. The technical subject multi-index calculation and trend prediction method of claim 1, wherein the technical subject extraction of the weighted multi-feature field by using the LDA subject model comprises:

selecting a document d_m according to a prior probability P(d_m);

sampling from the Dirichlet distribution α to generate the topic distribution θ_m of the document d_m;

sampling from the topic distribution θ_m to generate the topic z_{m,n} of the n-th word of the document d_m, and selecting the topic with the maximum probability value;

sampling from the Dirichlet distribution β to generate the term distribution corresponding to the topic z_{m,n};

sampling from the term distribution to generate the term w_{m,n}, and selecting the term with the maximum probability value;

iterating the above process until the text set is generated.

6. The method of claim 1, wherein the counting the index values of the technical subject indexes comprises:

wherein,

represents the strength of the k-th technical subject in year t;

wherein,

represents the stability of the k-th technical subject in year t;

wherein,

represents the emerging degree of the k-th technical subject in year t;

7. The technical subject multi-index calculation and trend prediction method according to claim 1, wherein predicting the technical subject development trend by using a Logistic model comprises:

a represents the slope of the curve; and

I denotes the time node at which the curve changes between concave and convex (the inflection point), I > 0.

8. The technical subject multi-index calculation and trend prediction method of claim 3, wherein the updating the hyper-parameters of the LDA subject model by Gibbs sampling comprises:

wherein,

9. The technical subject multi-index calculation and trend prediction method according to claim 3, wherein determining the optimal number of technical subjects by adopting the perplexity comprises:

。

10. A technical subject multi-index calculation and trend prediction device, characterized by comprising the following components: