CN112463928A

CN112463928A - Technical list generation method and system for field evaluation prediction

Info

Publication number: CN112463928A
Application number: CN202011434352.XA
Authority: CN
Inventors: 毛彬; 罗威; 谭玉珊; 罗准辰; 武帅; 钱旭; 田昌海; 叶宇铭; 宋宇; 胡明昊
Original assignee: Military Science Information Research Center Of Military Academy Of Chinese Pla
Current assignee: MILITARY SCIENCE INFORMATION RESEARCH CENTER OF MILITARY ACADEMY OF THE CHINESE PLA
Priority date: 2020-12-10
Filing date: 2020-12-10
Publication date: 2021-03-09

Abstract

The invention discloses a domain-oriented method and system for generating a technology list for evaluation and prediction. The method comprises: extracting and identifying technical nouns in massive scientific and technical information texts to obtain a mapping corpus; classifying the mapping corpus with a pre-trained domain classification model to obtain a domain mapping corpus; performing word frequency aggregation statistics on the technical nouns in the domain mapping corpus and taking the top-ranked technical nouns to obtain a domain high-frequency technical noun word list; calculating an emergence index and a maturity index for each technical noun in the word list to obtain a domain initial selection list; ranking the domain initial selection list with a pre-trained ranking model and taking the top-ranked entries to obtain a domain initial list; completing the information of the domain initial list based on an open source knowledge base to obtain a domain detailed list; and inputting the domain detailed list into a pre-trained technology binary classification model for technology judgment and filtering the result to obtain a domain technology list.

Description

Technical list generation method and system for field evaluation prediction

Technical Field

The invention relates to the fields of computational linguistics and natural language processing, and in particular to a domain-oriented method and system for generating a technology list for evaluation and prediction.

Background

Emerging technologies are a source of technological innovation. In national defense, competition among major powers grows fiercer by the day and opportunities are fleeting, so the development of a new technology can strongly affect the strategic balance of attack and defense and overturn established military technical thinking. Because emerging technologies carry high market and technological uncertainty, identifying them early is difficult. Traditional early identification of emerging technologies relies mainly on expert judgment: it requires mobilizing many experts for investigation, involves a heavy workload, can usually cover only a few technical fields, is limited by the experts' professional literacy, insight and bias, and is hard to assess for accuracy. Fully mining the value of scientific and technical information big data, finding clues to emerging technologies in time and evaluating their characteristics scientifically can effectively improve the efficiency of emerging technology identification and better combine human and machine effort in seeing directions and paths clearly, which is of practical significance.

Learning to rank is a supervised learning method. For each given query-document pair, the corresponding features are extracted, and the document set and its true order under the given query are acquired; a ranking model is then trained with a learning-to-rank algorithm so that the output order is as close as possible to the true order. SVMrank is a pairwise learning-to-rank algorithm: it converts the ranking problem into a classification problem and solves it with an SVM classification model. As a pairwise method, it trains on pairs of documents and considers the relative relevance between two documents under a given query.
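The pairwise idea described above can be illustrated with a minimal sketch. It is not SVMrank itself: it only shows, assuming each document is a plain feature list with a numeric relevance grade, how the documents of one query could be turned into difference vectors with +1/-1 labels on which a binary classifier such as an SVM might then be trained. The function name `pairwise_examples` and the toy data are hypothetical.

```python
from itertools import combinations


def pairwise_examples(docs):
    """Turn one query's graded documents into pairwise classification examples.

    docs: list of (feature_vector, relevance) tuples for a single query.
    Returns (difference_vector, label) tuples, where label is +1 when the
    first document of the pair should rank higher and -1 otherwise.
    """
    examples = []
    for (xa, ya), (xb, yb) in combinations(docs, 2):
        if ya == yb:
            continue  # equal relevance carries no ordering information
        diff = [a - b for a, b in zip(xa, xb)]
        examples.append((diff, 1 if ya > yb else -1))
    return examples


# Toy data (hypothetical): three documents retrieved for the same query.
docs = [([3.0, 1.0], 2), ([1.0, 1.0], 1), ([0.0, 2.0], 1)]
pairs = pairwise_examples(docs)
```

Only pairs with different relevance grades produce training examples, which is why the two equally graded documents in the toy data contribute no pair between themselves.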

The TagMe algorithm works as follows: an anchor data set is constructed from the link relations between terms in Wikipedia, and the relatedness between terms is calculated from their contextual co-occurrence; an anchor candidate set is constructed by anchor analysis of the input text, the overall relatedness of the candidate linked entities is calculated, and the candidate entity set with the maximum overall relatedness is selected as the final entity linking result.

fastText is a word vector computation and fast text classification tool released by Facebook in 2016. The word vectors and n-gram vectors of the whole document are summed and averaged to obtain a document vector, which is then used for softmax multi-class classification.
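The averaging-then-softmax idea can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the fastText implementation: the embedding table, the class weight matrix and the helper names (`word_bigrams`, `document_vector`, `softmax`) are all hypothetical, and real fastText hashes n-grams and learns the vectors during training.

```python
import math


def word_bigrams(words):
    """Word-level bigrams joined with an underscore (illustrative n-grams)."""
    return [f"{a}_{b}" for a, b in zip(words, words[1:])]


def document_vector(tokens, embeddings, dim):
    """Average the vectors of all tokens that have an embedding."""
    total = [0.0] * dim
    count = 0
    for tok in tokens:
        vec = embeddings.get(tok)
        if vec is None:
            continue  # unknown tokens are skipped in this sketch
        total = [t + v for t, v in zip(total, vec)]
        count += 1
    return [t / count for t in total] if count else total


def softmax(scores):
    """Numerically stable softmax over a list of class scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]


# Toy values (hypothetical): 2-dimensional embeddings and two classes.
words = ["edge", "computing"]
tokens = words + word_bigrams(words)
embeddings = {
    "edge": [1.0, 0.0],
    "computing": [0.0, 1.0],
    "edge_computing": [2.0, 2.0],
}
doc = document_vector(tokens, embeddings, dim=2)
class_weights = [[1.0, 1.0], [-1.0, -1.0]]  # one row per class
scores = [sum(w * d for w, d in zip(row, doc)) for row in class_weights]
probs = softmax(scores)
```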

Bi-LSTM-CRF is a natural language sequence labeling algorithm that can be used for entity recognition. It extends the LSTM (long short-term memory) model to a bidirectional LSTM and combines it with a CRF (conditional random field) to further resolve ambiguity in sequence labeling.

The Emergence Score (EScore) algorithm aggregates statistics for a given term over time by defining two periods and a set of admission conditions. It obtains three rates of change, the Active Period Trend, the Recent Trend and the Mid-Year to Last-Year Slope, and from them the emergence value of the term. The two periods are the base period, usually defined as the first 3 years, and the active period, defined as the last 7 years. The admission conditions pre-screen the candidate term set and comprise: 1) the term occurs over a time span of at least 3 years; 2) the term occurs at least 7 times; 3) the ratio of active-period frequency to base-period frequency is at least 2:1; 4) the base-period frequency does not exceed 15% of the total frequency. The calculation is as follows:

$$EScore = 2 \times ActivePeriodTrend + (RecentTrend + MidYearToLastYearSlope)$$

wherein $RecordedCount_i$ represents the number of records in year i, ActivePeriodTrend represents the active period trend, RecentTrend represents the recent trend, and MidYearToLastYearSlope represents the medium-to-recent rate of change.

Disclosure of Invention

The invention aims to overcome the defects of the prior art by providing a domain-oriented method and system for generating a technology list for evaluation and prediction. The list is generated automatically in a data-driven way, assists researchers in further technology evaluation and prediction, and supports domain technology planning and decision-making. The method is suitable for automatically generating technology lists for domain technology evaluation and prediction.

To achieve the above object, the invention provides a domain-oriented method for generating a technology list for evaluation and prediction, the method comprising:

step 1) extracting and identifying technical nouns in massive scientific and technical information texts to obtain a mapping corpus;

step 2) classifying the mapping corpus with a pre-trained domain classification model to obtain a domain mapping corpus;

step 3) performing word frequency aggregation statistics on the technical nouns in the domain mapping corpus, and taking the top-ranked technical nouns to obtain a domain high-frequency technical noun word list;

step 4) calculating an emergence index and a maturity index for each technical noun in the domain high-frequency technical noun word list to obtain a domain initial selection list;

step 5) ranking the domain initial selection list with a pre-trained ranking model, and taking the top-ranked entries to obtain a domain initial list;

step 6) completing the information of the domain initial list based on an open source knowledge base to obtain a domain detailed list;

step 7) inputting the domain detailed list into a pre-trained technology binary classification model for technology judgment, and further filtering with a rule matching method to obtain a domain technology list.

As an improvement of the above method, step 1) specifically comprises:

step 1-1) performing noun phrase recognition on the massive scientific and technical information texts, then performing entity linking with the TagMe algorithm, identifying the normalized entity word set in the texts, and mapping it to the texts;

step 1-2) matching the massive scientific and technical information texts against a pre-accumulated technical word list to extract a technical word set, and mapping it to the texts;

step 1-3) identifying technical nouns in the scientific and technical texts with a pre-trained technical noun recognition model to obtain a technical noun set, and mapping it to the texts;

step 1-4) assigning different word frequency weights to the entity word set, the technical word set and the technical noun set according to their credibility to obtain the mapping corpus.

As an improvement of the above method, the step 4) specifically includes:

step 4-1) for each technical noun w in the domain high-frequency technical noun word list, counting its annual word frequencies over the last 10 years in the domain mapping corpus, $Count_w = [c_1, c_2, \ldots, c_i, \ldots, c_{10}]$, where $c_i$ is the word frequency in the i-th year; the sequence $Exist_w$ indicating whether the technical noun appears in each year is calculated by the following formula:

$$Exist_w = [e_1, e_2, \ldots, e_i, \ldots, e_{10}], \quad e_i = 1 \text{ if } c_i > 0 \text{ else } 0$$

wherein $e_i$ represents whether the technical noun appears in the i-th year, being 1 if it appears and 0 otherwise;

the total word frequency $Count_{w\_base}$ of the technical noun in the base period is calculated by the following formula:

$$Count_{w\_base} = \sum_{i=1}^{3} c_i$$

the total word frequency $Count_{w\_active}$ of the technical noun in the active period is calculated by the following formula:

$$Count_{w\_active} = \sum_{i=4}^{10} c_i$$

when $\sum Exist_w > 3$, $\sum Count_w > 7$, $Count_{w\_active}/Count_{w\_base} > 2$ and $Count_{w\_base}/\sum Count_w < 0.15$, the emergence value $Escore_w$ of the technical noun is calculated by the following formula; otherwise, the emergence value $Escore_w$ of the technical noun is negative infinity:

$$Escore_w = 2 \times APT + (RT + MYS)$$

wherein APT is the active period trend of the technical noun, RT is its recent trend, and MYS is its medium-to-recent rate of change; when $c_7 - c_4 = 0$, $c_7 - c_4$ is set to 1 in calculating MYS;

step 4-2) the sequence $Rate_w$ indicating whether the technical noun grows year over year is calculated by the following formula:

$$Rate_w = [r_1, r_2, \ldots, r_i, \ldots, r_9], \quad r_i = 1 \text{ if } c_{i+1} - c_i > 0 \text{ else } -1$$

wherein $r_i$ represents whether the word frequency grows from the i-th year to the (i+1)-th year, being 1 if it grows and -1 otherwise;

the maturity value $Maturity_w$ of the technical noun is calculated by the following formulas:

$$Maturity_w = \left[5 + \frac{\sum_i r_i}{3}\right] \in \{6, 7, 8\}, \quad \text{if } \sum_i r_i > 2$$

$$Maturity_w = \left[10 - \frac{\delta}{\sum_i c_i}\right] \in \{9, 10\}, \quad \text{if } \sum_i r_i \le 2,\ \sum_i c_i \ge \delta$$

$$Maturity_w = \left[0.5 + \frac{5 \sum_i c_i}{\delta}\right] \in \{1, 2, \ldots, 5\}, \quad \text{if } \sum_i r_i \le 2,\ \sum_i c_i < \delta$$

wherein δ is an empirical threshold whose value is one percent of the total number of texts in the domain over the last decade;

step 4-3) combining the emergence value $Escore_w$ and the maturity value $Maturity_w$ of the technical noun with further index values, namely the total word frequency, the word frequency of the last three years, the ratio of the word frequency of the last three years to the total word frequency, and the ratio of the total word frequency to the domain word frequency, to obtain the indicator $Indicator_w = [Escore_w, Maturity_w, AllCount_w, \ldots]$ of the technical noun w, and thereby the domain initial selection list corresponding to the domain high-frequency technical noun word list.

As an improvement of the above method, the step 6) is specifically:

based on the open source knowledge base, matching the technical nouns of the domain initial list against the semantic information of the knowledge base, extracting the Chinese name, English name, explanation and category information, and completing the technical nouns with this information to obtain the domain detailed list.

As an improvement of the above method, the method further comprises a step of training the technical noun recognition model, specifically:

extracting, from the massive scientific and technical information texts, the set of sentences that contain technical words in the pre-accumulated technical word list;

labeling the sentence set with the sequence tags [B-tech, I-tech, O], where B-tech marks the first character of a technical word, I-tech marks the other characters of a technical word, and O marks characters or punctuation outside technical words;

training with the Bi-LSTM-CRF algorithm to obtain the technical noun recognition model.
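The labeling scheme above can be sketched as a small dictionary-matching routine that assigns one tag per character. This is only an illustration under stated assumptions: the function name `bio_tag`, the longest-term-first matching order and the English sample sentence are hypothetical choices, and the training of the Bi-LSTM-CRF model itself is not shown.

```python
def bio_tag(sentence, tech_terms):
    """Label each character of `sentence` with B-tech, I-tech or O."""
    labels = ["O"] * len(sentence)
    # Longer terms are tried first so that a long term is not split by a
    # shorter term contained in it (an assumption of this sketch).
    for term in sorted(tech_terms, key=len, reverse=True):
        if not term:
            continue  # ignore empty entries in the word list
        start = sentence.find(term)
        while start != -1:
            end = start + len(term)
            # Only tag spans that no earlier (longer) term has claimed.
            if all(lab == "O" for lab in labels[start:end]):
                labels[start] = "B-tech"
                for i in range(start + 1, end):
                    labels[i] = "I-tech"
            start = sentence.find(term, start + 1)
    return labels


labels = bio_tag("use radar", ["radar"])
```

Each (character, label) pair produced this way would form one row of the sequence-labeling training data.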

As an improvement of the above method, the method further comprises a step of training the domain classification model, specifically:

randomly sampling the existing mapping corpus;

having domain experts extract keywords according to the characteristics of the domain and retrieving the mapping corpus with them, which together with the random sample yields a training set and thereby a domain training corpus;

having domain experts manually label whether each information text in the domain training corpus is a domain information text, thereby constructing a classification training corpus;

training with the fastText classification algorithm to obtain the domain classification model.
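As an illustration of how the expert labels could be serialized for training, the sketch below writes each labeled text as one line in the `__label__<name> text` form that fastText's supervised mode reads. The label names `domain` and `other` and the helper `to_fasttext_line` are assumptions made for this example; the source does not specify them.

```python
def to_fasttext_line(text, is_domain):
    """Format one labeled text as a fastText supervised training line."""
    # Label names are hypothetical; fastText only requires the __label__ prefix.
    label = "__label__domain" if is_domain else "__label__other"
    # One example per line is expected, so whitespace runs and newlines
    # inside the text are collapsed to single spaces.
    return f"{label} {' '.join(text.split())}"


samples = [
    ("radar  signal\nprocessing", True),
    ("quarterly sales report", False),
]
lines = [to_fasttext_line(text, flag) for text, flag in samples]
```

Lines in this form can be written to a file and passed to a fastText supervised trainer, for example `fasttext.train_supervised` in the official Python bindings.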

As an improvement of the above method, the method further comprises a step of training the ranking model, specifically:

randomly sampling the existing domain initial selection list to obtain a ranking training set;

having domain experts score the ranking training set by relevance and importance, replacing emergence values of negative infinity with an empirical negative value, obtaining the ranking result by weighted averaging, and constructing a ranking training corpus;

training with the SVMrank ranking algorithm to obtain the ranking model.

As an improvement of the above method, the method further comprises a step of training the technology binary classification model, specifically:

constructing a noun list containing some technical nouns based on the open source knowledge base;

labeling each record of the existing detailed list according to whether it is a technical noun, to serve as a training corpus;

training with the fastText classification algorithm to obtain the technology binary classification model.

A domain-oriented system for generating a technology list for evaluation and prediction, the system comprising: a trained technical noun recognition model, domain classification model, ranking model and technology binary classification model, together with a technical noun recognition and extraction module, a domain modeling module, an index evaluation module, a ranking module, a semantic completion module and a secondary cleaning module; wherein,

the technical noun recognition and extraction module is used for extracting and identifying technical nouns in massive scientific and technical information texts to obtain a mapping corpus;

the domain modeling module is used for classifying the mapping corpus with the pre-trained domain classification model to obtain a domain mapping corpus;

the index evaluation module is used for performing word frequency aggregation statistics on the technical nouns in the domain mapping corpus, taking the top-ranked technical nouns to obtain a domain high-frequency technical noun word list, and calculating an emergence index and a maturity index for each technical noun in the word list to obtain a domain initial selection list;

the ranking module is used for ranking the domain initial selection list with the pre-trained ranking model and taking the top-ranked entries to obtain a domain initial list;

the semantic completion module is used for completing the information of the domain initial list based on the open source knowledge base to obtain a domain detailed list;

the secondary cleaning module is used for inputting the domain detailed list into the pre-trained technology binary classification model for technology judgment, and further filtering with a rule matching method to obtain a domain technology list.

Compared with the prior art, the invention has the following advantages:

1. the invention provides a data-driven framework for automatic technology list generation, oriented to domain technology evaluation and prediction tasks;

2. the invention raises the degree of machine intelligence in domain technology evaluation and prediction, exploits the data advantage of big data, and reduces the heavy reliance of traditional technology evaluation and prediction on experts;

3. the invention provides a general processing flow that is readily extensible and compatible with upgrades and optimization, and each processing step can be strengthened with more advanced optimization strategies;

4. compared with a traditional expert-predicted list, the technology list generated by the invention carries more objective index evidence and richer semantic information, which improves the scientific rigor and objectivity of the list.

Drawings

Fig. 1 is a flowchart of a domain-oriented evaluation prediction technique list generation method according to embodiment 1 of the present invention;

Fig. 2 is a block diagram of a domain-oriented evaluation prediction technique list generation system according to embodiment 2 of the present invention.

Detailed Description

The technical route of the method of the invention is as follows:

Step 1) extracting and identifying technical nouns in massive scientific and technical information texts, and constructing a mapping corpus D_all;

Step 2) performing domain modeling on the mapping corpus D_all, and constructing a domain mapping corpus D_domain;

Step 3) performing word frequency aggregation statistics on the technical nouns in the domain mapping corpus D_domain, and taking the top N to obtain the domain high-frequency technical noun word list List_tech;

Step 4) calculating indexes such as "emergence" and "maturity" for each technical noun in List_tech to obtain the domain initial selection list List_original;

Step 5) randomly sampling List_original to obtain List_train, obtaining Order_train through manual labeling by domain experts, and training on List_train and Order_train with the SVMrank algorithm to obtain the ranking model Model_SVMrank;

Step 6) ranking List_original with the ranking model Model_SVMrank, and taking the top K as the domain initial list List_init;

Step 7) completing the information of List_init based on the open source knowledge base to obtain the domain detailed list List_detail;

Step 8) constructing, based on the open source knowledge base, a detailed list FList_detail containing positive and negative examples of technologies, and training on FList_detail with the fastText classification algorithm to obtain the technology binary classification model Model_classify;

Step 9) performing technology judgment on the domain detailed list List_detail with Model_classify, and further filtering with a rule matching method to obtain the domain technology list List_tech-domain.

The step 1) specifically comprises the following steps:

step 1-1) after noun phrase recognition on the massive scientific and technical information texts, performing entity linking with the TagMe algorithm, identifying the set W_entity of all normalized entity words in the texts, and mapping it to the texts;

step 1-2) matching the massive scientific and technical information texts against the accumulated technical word list to extract the technical words they contain, obtaining the technical word set W_tech, and mapping it to the texts;

step 1-3) matching the massive scientific and technical information texts against the existing technical word list to obtain the set S_tech of sentences containing technical words, then labeling S_tech with sequence tags such as [B-tech, I-tech, O], and training the technical noun recognition model Model_detect with the Bi-LSTM-CRF algorithm;

step 1-4) identifying technical nouns in the scientific and technical texts with Model_detect to obtain the technical noun set W_tech-norm, and mapping it to the texts;

step 1-5) assigning different word frequency weights to the entity word set W_entity, the technical word set W_tech and the technical noun set W_tech-norm according to their credibility to obtain the mapping corpus D_all;

The step 2) specifically comprises the following steps:

step 2-1) randomly sampling the mapping corpus D_all to obtain training set 1, D_train1; retrieving D_all with keywords extracted by domain experts according to the characteristics of the domain to obtain training set 2, D_train2; and thereby constructing the domain training corpus D_train;

step 2-2) obtaining the labels tag_train through manual labeling by domain experts, which mark whether each information text in the domain training corpus is a domain information text, constructing a classification training corpus, and training on D_train and tag_train with the fastText classification algorithm to obtain the domain classification model Model_domain-detect;

step 2-3) classifying the scientific and technical information texts with Model_domain-detect to obtain the domain mapping corpus D_domain;

The step 4) specifically comprises the following steps:

step 4-1) for each word w in the domain high-frequency technical noun word list List_tech, counting its annual word frequencies over the last 10 years in the domain mapping corpus D_domain to obtain $Count_w = [c_1, c_2, \ldots, c_{10}]$; the emergence value $Escore_w$ of the word is obtained by the following calculation:

$$Exist_w = [e_1, e_2, \ldots, e_{10}], \quad e_i = 1 \text{ if } c_i > 0 \text{ else } 0$$

$$Count_{w\_base} = \sum_{i=1}^{3} c_i$$

$$Count_{w\_active} = \sum_{i=4}^{10} c_i$$

If $\sum Exist_w > 3$, $\sum Count_w > 7$, $Count_{w\_active}/Count_{w\_base} > 2$ and $Count_{w\_base}/\sum Count_w < 0.15$, the emergence value is calculated by the following formula; otherwise, the emergence value of the word is negative infinity:

$$Escore_w = 2 \times APT + (RT + MYS)$$

wherein $Exist_w$ is the sequence indicating whether the word appears in each year, $e_i$ represents whether the word appears in the i-th year, $c_i$ represents the number of records in which the word appears in the i-th year, $Count_{w\_base}$ represents the total word frequency of the word in the base period, $Count_{w\_active}$ represents the total word frequency of the word in the active period, APT represents the active period trend of the word, RT represents the recent trend of the word, and MYS represents the medium-to-recent rate of change of the word. To make the emergence calculation generally applicable, two special treatments are needed: a zero word frequency is replaced with a minimal value so that division remains well-defined, and a zero value of $c_7 - c_4$ is replaced with 1;
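The admission test and the final combination of step 4-1) can be sketched as follows. Several points are assumptions of this sketch rather than statements of the source: the three trend terms APT, RT and MYS are passed in as precomputed numbers because their formulas are not reproduced in this text; the fourth admission condition is taken as the base-period share of the total frequency being below 15%, following the admission conditions listed in the Background; and the constant `EPS` standing in for the "minimal value" that replaces a zero base frequency is a hypothetical choice.

```python
import math

EPS = 1e-9  # assumed minimal value replacing a zero base-period frequency


def emergence_score(counts, apt, rt, mys):
    """Return Escore_w for ten yearly counts c1..c10, or -inf if not admitted.

    apt, rt and mys are the precomputed trend terms (not derived here).
    """
    if len(counts) != 10:
        raise ValueError("expected ten yearly counts")
    exist = [1 if c > 0 else 0 for c in counts]
    base = sum(counts[:3])    # base period: years 1-3
    active = sum(counts[3:])  # active period: years 4-10
    total = base + active
    admitted = (
        sum(exist) > 3
        and total > 7
        and active / max(base, EPS) > 2
        # Assumption: base-period share below 15% of total frequency.
        and base / total < 0.15
    )
    return 2 * apt + (rt + mys) if admitted else -math.inf


rising = emergence_score([0, 0, 1, 2, 3, 4, 5, 6, 7, 8], apt=1.0, rt=0.5, mys=0.25)
fading = emergence_score([5, 5, 5, 1, 1, 1, 1, 1, 1, 1], apt=1.0, rt=0.5, mys=0.25)
```

The first toy series passes all four conditions, while the second fails the active-to-base ratio and therefore receives negative infinity.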

step 4-2) for each word w in the domain high-frequency technical noun word list List_tech, counting its annual word frequencies over the last 10 years in the domain mapping corpus D_domain to obtain $Count_w = [c_1, c_2, \ldots, c_{10}]$; the maturity value $Maturity_w$ of the word is obtained by the following calculation:

$$Rate_w = [r_1, r_2, \ldots, r_9], \quad r_i = 1 \text{ if } c_{i+1} - c_i > 0 \text{ else } -1$$

$$Maturity_w = \left[5 + \frac{\sum_i r_i}{3}\right] \in \{6, 7, 8\}, \quad \text{if } \sum_i r_i > 2$$

$$Maturity_w = \left[10 - \frac{\delta}{\sum_i c_i}\right] \in \{9, 10\}, \quad \text{if } \sum_i r_i \le 2,\ \sum_i c_i \ge \delta$$

$$Maturity_w = \left[0.5 + \frac{5 \sum_i c_i}{\delta}\right] \in \{1, 2, \ldots, 5\}, \quad \text{if } \sum_i r_i \le 2,\ \sum_i c_i < \delta$$

wherein $c_i$ is the word frequency of the word in the i-th year, $Rate_w$ is the sequence indicating whether the word grows year over year, $r_i$ represents whether the word frequency grows from the i-th year to the (i+1)-th year, and δ is an empirical threshold whose suggested value is one hundredth of the total number of texts in the domain over the last decade.
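The three maturity cases can be sketched as below. The square brackets of the formulas are interpreted here as rounding to the nearest integer with halves rounded up, which is an assumption of this sketch: it is the reading under which the three cases produce exactly the value sets {6, 7, 8}, {9, 10} and {1, ..., 5} stated in the formulas. The function names are hypothetical.

```python
import math


def round_half_up(x):
    """Round to the nearest integer, with .5 rounded up (assumed bracket meaning)."""
    return math.floor(x + 0.5)


def maturity(counts, delta):
    """Return Maturity_w for yearly counts c1..c10 and empirical threshold delta."""
    if delta <= 0:
        raise ValueError("delta must be positive")
    # r_i = 1 if the frequency grew from year i to year i+1, else -1.
    rates = [1 if later - earlier > 0 else -1
             for earlier, later in zip(counts, counts[1:])]
    growth = sum(rates)
    total = sum(counts)
    if growth > 2:
        return round_half_up(5 + growth / 3)       # values 6 to 8
    if total >= delta:
        return round_half_up(10 - delta / total)   # values 9 or 10
    return round_half_up(0.5 + 5 * total / delta)  # values 1 to 5


growing = maturity([1, 2, 3, 4, 5, 6, 7, 8, 9, 10], delta=100)
saturated = maturity([10] * 10, delta=50)
sparse = maturity([1] * 10, delta=100)
```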

step 4-3) obtaining the emergence and maturity values of each technical noun from step 4-1) and step 4-2), and adding further index values such as the total word frequency, the word frequency of the last three years, the ratio of the word frequency of the last three years to the total word frequency, and the ratio of the total word frequency to the domain word frequency, to obtain $Indicator_w = [Escore_w, Maturity_w, AllCount_w, \ldots]$ and thereby the domain initial selection list List_original;
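A minimal sketch of assembling the indicator vector of step 4-3) is given below. The order of the entries after the total word frequency is an assumption, since the source elides them with an ellipsis; the function name `indicator` and the parameter `domain_total` (the total word frequency of the domain) are likewise hypothetical.

```python
def indicator(counts, escore, maturity_value, domain_total):
    """Build [Escore, Maturity, AllCount, recent count, recent ratio, domain ratio]."""
    total = sum(counts)        # total word frequency over the ten years
    recent = sum(counts[-3:])  # word frequency of the last three years
    recent_ratio = recent / total if total else 0.0
    domain_ratio = total / domain_total if domain_total else 0.0
    return [escore, maturity_value, total, recent, recent_ratio, domain_ratio]


row = indicator([1] * 10, escore=1.5, maturity_value=3, domain_total=1000)
```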

The step 5) is specifically as follows:

the domain initial selection list List_original is randomly sampled to obtain the training list List_original_train; the ranking Order_train for the training set is obtained through manual labeling by domain experts; and each training data record is constructed in the following format:

Order_w qid:w 1:Escore_w 2:Maturity_w 3:AllCount_w 4: 5: 6: ... # tech_norm

the SVMrank ranking algorithm is then used for training to obtain the ranking model Model_SVMrank.

The step 7) is specifically as follows:

based on the open source knowledge base, the technical nouns of the domain initial list List_init are matched against the rich semantic information of the knowledge base, and information including Chinese names, English names and explanations is extracted to complete the technical words, yielding the domain detailed list List_detail; the open source knowledge base includes Wikipedia, Baidu Encyclopedia, Jane's Defence, various information corpora and the like.
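The completion step can be pictured as a lookup that fills a fixed set of fields for each term. The sketch below assumes a knowledge base already loaded into a dictionary keyed by lower-cased term; that structure, the field names and the function `complete_record` are hypothetical, and real matching against Wikipedia or Baidu Encyclopedia would need entity resolution that is not shown here.

```python
def complete_record(term, knowledge_base):
    """Return a detailed record for `term`, with None for fields not found."""
    entry = knowledge_base.get(term.lower(), {})
    return {
        "term": term,
        "chinese_name": entry.get("chinese_name"),
        "english_name": entry.get("english_name"),
        "explanation": entry.get("explanation"),
        "categories": entry.get("categories", []),
    }


# Toy knowledge base (hypothetical content and layout).
toy_kb = {
    "edge computing": {
        "chinese_name": "边缘计算",
        "english_name": "Edge computing",
        "explanation": "Computing performed near the object or data source.",
        "categories": ["Distributed computing"],
    }
}
found = complete_record("Edge computing", toy_kb)
missing = complete_record("Unknown term", toy_kb)
```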

The step 8) is specifically as follows:

based on the open source knowledge base, a noun list NormList containing some technical nouns is constructed; the method of step 7) is applied to the noun list to obtain the detailed list FList_detail; each record of FList_detail is labeled according to whether it is a technical noun to form a training corpus; and the fastText classification algorithm is used for training to obtain the technology binary classification model Model_classify;

The technical solution of the present invention will be described in detail below with reference to the accompanying drawings and examples.

Example 1

As shown in Fig. 1, embodiment 1 of the invention discloses a domain-oriented method for generating a technology list for evaluation and prediction, which specifically comprises:

step 1) extracting and identifying technical nouns in massive scientific and technical information texts to obtain a mapping corpus, specifically:

step 1-1) training a TagMe entity linking tool on Wikipedia data, performing noun phrase recognition on the massive scientific and technical information texts, then performing entity linking to obtain normalized noun phrase entity words, and mapping them to the scientific and technical texts;

step 1-2) collecting technical nouns from the emerging technology categories in Wikipedia to construct a technical word list, then matching and extracting from the massive scientific and technical information texts the technical words present in the word list, and mapping them to the scientific and technical texts;

step 1-3) extracting from the massive scientific and technical information texts, according to the technical word list, the set of sentences containing technical words, labeling the sentence set with the sequence tags [B-tech, I-tech, O], training a technical noun phrase recognition model with the Bi-LSTM-CRF algorithm, extracting the technical noun phrases in the massive scientific and technical information texts, and mapping them to the scientific and technical texts;

step 1-4) assigning word frequency credibility weights of 0.8, 1 and 0.4 to the noun phrase entity words, technical words and technical noun phrases respectively, to form a unified mapping corpus;
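The credibility weighting of step 1-4) can be sketched as a weighted count. The weights 0.8, 1 and 0.4 are the ones quoted in the embodiment; the source keys (`entity`, `tech_word`, `tech_noun`), the function name and the way mentions are represented as (term, source) pairs are assumptions of this example.

```python
from collections import Counter

# Weights from the embodiment; the key names are hypothetical.
WEIGHTS = {"entity": 0.8, "tech_word": 1.0, "tech_noun": 0.4}


def weighted_term_frequency(mentions):
    """Sum credibility-weighted occurrences for each term in one text.

    mentions: iterable of (term, source) pairs, where source names the
    extraction route that produced the mention.
    """
    freq = Counter()
    for term, source in mentions:
        freq[term] += WEIGHTS[source]
    return freq


mentions = [("radar", "entity"), ("radar", "tech_word"), ("lidar", "tech_noun")]
freq = weighted_term_frequency(mentions)
```

Under these weights a term found by dictionary matching counts fully, while a term proposed only by the recognition model contributes less to the aggregated word frequency.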

step 2) performing domain modeling on the mapping corpus, specifically:

step 2-1) obtaining a mapping corpus training set by random sampling and by domain keyword retrieval sampling of the mapping corpus, having domain experts label the training set with binary domain labels, constructing a classification training corpus of information texts and domain labels, and training with the fastText classification algorithm to obtain a domain classification model;

step 2-2) classifying the scientific and technical information texts with the domain classification model to obtain a domain mapping corpus;

step 3) performing index evaluation on the domain mapping corpus, specifically:

step 3-1) performing word frequency aggregation statistics on the technical nouns of the domain mapping corpus, and taking the top 1000 items as domain high-frequency technical nouns;

step 3-2) counting the word frequency sequence of each high-frequency technical noun over the last ten years in the domain corpus, then calculating its emergence value and maturity, and counting indexes such as the total word frequency, the word frequency of the last three years, the ratio of the word frequency of the last three years to the total word frequency, and the ratio of the total word frequency to the domain word frequency, to form a primary selection list; part of the results are shown in the following table:

step 4) ranking the primary selection list, specifically:

step 4-1) randomly sampling the primary selection list to obtain a ranking training set, having domain experts score the ranking training set by relevance and importance, obtaining the ranking result by weighted averaging, and constructing a ranking training corpus in the following format:

5 qid:1 1:38.13 2:2 3:14.6 4:12 5:0.8219 6:0.34#Phase(waves)

8 qid:2 1:29.87 2:1 3:6.7 4:6 5:0.8955 6:0.0016#Dielectric resonator antenna

the meaning is as follows:

Phase(waves): the emerging degree index is 38.13, the maturity index is 2, word frequency 1 is 14.6, word frequency 2 is 12, proportion 1 is 82.19%, proportion 2 is 0.34%, and the rank after weighting the expert scores is 5;

Dielectric resonator antenna: the emerging degree index is 29.87, the maturity index is 1, word frequency 1 is 6.7, word frequency 2 is 6, proportion 1 is 89.55%, proportion 2 is 0.16%, and the rank after weighting the expert scores is 8;

wherein an emerging degree of negative infinity is replaced by an empirically chosen large negative value (-9999 in the experiment), and the training corpus is trained with the SVMrank ranking algorithm to obtain a ranking model;

step 4-2) ranking the primary selection list with the ranking model, and taking the top 100 as the field initial list;
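As a non-limiting illustration, the training line format of step 4-1) can be sketched as follows. The function name `format_rank_line` and the constant name are hypothetical and are not part of the described method; the sketch only assumes the layout visible in the two sample lines (rank, `qid`, numbered features, then `#` and the noun) and the -9999 replacement reported above.

```python
import math

# Value the description reports using in place of negative infinity.
NEG_INF_PLACEHOLDER = -9999


def format_rank_line(target, qid, features, name):
    """Build one corpus line: '<target> qid:<qid> 1:<f1> 2:<f2> ...#<name>'.

    Hypothetical helper: the layout is copied from the sample lines above.
    """
    parts = [str(target), f"qid:{qid}"]
    for index, value in enumerate(features, start=1):
        if value == -math.inf:
            # Replace an emerging value of negative infinity by the placeholder.
            value = NEG_INF_PLACEHOLDER
        parts.append(f"{index}:{value:g}")
    return " ".join(parts) + "#" + name


line = format_rank_line(5, 1, [38.13, 2, 14.6, 12, 0.8219, 0.34], "Phase(waves)")
print(line)
```

Called with the values of the first sample record, the helper reproduces that record's line.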

step 5) completing the information of the field initial list, which comprises the following steps:

step 5-1) using Wikidata, Baidu Encyclopedia and other data sources to match the technical nouns in the field initial list and collect related semantic information, such as Chinese names, explanations and category information, and constructing a detailed information record for each technical noun to form a field detailed list, for example:

"Edge computing, die casting, distributed computing, parallel computing, programmable mapping, mapping, Edge computing, means an open platform that integrates network, computing, storage, and application core capabilities near the object or data source to provide the nearest service, Edge computing to the nearest computing, storage, and application core capabilities"

step 6) carrying out secondary cleaning on the field detailed list, which comprises the following specific steps:

step 6-1) taking 100 emerging technical nouns from an existing word list, randomly extracting 200 entries from non-emerging-technology Wikipedia categories, constructing a detailed list record for each according to the method of step 5), and labeling each record according to whether it is a technology, to form a binary classification training corpus;

step 6-2) training on the binary classification training corpus with the fastText classification algorithm to obtain a technology binary classification model;

step 6-3) performing technology judgment on the field detailed list with the technology binary classification model to obtain a field technology list;

step 6-4) removing, according to the category information in the detailed list, entries whose categories are obviously not technical, such as countries, organizations, people and music albums, to obtain the final field technology list;
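As a non-limiting illustration of steps 6-1) and 6-4), the following sketch shows one way a labeled training line and the category rule could look. The label names, the record layout (a dict with `name` and `categories`), and the blocked category set are assumptions made for this example only; the `__label__` prefix is the default label prefix of fastText supervised training files.

```python
# Illustrative blocked categories, following the examples named in step 6-4).
NON_TECHNICAL_CATEGORIES = {"country", "organization", "person", "music album"}


def to_fasttext_line(record_text, is_technology):
    """Format one labeled training line for a fastText-style supervised corpus."""
    label = "__label__tech" if is_technology else "__label__nontech"
    # Collapse whitespace so that each record occupies exactly one line.
    return f"{label} {' '.join(record_text.split())}"


def filter_by_category(records, blocked=NON_TECHNICAL_CATEGORIES):
    """Keep only records none of whose categories is in the blocked set."""
    kept = []
    for record in records:
        categories = {c.lower() for c in record.get("categories", [])}
        if categories.isdisjoint(blocked):
            kept.append(record)
    return kept


records = [
    {"name": "Edge computing", "categories": ["Distributed computing"]},
    {"name": "France", "categories": ["Country"]},
]
print([r["name"] for r in filter_by_category(records)])
```

Keeping the rule filter separate from the classifier lets obviously non-technical entries be removed without relying on the model's judgment.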

Example 2

As shown in fig. 2, based on the above method, embodiment 2 of the present invention provides a domain-oriented evaluation prediction technology list generation system, comprising four trained models, a technical noun recognition and extraction module, a domain modeling module, an index evaluation module, a ranking module, a semantic completion module, and a secondary cleaning module; wherein the models include: a technical noun recognition model, a domain classification model, a ranking model and a technology binary classification model.

and the secondary cleaning module is used for inputting the field detailed list into the pre-trained technology binary classification model for technology judgment, and further filtering by combining a rule matching method to obtain the field technology list.

Finally, it should be noted that the above embodiments are only used for illustrating the technical solutions of the present invention and are not limiting. Although the present invention has been described in detail with reference to the embodiments, it will be understood by those skilled in the art that various changes may be made and equivalents may be substituted without departing from the spirit and scope of the invention as defined in the appended claims.

Claims

1. A domain-oriented evaluation prediction technology list generation method comprises the following steps:

2. The domain-oriented evaluation prediction technology list generation method according to claim 1, wherein the step 1) specifically comprises:

3. The domain-oriented evaluation prediction technology list generation method according to claim 2, wherein the step 4) specifically comprises:

step 4-1) for each technical noun w in the domain high-frequency technical noun word list, counting its yearly word frequency in the domain mapping corpus over the last 10 years, $Count_w = [c_1, c_2, \ldots, c_i, \ldots, c_{10}]$, where $c_i$ is the word frequency of the i-th year; the sequence $Exist_w$ indicating whether the technical noun appears in each year is calculated by the following formula:

$Exist_w = [e_1, e_2, \ldots, e_i, \ldots, e_{10}], \quad e_i = 1 \text{ if } c_i > 0 \text{ else } 0$

$Count_{w\_base} = \sum_{i=1}^{3} c_i$

$Count_{w\_active} = \sum_{i=4}^{10} c_i$

when the technical noun appears more than 3 times in $Exist_w$ and more than 7 times in $Count_w$, and $Count_{w\_active} / Count_{w\_base} > 2$ and $Count_{w\_active} / \sum Count_w < 0.15$, the emerging value $Escore_w$ of the technical noun is calculated by the following formula; otherwise, the emerging value $Escore_w$ of the technical noun is negative infinity:

$Escore_w = 2 \cdot APT + (RT + MYS)$

wherein APT is the active period trend of the technical noun:

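$Rate_w = [r_1, r_2, \ldots, r_i, \ldots, r_9], \quad r_i = 1 \text{ if } c_{i+1} - c_i > 0 \text{ else } -1$;

$Maturity_w = \left[ 5 + \frac{\sum_i r_i}{3} \right] \in \{6, 7, 8\}, \quad \text{if } \sum_i r_i > 2$

$Maturity_w = \left[ 10 - \frac{\delta}{\sum_i c_i} \right] \in \{9, 10\}, \quad \text{if } \sum_i r_i \le 2,\ \sum_i c_i \ge \delta$

$Maturity_w = \left[ 0.5 + \frac{5 \sum_i c_i}{\delta} \right] \in \{1, 2, \ldots, 5\}, \quad \text{if } \sum_i r_i \le 2,\ \sum_i c_i < \delta$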
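As a non-limiting illustration, the three maturity cases above can be sketched as follows. Two points are assumptions of this sketch and are not stated in the claim text: the square brackets are read as rounding to the nearest integer (a reading that is consistent with the three stated value ranges), and the threshold δ is passed in as the parameter `delta`. The function names are hypothetical.

```python
import math


def rate_sequence(counts):
    """r_i = 1 if c_(i+1) - c_i > 0 else -1, for consecutive yearly counts."""
    return [1 if later - earlier > 0 else -1
            for earlier, later in zip(counts, counts[1:])]


def round_half_up(x):
    """Assumed reading of the square brackets: round to the nearest integer."""
    return int(math.floor(x + 0.5))


def maturity(counts, delta):
    """Maturity of one technical noun from its yearly counts (oldest first)."""
    rate_sum = sum(rate_sequence(counts))
    total = sum(counts)
    if rate_sum > 2:
        # Mostly rising frequency: values 6 to 8.
        return round_half_up(5 + rate_sum / 3)
    if total >= delta:
        # Not rising, high total frequency: values 9 or 10.
        return round_half_up(10 - delta / total)
    # Not rising, low total frequency: values 1 to 5.
    return round_half_up(0.5 + 5 * total / delta)


print(maturity(list(range(1, 11)), delta=50))
```

With ten yearly counts the rate sequence has nine entries, so its sum is always odd and the first case is reached for sums of 3, 5, 7 or 9.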

step 4-3) combining the emerging value $Escore_w$ and the maturity value $Maturity_w$ of the technical noun with the index values of the total word frequency, the word frequency of the last three years, the ratio of the last-three-year word frequency to the total word frequency, and the ratio of the total word frequency to the field word frequency, to obtain the index vector of the technical noun w:

$Indicator_w = [Escore_w, Maturity_w, AllCount_w, \ldots]$

and further obtaining the field initial selection list corresponding to the domain high-frequency technical noun word list.
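As a non-limiting illustration, the frequency-based entries of the index vector can be sketched as follows. Several points are assumptions of this sketch: the yearly counts are ordered oldest first, so the last three entries are taken as the last three years; `field_total` is a hypothetical name for the field word frequency; and the order of the entries after the total word frequency simply follows the order in which step 4-3) lists them. The emerging value and maturity are passed in, because their component formulas are not all given in this section.

```python
def exist_sequence(counts):
    """e_i = 1 if c_i > 0 else 0."""
    return [1 if c > 0 else 0 for c in counts]


def base_active_counts(counts):
    """Base period = first 3 years, active period = years 4 to 10."""
    return sum(counts[:3]), sum(counts[3:])


def indicator_vector(counts, escore, maturity, field_total):
    """Assemble [Escore, Maturity, AllCount, recent count, two ratios]."""
    all_count = sum(counts)
    recent3 = sum(counts[-3:])  # assumes counts are ordered oldest first
    recent_ratio = recent3 / all_count if all_count else 0.0
    field_ratio = all_count / field_total if field_total else 0.0
    return [escore, maturity, all_count, recent3, recent_ratio, field_ratio]


counts = [0, 0, 1, 2, 3, 4, 5, 6, 7, 8]
print(exist_sequence(counts))
print(base_active_counts(counts))
print(indicator_vector(counts, escore=1.5, maturity=7, field_total=3600))
```

Guarding the two divisions avoids a division by zero for a noun or field with no recorded occurrences.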

4. The domain-oriented evaluation prediction technology list generation method according to claim 3, wherein the step 6) is specifically:

5. The method for generating a domain-oriented evaluation prediction technology list according to claim 2, further comprising a training step of a technical noun recognition model, specifically:

6. The domain-oriented evaluation prediction technology list generation method according to claim 1, further comprising a training step of a domain classification model, specifically:

randomly sampling the existing mapping corpus;

7. The domain-oriented evaluation prediction technology list generation method according to claim 1, further comprising a training step of a ranking model, specifically:

8. The domain-oriented evaluation prediction technology list generation method according to claim 6, further comprising a training step of the technology binary classification model, specifically:

9. A domain-oriented evaluation prediction technology list generation system, the system comprising: a trained technical noun recognition model, a domain classification model, a ranking model, a technology binary classification model, a technical noun recognition and extraction module, a domain modeling module, an index evaluation module, a ranking module, a semantic completion module and a secondary cleaning module; wherein,