CN112463928A - Technical list generation method and system for field evaluation prediction - Google Patents

Technical list generation method and system for field evaluation prediction

Info

Publication number
CN112463928A
Authority
CN
China
Prior art keywords
technical
domain
list
field
word
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011434352.XA
Other languages
Chinese (zh)
Inventor
毛彬
罗威
谭玉珊
罗准辰
武帅
钱旭
田昌海
叶宇铭
宋宇
胡明昊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
MILITARY SCIENCE INFORMATION RESEARCH CENTER OF MILITARY ACADEMY OF THE CHINESE PLA
Original Assignee
Military Science Information Research Center Of Military Academy Of Chinese Pla
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Military Science Information Research Center Of Military Academy Of Chinese Pla
Priority to CN202011434352.XA
Publication of CN112463928A
Legal status: Pending

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30: Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33: Querying
    • G06F16/3331: Query processing
    • G06F16/334: Query execution
    • G06F16/3344: Query execution using natural language analysis
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30: Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33: Querying
    • G06F16/3331: Query processing
    • G06F16/334: Query execution
    • G06F16/3346: Query execution using probabilistic model
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30: Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35: Clustering; Classification
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/20: Natural language analysis
    • G06F40/205: Parsing
    • G06F40/216: Parsing using statistical methods
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/20: Natural language analysis
    • G06F40/279: Recognition of textual entities
    • G06F40/289: Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295: Named entity recognition
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/30: Semantic analysis

Abstract

The invention discloses a domain-oriented method and system for generating a technology list for evaluation and prediction. The method comprises: extracting and recognizing technical nouns in massive scientific and technological intelligence texts to obtain a mapping corpus; classifying the mapping corpus with a pre-trained domain classification model to obtain a domain mapping corpus; aggregating technical-noun word frequencies over the domain mapping corpus and taking the top-ranked technical nouns to obtain a domain high-frequency technical noun list; calculating an emergence index and a maturity index for each technical noun in that list to obtain a domain primary selection list; ranking the domain primary selection list with a pre-trained ranking model and taking the top-ranked entries as a domain initial list; completing the information of the domain initial list from open-source knowledge bases to obtain a domain detailed list; and feeding the domain detailed list into a pre-trained binary technology classification model for technology judgment and further filtering, to obtain a domain technology list.

Description

Technical list generation method and system for field evaluation prediction
Technical Field
The invention relates to the fields of computational linguistics and natural language processing, and in particular to a domain-oriented method and system for generating a technology list for evaluation and prediction.
Background
Emerging technologies are a source of technological innovation. In national defense, competition among major powers grows fiercer by the day and opportunities are fleeting; the development of a new technology can break the strategic balance of offense and defense and overturn military technical thinking. Because emerging technologies carry high market and technical uncertainty, identifying them early is difficult. Traditional early identification relies mainly on expert judgment: it requires mobilizing many experts for investigation, involves a huge workload, can usually cover only a few technical fields, is limited by the experts' professional competence, insight and bias, and its accuracy is hard to evaluate. Fully mining the value of scientific and technological intelligence big data, finding clues to emerging technologies in time and evaluating their characteristics scientifically can effectively improve the efficiency of emerging technology identification and better combine human and machine in discerning directions and paths, which is of practical significance.
Learning to rank is a supervised learning approach. For given query-document pairs, the corresponding features are extracted and the document set and its true ordering under each query are obtained; a ranking model is then learned with one of the learning-to-rank algorithms so that the output ordering is as close as possible to the true ordering. SVMrank is a pairwise learning-to-rank algorithm: it converts the ranking problem into a classification problem and solves it with an SVM classification model. The basic idea of the pairwise approach is to consider the relative relevance of two documents under a given query.
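The pairwise idea can be illustrated with a minimal Python sketch. It is not SVMrank itself and not the patent's implementation; the function name `pairwise_examples` and the sample data are hypothetical. It only shows how a ranked document set for one query can be turned into binary classification examples, which a linear classifier such as an SVM could then learn from.

```python
from itertools import combinations

def pairwise_examples(features, relevance):
    """Turn one query's documents into binary classification examples.

    features:  list of feature vectors, one per document
    relevance: list of relevance grades (higher means more relevant)
    Returns (difference_vector, label) pairs with label +1 or -1.
    """
    examples = []
    for i, j in combinations(range(len(features)), 2):
        if relevance[i] == relevance[j]:
            continue  # ties carry no preference information
        diff = [a - b for a, b in zip(features[i], features[j])]
        label = 1 if relevance[i] > relevance[j] else -1
        examples.append((diff, label))
    return examples

# Hypothetical data: three documents for one query, two features each.
docs = [[3.0, 1.0], [1.0, 0.5], [2.0, 2.0]]
grades = [2, 0, 1]
pairs = pairwise_examples(docs, grades)
```

Each pair contributes one example whose label says which of the two documents should rank higher, so only relative order, not the absolute grade, is learned.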
The TagMe algorithm works as follows: an anchor data set is built from the link relations between entries in Wikipedia, and the relatedness between terms is calculated from their contextual co-occurrence; the input text is parsed for anchors to build a candidate anchor set, the overall relatedness of the candidate linked entities is calculated, and the candidate entity set with the highest overall relatedness is selected as the final entity linking result.
fastText is a word vector and text classification tool released by Facebook in 2016. As a fast text classifier, it sums and averages the word vectors and n-gram vectors of a whole document to obtain a document vector, which is then fed to a softmax multi-class classifier.
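The averaging-then-softmax scheme described above can be sketched in plain Python. This is a toy illustration under stated assumptions, not the fastText library: the embedding table, the weight matrix, the bigram joining with `_` and all function names are hypothetical, and the real tool additionally hashes n-grams and learns the vectors during training.

```python
import math

def ngrams(tokens, n=2):
    """Word n-grams of a token list, joined with an underscore."""
    return ["_".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def document_vector(tokens, embeddings, dim):
    """Average the vectors of all known words and bigrams in the document."""
    units = tokens + ngrams(tokens)
    vectors = [embeddings[u] for u in units if u in embeddings]
    if not vectors:
        return [0.0] * dim
    return [sum(column) / len(vectors) for column in zip(*vectors)]

def softmax(scores):
    """Numerically stable softmax over a list of class scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def classify(tokens, embeddings, weights, dim):
    """Linear layer over the document vector followed by softmax."""
    doc = document_vector(tokens, embeddings, dim)
    scores = [sum(w * x for w, x in zip(row, doc)) for row in weights]
    return softmax(scores)

# Hypothetical 2-dimensional embeddings and a 2-class weight matrix.
embeddings = {"laser": [1.0, 0.0], "weapon": [0.0, 1.0], "laser_weapon": [1.0, 1.0]}
weights = [[1.0, 1.0], [-1.0, -1.0]]
probs = classify(["laser", "weapon"], embeddings, weights, dim=2)
```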
Bi-LSTM-CRF is a sequence labeling algorithm for natural language that can be used for entity recognition. It extends the LSTM (long short-term memory) model to a bidirectional LSTM and combines it with a CRF (conditional random field) to better resolve ambiguity in sequence labeling.
The Emergence Score (EScore) algorithm aggregates statistics for a given term over the time dimension by defining two periods and a set of admission conditions. It computes three rates of change, namely the Active Period Trend, the Recent Trend and the mid-to-recent change rate (Mid-Year to Last-Year Slope), and from them the emergence value of the term. The two periods are the base period and the active period; the base period is usually defined as the first 3 years and the active period as the last 7 years. The admission conditions pre-screen the candidate term set: 1) the term occurs over a time span of at least 3 years; 2) it occurs at least 7 times; 3) the ratio of its active-period frequency to its base-period frequency is at least 2:1; 4) its share of the total frequency in the base period does not exceed 15%. The calculation is as follows:
[Equation image BDA0002827666680000021 not recovered]
[Equation image BDA0002827666680000022 not recovered]
[Equation image BDA0002827666680000023 not recovered]
EScore = 2 × ActivePeriodTrend + (RecentTrend + MidYearToLastYearSlope), where RecordCount_i is the number of records in year i, ActivePeriodTrend is the active period trend, RecentTrend is the recent trend, and MidYearToLastYearSlope is the mid-to-recent change rate.
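The admission screening and the final combination can be sketched as follows. This is a minimal sketch under stated assumptions, not the patent's implementation: the three trend values are taken as inputs because their defining equations are not available in this text; "time span of at least 3 years" is read as "occurs in at least 3 distinct years"; the 15% condition is read as the base period's share of the term's total frequency; and a term absent from the base period is treated as satisfying the 2:1 ratio. All function names are hypothetical.

```python
import math

BASE_YEARS = 3    # base period: first 3 years
ACTIVE_YEARS = 7  # active period: last 7 years

def passes_admission(counts):
    """counts: ten yearly frequencies of one term, oldest year first."""
    if len(counts) != BASE_YEARS + ACTIVE_YEARS:
        raise ValueError("expected 10 yearly counts")
    base = sum(counts[:BASE_YEARS])
    active = sum(counts[BASE_YEARS:])
    total = base + active
    years_present = sum(1 for c in counts if c > 0)
    if years_present < 3 or total < 7:
        return False
    # Assumption: with no base-period occurrences the 2:1 ratio counts as met.
    if base > 0 and active / base < 2:
        return False
    # Assumption: 15% is the base period's share of the term's total frequency.
    return base / total <= 0.15

def escore(counts, apt, rt, mys):
    """Combine the three trends; rejected terms get negative infinity."""
    if not passes_admission(counts):
        return -math.inf
    return 2 * apt + (rt + mys)
```

For example, a term with counts `[0, 0, 1, 1, 2, 3, 4, 6, 8, 10]` passes the screening, while one concentrated in the first three years, such as `[5, 5, 5, 1, 1, 1, 1, 1, 1, 1]`, is rejected.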
Disclosure of Invention
The invention aims to overcome the shortcomings of the prior art by providing a domain-oriented method and system for generating a technology list for evaluation and prediction. The list is generated automatically in a data-driven way, assists researchers in further technology evaluation and prediction, and supports domain technology planning and decision-making. The method is suited to automatically generating technology lists for domain technology evaluation and prediction.
To achieve the above object, the invention provides a domain-oriented method for generating a technology list for evaluation and prediction, the method comprising:
step 1) extracting and recognizing technical nouns in massive scientific and technological intelligence texts to obtain a mapping corpus;
step 2) classifying the mapping corpus with a pre-trained domain classification model to obtain a domain mapping corpus;
step 3) aggregating technical-noun word frequencies over the domain mapping corpus and taking the top-ranked technical nouns to obtain a domain high-frequency technical noun list;
step 4) calculating an emergence index and a maturity index for each technical noun in the domain high-frequency technical noun list to obtain a domain primary selection list;
step 5) ranking the domain primary selection list with a pre-trained ranking model and taking the top-ranked entries as a domain initial list;
step 6) completing the information of the domain initial list from open-source knowledge bases to obtain a domain detailed list;
step 7) feeding the domain detailed list into a pre-trained binary technology classification model for technology judgment, and further filtering with a rule matching method to obtain a domain technology list.
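Step 3) of the list above, word frequency aggregation followed by a top-N cut, can be sketched in a few lines of Python. This is an illustrative sketch only: the representation of each text as a dictionary of (possibly weighted) term frequencies and the function name `top_terms` are assumptions, and the sample terms are invented.

```python
from collections import Counter

def top_terms(corpus, n):
    """Aggregate term frequencies over all texts and return the n most frequent terms.

    corpus: iterable of texts, each given as a dict {term: frequency}.
    """
    totals = Counter()
    for term_frequencies in corpus:
        totals.update(term_frequencies)  # adds the frequencies term by term
    return [term for term, _ in totals.most_common(n)]

# Hypothetical domain mapping corpus with two texts.
corpus = [
    {"quantum radar": 2.0, "drone swarm": 1.0},
    {"drone swarm": 0.8, "hypersonic glide": 0.4},
]
high_frequency = top_terms(corpus, 2)
```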
As an improvement of the above method, the step 1) specifically includes:
step 1-1) performing noun phrase recognition on the massive scientific and technological intelligence texts, then performing entity linking with the TagMe algorithm to identify the set of normalized entity words in the texts, and mapping the entity word set to the texts;
step 1-2) matching the massive scientific and technological intelligence texts against a previously accumulated technical word list to extract a technical word set, and mapping the technical word set to the texts;
step 1-3) recognizing technical nouns in the scientific and technological texts with a pre-trained technical noun recognition model to obtain a technical noun set, and mapping the technical noun set to the texts;
step 1-4) assigning different word frequency weights to the entity word set, the technical word set and the technical noun set according to their credibility, to obtain the mapping corpus.
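The credibility weighting of step 1-4) might look like the sketch below. The weights 0.8, 1 and 0.4 follow the values given for entity words, technical words and technical noun phrases in Example 1; everything else is an assumption made for illustration, in particular the source names, the function name and the rule of keeping the highest weighted count when several extractors report the same term (the text does not say how overlaps are combined).

```python
from collections import defaultdict

# Credibility weights per extraction source, taken from Example 1.
WEIGHTS = {"entity": 0.8, "tech_word": 1.0, "tech_noun": 0.4}

def weighted_frequencies(extractions):
    """Merge the term counts of one text into weighted frequencies.

    extractions: {source_name: {term: raw count}}
    """
    merged = defaultdict(float)
    for source, counts in extractions.items():
        for term, count in counts.items():
            # Assumption: overlapping terms keep their highest weighted count.
            merged[term] = max(merged[term], WEIGHTS[source] * count)
    return dict(merged)

# Hypothetical extraction results for a single text.
text_terms = weighted_frequencies({
    "entity": {"laser weapon": 2},
    "tech_word": {"laser weapon": 1, "railgun": 3},
    "tech_noun": {"plasma shield": 5},
})
```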
As an improvement of the above method, the step 4) specifically includes:
step 4-1) for each technical noun $w$ in the domain high-frequency technical noun list, counting its annual word frequency over the last 10 years in the domain mapping corpus, $Count_w = [c_1, c_2, \ldots, c_i, \ldots, c_{10}]$, where $c_i$ is the word frequency in year $i$; the occurrence sequence $Exist_w$ of the technical noun is calculated by the following formula:
$$Exist_w = [e_1, e_2, \ldots, e_i, \ldots, e_{10}], \qquad e_i = \begin{cases} 1, & c_i > 0 \\ 0, & \text{otherwise} \end{cases}$$
where $e_i$ indicates whether the technical noun appears in year $i$: 1 if it appears, otherwise 0;
the total word frequency $Count_{w\_base}$ of the technical noun in the base period is calculated by the following formula:
$$Count_{w\_base} = \sum_{i=1}^{3} c_i$$
the total word frequency $Count_{w\_active}$ of the technical noun in the active period is calculated by the following formula:
$$Count_{w\_active} = \sum_{i=4}^{10} c_i$$
if $\sum Exist_w > 3$, $\sum Count_w > 7$, $Count_{w\_active}/Count_{w\_base} > 2$ and $Count_{w\_active}/\sum Count_w < 0.15$, the emergence value $Escore_w$ of the technical noun is calculated by the following formula; otherwise $Escore_w$ is negative infinity:
$$Escore_w = 2 \cdot APT + (RT + MYS)$$
where APT is the active period trend of the technical noun:
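[Equation image BDA0002827666680000041 not recovered: definition of APT]
the recent trend RT of the technical noun is calculated by the following formula:
[Equation image BDA0002827666680000042 not recovered: definition of RT]
when $c_7 - c_4 = 0$, $c_7 - c_4$ is set to 1; the mid-to-recent change rate MYS of the technical noun is calculated by the following formula:
[Equation image BDA0002827666680000043 not recovered: definition of MYS]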
step 4-2) the annual growth sequence $Rate_w$ of the technical noun is calculated by the following formula:
$$Rate_w = [r_1, r_2, \ldots, r_i, \ldots, r_9], \qquad r_i = \begin{cases} 1, & c_{i+1} - c_i > 0 \\ -1, & \text{otherwise} \end{cases}$$
where $r_i$ indicates whether the word frequency grew from year $i$ to year $i+1$: 1 if it grew, otherwise -1;
the maturity value $Maturity_w$ of the technical noun is calculated by the following formula:
$$Maturity_w = \begin{cases} [5 + \sum Rate_w / 3] \in \{6, 7, 8\}, & \sum Rate_w > 2 \\ [10 - \delta / \sum c_i] \in \{9, 10\}, & \sum r_i \le 2,\ \sum c_i \ge \delta \\ [0.5 + 5 (\sum c_i) / \delta] \in \{1, 2, \ldots, 5\}, & \sum r_i \le 2,\ \sum c_i < \delta \end{cases}$$
where $\delta$ is an empirical threshold, set to one percent of the total number of texts in the domain over the last decade;
step 4-3) from the emergence value $Escore_w$ and the maturity value $Maturity_w$ of the technical noun, together with further index values (total word frequency, word frequency over the last three years, ratio of the last-three-year word frequency to the total word frequency, and ratio of the total word frequency to the domain word frequency), obtaining the indicator vector $Indicator_w = [Escore_w, Maturity_w, AllCount_w, \ldots]$ of technical noun $w$, and thereby the domain primary selection list corresponding to the domain high-frequency technical noun list.
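The maturity calculation of step 4-2) can be sketched as below. One point is an assumption rather than something the text states: the square brackets are read as rounding to the nearest integer with halves rounded up, which is the reading under which all three branches produce exactly the value sets given ({6, 7, 8}, {9, 10} and {1, ..., 5}). The function names are hypothetical and $\delta$ is assumed to be positive.

```python
import math

def round_half_up(x):
    """Assumed meaning of the square brackets: round to nearest, halves up."""
    return math.floor(x + 0.5)

def growth_sequence(counts):
    """Rate_w: +1 where the frequency grew from one year to the next, else -1."""
    return [1 if later - earlier > 0 else -1
            for earlier, later in zip(counts, counts[1:])]

def maturity(counts, delta):
    """Maturity_w for ten yearly counts and a positive empirical threshold delta."""
    rate_sum = sum(growth_sequence(counts))
    total = sum(counts)
    if rate_sum > 2:
        return round_half_up(5 + rate_sum / 3)      # growing: 6 to 8
    if total >= delta:
        return round_half_up(10 - delta / total)    # frequent, not growing: 9 or 10
    return round_half_up(0.5 + 5 * total / delta)   # infrequent, not growing: 1 to 5
```

With ten steadily rising counts `[1, 2, ..., 10]` the growth sum is 9 and the value is 8; the same counts in falling order give 9 for a threshold of 50 and 2 for a threshold of 200.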
As an improvement of the above method, the step 6) is specifically:
based on open-source knowledge bases, the technical nouns in the domain initial list are matched against the semantic information of the knowledge bases; the Chinese name, English name, explanation and category information are extracted to complete each technical word, yielding the domain detailed list.
As an improvement of the above method, the method further includes a training step of a technical noun recognition model, specifically:
extracting, from the massive scientific and technological intelligence texts, the set of sentences that contain technical words from a previously accumulated technical word list;
labeling the sentence set as sequences with the labels [B-tech, I-tech, O], where B-tech marks the first character of a technical word, I-tech marks the other characters of a technical word, and O marks characters or punctuation outside technical words;
training with the Bi-LSTM-CRF algorithm to obtain the technical noun recognition model.
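The labeling scheme above can be illustrated with a short Python sketch that tags each character of a sentence from a technical word list. The longest-match-first strategy and the function name `bio_tags` are assumptions made for the example; the text does not specify how overlapping matches are resolved.

```python
def bio_tags(sentence, vocabulary):
    """Label each character with B-tech, I-tech or O using longest-match lookup."""
    tags = ["O"] * len(sentence)
    words = sorted(vocabulary, key=len, reverse=True)  # try longer words first
    i = 0
    while i < len(sentence):
        match = next((w for w in words if w and sentence.startswith(w, i)), None)
        if match is None:
            i += 1
            continue
        tags[i] = "B-tech"                      # first character of the word
        for k in range(i + 1, i + len(match)):
            tags[k] = "I-tech"                  # remaining characters of the word
        i += len(match)
    return tags

# Hypothetical sentence and word list.
labels = bio_tags("量子雷达很新", {"量子雷达", "量子"})
```

The resulting (character, label) sequences are the kind of training input a sequence labeling model such as Bi-LSTM-CRF consumes.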
As an improvement of the above method, the method further includes a training step of a domain classification model, specifically:
randomly sampling the existing mapping corpus;
having domain experts extract keywords characteristic of the domain and retrieve the randomly sampled mapping corpus with them to obtain a training set, thereby constructing a domain training corpus;
having domain experts manually label each intelligence text in the domain training corpus as a domain text or not, and constructing a classification training corpus from the texts and labels;
training with the fastText classification algorithm to obtain the domain classification model.
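For supervised training, fastText reads one example per line with the label prefixed by `__label__`. A minimal sketch of preparing the expert-labeled texts in that form is given below; the label names `domain` and `other` and the function name are hypothetical, and Chinese texts would additionally need word segmentation before this step.

```python
def to_fasttext_line(text, is_domain):
    """Format one labeled text as a fastText supervised training line."""
    label = "__label__domain" if is_domain else "__label__other"
    cleaned = " ".join(text.split())  # collapse whitespace so the example stays on one line
    return f"{label} {cleaned}"

# Hypothetical labeled texts from the domain training corpus.
lines = [
    to_fasttext_line("hypersonic  glide\nvehicle test", True),
    to_fasttext_line("quarterly retail sales report", False),
]
```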
As an improvement of the above method, the method further includes a training step of the ranking model, specifically:
randomly sampling an existing domain primary selection list to obtain a ranking training set;
having domain experts score the ranking training set by relevance and importance, replacing emergence values of negative infinity with a negative empirical value, obtaining the ranking result by weighted averaging, and constructing the ranking training corpus;
training with the SVMrank ranking algorithm to obtain the ranking model.
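SVMrank reads training data as lines of the form `<target> qid:<qid> <feature>:<value> ... # <comment>`. The sketch below writes one such line per technical noun and applies the substitution for negative infinity mentioned above. The substitute value of -100 is a hypothetical placeholder (the text only says a negative empirical value is used), and the function name and sample feature values are likewise invented.

```python
import math

NEG_INF_SUBSTITUTE = -100.0  # hypothetical empirical value standing in for -infinity

def svmrank_line(target, qid, features, comment=""):
    """Format one ranking example in the SVMrank input format."""
    parts = [str(target), f"qid:{qid}"]
    for index, value in enumerate(features, start=1):
        if value == -math.inf:
            value = NEG_INF_SUBSTITUTE  # SVMrank needs finite feature values
        parts.append(f"{index}:{value:g}")
    line = " ".join(parts)
    return f"{line} # {comment}" if comment else line

# Hypothetical record: expert rank 3, features [Escore, Maturity, AllCount].
record = svmrank_line(3, 1, [2.75, 8, 120], "quantum radar")
```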
As an improvement of the above method, the method further comprises: the step of training the technology two classification model specifically comprises the following steps:
constructing, based on the open-source knowledge bases, a noun list containing some technical nouns;
labeling each record of the existing detailed list as a technical noun or not, to serve as the training corpus;
training with the fastText classification algorithm to obtain the binary technology classification model.
A domain-oriented system for generating a technology list for evaluation and prediction, the system comprising: a trained technical noun recognition model, domain classification model, ranking model and binary technology classification model, together with a technical noun recognition and extraction module, a domain modeling module, an index evaluation module, a ranking module, a semantic completion module and a secondary cleaning module; wherein
the technical noun recognition and extraction module is used for extracting and recognizing technical nouns in massive scientific and technological intelligence texts to obtain a mapping corpus;
the domain modeling module is used for classifying the mapping corpus with the pre-trained domain classification model to obtain a domain mapping corpus;
the index evaluation module is used for aggregating technical-noun word frequencies over the domain mapping corpus, taking the top-ranked technical nouns to obtain a domain high-frequency technical noun list, and calculating an emergence index and a maturity index for each technical noun in that list to obtain a domain primary selection list;
the ranking module is used for ranking the domain primary selection list with the pre-trained ranking model and taking the top-ranked entries as a domain initial list;
the semantic completion module is used for completing the information of the domain initial list from open-source knowledge bases to obtain a domain detailed list;
the secondary cleaning module is used for feeding the domain detailed list into the pre-trained binary technology classification model for technology judgment, and further filtering with a rule matching method to obtain a domain technology list.
Compared with the prior art, the invention has the advantages that:
1. the invention provides a data-driven framework for automatically generating technology lists for domain-oriented technology evaluation and prediction tasks;
2. the invention raises the degree of machine intelligence in domain technology evaluation and prediction, exploits the data advantage of big data, and removes the limitation that traditional technology evaluation and prediction depends heavily on experts;
3. the invention provides a general processing flow that has strong horizontal extensibility and leaves ample room for upgrading and optimization; each processing stage can be strengthened with more advanced optimization strategies;
4. compared with a traditional expert prediction list, the technology list generated by the invention carries more objective index evidence and relatively rich semantic information, which improves the scientific rigor and objectivity of the list.
Drawings
Fig. 1 is a flowchart of the domain-oriented method for generating a technology list for evaluation and prediction according to embodiment 1 of the invention;
Fig. 2 is a block diagram of the domain-oriented system for generating a technology list for evaluation and prediction according to embodiment 2 of the invention.
Detailed Description
The technical route of the method of the invention is as follows:
step 1) extract and recognize technical nouns in massive scientific and technological intelligence texts and construct the mapping corpus $D_{all}$;
step 2) perform domain modeling on the mapping corpus $D_{all}$ and construct the domain mapping corpus $D_{domain}$;
step 3) aggregate technical-noun word frequencies over the domain mapping corpus $D_{domain}$ and take the top N to obtain the domain high-frequency technical noun list $List_{tech}$;
step 4) for each technical noun in $List_{tech}$, calculate indexes such as "emergence" and "maturity" to obtain the domain primary selection list $List_{original}$;
step 5) randomly sample $List_{original}$ to obtain $List_{train}$, obtain $Order_{train}$ by manual labeling by domain experts, and train on $List_{train}$ and $Order_{train}$ with the SVMrank algorithm to obtain the ranking model $Model_{SVMrank}$;
step 6) rank the domain primary selection list $List_{original}$ with $Model_{SVMrank}$ and take the top K as the domain initial list $List_{init}$;
step 7) complete the information of $List_{init}$ from open-source knowledge bases to obtain the domain detailed list $List_{detail}$;
step 8) based on the open-source knowledge bases, construct a detailed list $FList_{detail}$ containing positive and negative examples of technologies, and train on $FList_{detail}$ with the fastText classification algorithm to obtain the binary technology classification model $Model_{classify}$;
step 9) use $Model_{classify}$ to perform technology judgment on the domain detailed list $List_{detail}$, and further filter with a rule matching method to obtain the domain technology list $List_{tech-domain}$.
The step 1) specifically comprises the following steps:
step 1-1) after noun phrase recognition on the massive scientific and technological intelligence texts, perform entity linking with the TagMe algorithm, identify the set $W_{entity}$ of all normalized entity words in the texts, and map it to the texts;
step 1-2) match the massive scientific and technological intelligence texts against the accumulated technical word list, extract the technical words they contain to obtain the technical word set $W_{tech}$, and map it to the texts;
step 1-3) use the existing technical word list to match the massive scientific and technological intelligence texts and obtain the sentence set $S_{tech}$ containing technical words; label $S_{tech}$ as sequences with labels such as [B-tech, I-tech, O], and train the technical noun recognition model $Model_{detect}$ with the Bi-LSTM-CRF algorithm;
step 1-4) use $Model_{detect}$ to recognize technical nouns in the scientific and technological texts, obtain the technical noun set $W_{tech-norm}$, and map it to the texts;
step 1-5) assign different word frequency weights to the entity word set $W_{entity}$, the technical word set $W_{tech}$ and the technical noun set $W_{tech-norm}$ according to their credibility, to obtain the mapping corpus $D_{all}$.
The step 2) specifically comprises the following steps:
step 2-1) randomly sample the mapping corpus $D_{all}$ to obtain training set 1, $D_{train1}$; have domain experts extract keywords characteristic of the domain and retrieve $D_{all}$ with them to obtain training set 2, $D_{train2}$; together these form the domain training corpus $D_{train}$;
step 2-2) obtain the labels $tag_{train}$ by manual labeling by domain experts, marking each intelligence text in the domain training corpus as a domain text or not, and construct the classification training corpus; train on $D_{train}$ and $tag_{train}$ with the fastText classification algorithm to obtain the domain classification model $Model_{domain-detect}$;
step 2-3) classify the scientific and technological intelligence texts with $Model_{domain-detect}$ to obtain the domain mapping corpus $D_{domain}$.
The step 4) specifically comprises the following steps:
step 4-1) for each word $w$ in the domain high-frequency technical noun list $List_{tech}$, count its annual word frequency over the last 10 years in the domain mapping corpus $D_{domain}$ to obtain $Count_w = [c_1, c_2, \ldots, c_{10}]$; the emergence value $Escore_w$ of the word is obtained as follows:
$$Exist_w = [e_1, e_2, \ldots, e_{10}], \qquad e_i = \begin{cases} 1, & c_i > 0 \\ 0, & \text{otherwise} \end{cases}$$
$$Count_{w\_base} = \sum_{i=1}^{3} c_i$$
$$Count_{w\_active} = \sum_{i=4}^{10} c_i$$
If $\sum Exist_w > 3$, $\sum Count_w > 7$, $Count_{w\_active}/Count_{w\_base} > 2$ and $Count_{w\_active}/\sum Count_w < 0.15$, the emergence value is calculated by the following formulas; otherwise the emergence value of the word is negative infinity:
[Equation image BDA0002827666680000081 not recovered]
[Equation image BDA0002827666680000082 not recovered]
[Equation image BDA0002827666680000083 not recovered]
$$Escore_w = 2 \cdot APT + (RT + MYS)$$
where $Exist_w$ is the sequence indicating in which years the word occurs, $e_i$ indicates whether the word occurs in year $i$, $c_i$ is the number of records in which the word occurs in year $i$, $Count_{w\_base}$ is the total word frequency of the word in the base period, $Count_{w\_active}$ is its total word frequency in the active period, APT is its active period trend, RT is its recent trend, and MYS is its mid-to-recent change rate. To make the emergence formula generally applicable, two special cases are handled: zero word frequencies are replaced with a minimal value so that division is well defined, and a zero value of $c_7 - c_4$ is replaced with 1;
step 4-2) for each word $w$ in the domain high-frequency technical noun list $List_{tech}$, count its annual word frequency over the last 10 years in the domain mapping corpus $D_{domain}$ to obtain $Count_w = [c_1, c_2, \ldots, c_{10}]$; the maturity value $Maturity_w$ of the word is obtained as follows:
$$Rate_w = [r_1, r_2, \ldots, r_9], \qquad r_i = \begin{cases} 1, & c_{i+1} - c_i > 0 \\ -1, & \text{otherwise} \end{cases}$$
$$Maturity_w = \begin{cases} [5 + \sum Rate_w / 3] \in \{6, 7, 8\}, & \sum Rate_w > 2 \\ [10 - \delta / \sum c_i] \in \{9, 10\}, & \sum r_i \le 2,\ \sum c_i \ge \delta \\ [0.5 + 5 (\sum c_i) / \delta] \in \{1, 2, \ldots, 5\}, & \sum r_i \le 2,\ \sum c_i < \delta \end{cases}$$
where $c_i$ is the word frequency of the word in year $i$, $Rate_w$ is the sequence indicating whether the word grew year over year, $r_i$ indicates whether the word frequency grew from year $i$ to year $i+1$, and $\delta$ is an empirical threshold whose suggested value is one hundredth of the total number of texts in the domain over the last decade.
step 4-3) from the index values of each technical noun calculated in step 4-1) and step 4-2), together with other index values such as the total word frequency, the word frequency over the last three years, the ratio of the last-three-year word frequency to the total word frequency, and the ratio of the total word frequency to the domain word frequency, obtain $Indicator_w = [Escore_w, Maturity_w, AllCount_w, \ldots]$ and thereby the domain primary selection list $List_{original}$.
The step 5) is specifically as follows:
randomly sample the domain primary selection list $List_{original}$ to obtain the training list $List_{original\_train}$; obtain the ordering $Order_{train}$ of the training set by manual labeling by domain experts; construct the training data records in the following format:
Order_w qid:w 1:Escore_w 2:Maturity_w 3:AllCount_w 4:[expression image BDA0002827666680000091 not recovered] 5:[expression image BDA0002827666680000092 not recovered] 6:[expression image BDA0002827666680000093 not recovered] ... # tech_norm
and train with the SVMrank ranking algorithm to obtain the ranking model $Model_{SVMrank}$.
The step 7) is specifically as follows:
based on open-source knowledge bases, the technical nouns in the domain initial list $List_{init}$ are matched against the rich semantic information of the knowledge bases, and information including the Chinese name, English name and explanation is extracted to complete each technical word, yielding the domain detailed list $List_{detail}$; the open-source knowledge bases include Wikipedia, Baidu Baike, Jane's Defence and various intelligence corpora.
The step 8) is specifically as follows:
based on the open-source knowledge bases, construct a noun list NormList containing some technical nouns and apply step 7) to it to obtain the detailed list $FList_{detail}$; label each record of $FList_{detail}$ as a technical noun or not to form the training corpus, and train with the fastText classification algorithm to obtain the binary technology classification model $Model_{classify}$.
The technical solution of the present invention will be described in detail below with reference to the accompanying drawings and examples.
Example 1
As shown in fig. 1, embodiment 1 of the present invention discloses a domain-oriented method for generating a list of evaluation and prediction technologies, which specifically includes:
step 1) extracting and identifying technical nouns of massive scientific and technical text information to obtain mapping corpora, which is as follows:
step 1-1) training a tagme entity linking tool for Wikipedia data, performing noun phrase recognition on massive scientific and technological information, then performing entity linking to obtain normalized noun phrase entity words, and mapping the normalized noun phrase entity words with a scientific and technological text;
step 1-2) collecting technical nouns from emerging technical categories in Wikipedia to construct a technical word list, then matching and extracting the technical words existing in the technical word list from massive technical information texts, and mapping the technical words with the technical texts;
step 1-3) extracting a sentence set containing the technical words from the massive scientific and technological information text according to the technical word list, carrying out sequence labeling on the sentence set by using [ B-tech, I-tech, O ] labels, training a technical noun phrase recognition model by adopting a Bi-LSTM-CRF algorithm, extracting technical noun phrases in the massive scientific and technological information text, and mapping the technical noun phrases with the scientific and technological text;
step 1-4) assigning word frequency credibility weights of 0.8, 1 and 0.4 to the noun phrase entity words, the technical words and the technical noun phrases respectively, to form a unified mapping corpus;
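Steps 1-3) and 1-4) can be illustrated with a short sketch. It assumes character-level [B-tech, I-tech, O] labeling produced by plain string matching against the technical word list, and it assumes that each extracted mention contributes its source weight (0.8, 1 or 0.4) to the term's word frequency. The names `bio_tag`, `weighted_frequencies` and the source keys are hypothetical.

```python
from collections import defaultdict

# Credibility weights from step 1-4); the dictionary keys are illustrative names
# for the three extraction sources (entity linking, word list, recognition model).
SOURCE_WEIGHTS = {"entity": 0.8, "vocab": 1.0, "model": 0.4}


def bio_tag(sentence, tech_terms):
    """Label every character of a sentence with B-tech, I-tech or O."""
    tags = ["O"] * len(sentence)
    # Longest terms first, so a long term is not broken up by a shorter one.
    for term in sorted(tech_terms, key=len, reverse=True):
        if not term:
            continue
        start = sentence.find(term)
        while start != -1:
            end = start + len(term)
            if all(tag == "O" for tag in tags[start:end]):
                tags[start] = "B-tech"
                for i in range(start + 1, end):
                    tags[i] = "I-tech"
            start = sentence.find(term, start + 1)
    return tags


def weighted_frequencies(mentions):
    """Sum credibility-weighted counts over (term, source) mentions of one text."""
    totals = defaultdict(float)
    for term, source in mentions:
        totals[term] += SOURCE_WEIGHTS[source]
    return dict(totals)
```

The tagged sentences would serve as training data for the Bi-LSTM-CRF recognizer, which is not reproduced here.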
step 2) performing domain modeling on the mapping corpus, which comprises the following specific steps:
step 2-1) obtaining a mapping corpus training set from the mapping corpus by random sampling and by domain keyword retrieval sampling, having a domain expert label the training set with binary domain labels to construct a classification training corpus of information texts and domain labels, and training with the fasttext classification algorithm to obtain a domain classification model;
step 2-2) classifying the scientific and technological information texts by using a domain classification model to obtain a domain mapping corpus;
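As a sketch of how the expert-labeled texts of step 2-1) might be prepared for fastText: its supervised mode reads one example per line, with the label marked by the `__label__` prefix. The label names `domain` and `other` and the helper names are assumptions, and the text is assumed to be already tokenized into space-separated words.

```python
def to_fasttext_line(text, is_domain):
    """Format one labeled information text as a fastText supervised training line."""
    label = "__label__domain" if is_domain else "__label__other"
    # Collapse newlines and repeated spaces: fastText expects one example per line.
    cleaned = " ".join(text.split())
    return f"{label} {cleaned}"


def build_training_corpus(labeled_texts):
    """labeled_texts: iterable of (text, is_domain) pairs from the domain expert."""
    return [to_fasttext_line(text, flag) for text, flag in labeled_texts]


# The lines would be written to a file and passed to the fastText Python
# bindings, for example fasttext.train_supervised(input="train.txt");
# that call is not executed in this sketch.
```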
step 3) performing index evaluation on the domain mapping corpus, which specifically comprises the following steps:
step 3-1), performing word frequency aggregation statistics on the technical nouns of the domain mapping corpus, and taking the top 1000 items as domain high-frequency technical nouns;
step 3-2) counting the annual word frequency sequence of each high-frequency technical noun over the last ten years in the domain mapping corpus, then calculating the emerging value and the maturity of each high-frequency technical noun, and computing indexes such as the total word frequency, the word frequency of the last three years, the ratio of the last-three-year word frequency to the total word frequency, and the ratio of the total word frequency to the domain word frequency, to form a primary selection list, wherein part of the results are shown in the following table:
Figure BDA0002827666680000111
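The index evaluation of step 3) can be sketched using the definitions given in claim 3 for the appearance sequence, the base-period and active-period totals, the growth sequence and the maturity value. Two points are assumptions: the square brackets in the maturity formulas are read as rounding to the nearest integer (half up), which is consistent with the value ranges the claim states, and all function names are hypothetical. The emerging value is not sketched because its trend terms are given only as formula images.

```python
import math
from collections import Counter


def top_terms(term_counts, n=1000):
    """Return the n most frequent technical nouns (step 3-1)."""
    return [term for term, _ in Counter(term_counts).most_common(n)]


def exist_sequence(counts):
    """e_i = 1 if the noun appears in year i, else 0."""
    return [1 if c > 0 else 0 for c in counts]


def base_count(counts):
    """Total word frequency in the base period (years 1 to 3)."""
    return sum(counts[:3])


def active_count(counts):
    """Total word frequency in the active period (years 4 to 10)."""
    return sum(counts[3:10])


def rate_sequence(counts):
    """r_i = 1 if the frequency grows from year i to year i+1, else -1."""
    return [1 if later - earlier > 0 else -1
            for earlier, later in zip(counts, counts[1:])]


def round_half_up(x):
    # Assumed reading of the square brackets in the maturity formulas.
    return math.floor(x + 0.5)


def maturity(counts, delta):
    """Maturity value in 1..10; delta is the empirical threshold of claim 3."""
    if delta <= 0:
        raise ValueError("delta must be positive")
    rate_sum = sum(rate_sequence(counts))
    total = sum(counts)
    if rate_sum > 2:
        return round_half_up(5 + rate_sum / 3)       # values 6 to 8
    if total >= delta:
        return round_half_up(10 - delta / total)     # values 9 to 10
    return round_half_up(0.5 + 5 * total / delta)    # values 1 to 5
```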
step 4) sorting the primary selection list, which comprises the following specific steps:
step 4-1) randomly sampling the primary selection list to obtain a sequencing training set, scoring the sequencing training set according to relevance and importance through a field expert, obtaining a sequencing result after weighted average, and constructing a sequencing training corpus in the following format:
5 qid:1 1:38.13 2:2 3:14.6 4:12 5:0.8219 6:0.34#Phase(waves)
8 qid:2 1:29.87 2:1 3:6.7 4:6 5:0.8955 6:0.0016#Dielectric resonator antenna
the meaning is as follows:
Phase (waves): the emerging degree index is 38.13, the maturity index is 2, word frequency 1 is 14.6, word frequency 2 is 12, proportion 1 is 82.19%, proportion 2 is 0.34%, and the ranking after weighting the expert scores is 5;
Dielectric resonator antenna: the emerging degree index is 29.87, the maturity index is 1, word frequency 1 is 6.7, word frequency 2 is 6, proportion 1 is 89.55%, proportion 2 is 0.16%, and the ranking after weighting the expert scores is 8;
wherein an emerging degree of negative infinity is replaced by a large negative empirical value (-9999 is used in the experiment), and the ranking training corpus is trained with the SVMrank ranking algorithm to obtain a ranking model;
step 4-2) sequencing the primary selection list by utilizing a sequencing model, and taking the top 100 as a field initial list;
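The training lines of step 4-1) follow the SVMrank input layout: a target value, a `qid`, numbered features and a trailing comment. A minimal sketch of producing such a line, including the substitution of -9999 for a negative-infinity emerging value, is given below. The function name is hypothetical, and the sketch puts spaces around `#`, whereas the sample lines above write it without spaces.

```python
import math

# Empirical stand-in for an emerging value of negative infinity (step 4-1).
NEG_INF_SUBSTITUTE = -9999


def format_rank_line(rank, qid, features, name):
    """Format one primary-selection record as an SVMrank training line."""
    parts = [str(rank), f"qid:{qid}"]
    for index, value in enumerate(features, start=1):
        if math.isinf(value) and value < 0:
            value = NEG_INF_SUBSTITUTE
        # The 'g' format drops trailing zeros, e.g. 2.0 -> "2".
        parts.append(f"{index}:{value:g}")
    return " ".join(parts) + f" # {name}"
```

For example, `format_rank_line(5, 1, [38.13, 2, 14.6, 12, 0.8219, 0.34], "Phase(waves)")` yields the first sample record above, apart from the spacing around `#`.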
step 5) completing the information of the initial list of the fields, which comprises the following steps:
step 5-1) using Wikidata, Baidu Encyclopedia and other data sources to match the technical nouns in the field initial list and collect relevant semantic information, such as Chinese names, explanations and category information, and constructing a detailed information record for each technical noun to form a field detailed list, for example:
"Edge computing, die casting, distributed computing, parallel computing, programmable mapping, mapping, Edge computing, means an open platform that integrates network, computing, storage, and application core capabilities near the object or data source to provide the nearest service, Edge computing to the nearest computing, storage, and application core capabilities"
Step 6) carrying out secondary cleaning on the detailed field list, which comprises the following specific steps:
step 6-1) taking 100 emerging technical nouns from the existing word list, randomly extracting 200 entries from Wikipedia non-emerging-technology categories, constructing detailed list records for them according to the method of step 5), and labeling each record according to whether it is a technology, to form a binary classification training corpus;
step 6-2) training on the binary classification training corpus with the fasttext classification algorithm to obtain a technical binary classification model;
step 6-3) judging each record of the field detailed list with the technical binary classification model to obtain a field technical list;
step 6-4) removing, according to the category information in the detailed list, entries whose categories are obviously not technical, such as countries, organizations, persons and music albums, to obtain the final field technical list;
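The secondary cleaning of step 6) combines a classifier decision with a category rule. In the sketch below, the trained fastText binary model is replaced by an arbitrary callable `is_technology`, and the blocked category names, the record layout and the function names are assumptions made for illustration.

```python
# Category names treated as obviously non-technical (step 6-4); illustrative only.
NON_TECH_CATEGORIES = frozenset({"country", "organization", "person", "music album"})


def rule_filter(records, blocked=NON_TECH_CATEGORIES):
    """Drop records that carry any blocked category."""
    kept = []
    for record in records:
        categories = {c.lower() for c in record.get("categories", [])}
        if categories & blocked:
            continue
        kept.append(record)
    return kept


def secondary_clean(records, is_technology, blocked=NON_TECH_CATEGORIES):
    """Keep records judged technical by the classifier, then apply the rule filter."""
    judged = [record for record in records if is_technology(record)]
    return rule_filter(judged, blocked)
```

Running the classifier first and the rule second mirrors the order of steps 6-3) and 6-4).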
Example 2
As shown in fig. 2, based on the above method, embodiment 2 of the present invention provides a domain-oriented evaluation prediction technology list generation system, which comprises four trained models, a technical noun recognition and extraction module, a domain modeling module, an index evaluation module, a ranking module, a semantic completion module and a secondary cleaning module; wherein the models include a technical noun recognition model, a domain classification model, a ranking model and a technical binary classification model.
The technical noun identification and extraction module is used for extracting and identifying technical nouns of massive scientific and technological information texts to obtain a mapping corpus;
the domain modeling module is used for classifying the mapping corpus by adopting a pre-trained domain classification model to obtain a domain mapping corpus;
the index evaluation module is used for carrying out word frequency aggregation statistics on the technical nouns in the domain mapping corpus, extracting a number of top-ranked technical nouns to obtain a domain high-frequency technical noun word list, and calculating an emerging degree index and a maturity index for each technical noun in the domain high-frequency technical noun word list to obtain a domain initial selection list;
the ranking module is used for ranking the domain initial selection list with a pre-trained ranking model and extracting a number of top-ranked entries to obtain a domain initial list;
the semantic completion module is used for completing the information of the domain initial list based on the open source knowledge base to obtain a domain detailed list;
and the secondary cleaning module is used for inputting the domain detailed list into a pre-trained technical binary classification model for technical judgment, and further filtering by a rule matching method to obtain a domain technical list.
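The six modules form a linear pipeline in which each module consumes the output of the previous one. One possible way to wire them together is sketched below, with each module reduced to a plain callable; the stage names and the `run_pipeline` helper are hypothetical and only illustrate the data flow.

```python
from typing import Any, Callable, List, Sequence, Tuple

Stage = Callable[[Any], Any]


def run_pipeline(data: Any, stages: Sequence[Tuple[str, Stage]]) -> Tuple[Any, List[str]]:
    """Apply the stages in order and record which ones ran."""
    trace = []
    for name, stage in stages:
        data = stage(data)
        trace.append(name)
    return data, trace


# Order of the modules in embodiment 2; real implementations would replace
# the identity placeholders.
MODULE_ORDER = [
    "noun_extraction", "domain_modeling", "index_evaluation",
    "ranking", "semantic_completion", "secondary_cleaning",
]
identity_stages = [(name, lambda x: x) for name in MODULE_ORDER]
```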
Finally, it should be noted that the above embodiments are only intended to illustrate the technical solutions of the present invention and are not limiting. Although the present invention has been described in detail with reference to the embodiments, it will be understood by those skilled in the art that various changes may be made and equivalents may be substituted without departing from the spirit and scope of the invention as defined in the appended claims.

Claims (9)

1. A domain-oriented evaluation prediction technology list generation method comprises the following steps:
step 1) extracting and identifying technical nouns of a mass scientific and technical information text to obtain a mapping corpus;
step 2) classifying the mapping corpus by adopting a pre-trained domain classification model to obtain a domain mapping corpus;
step 3) carrying out word frequency aggregation statistics on the technical nouns in the domain mapping corpus, and extracting a number of top-ranked technical nouns to obtain a domain high-frequency technical noun word list;
step 4) calculating an emerging degree index and a maturity index for each technical noun in the domain high-frequency technical noun word list to obtain a domain initial selection list;
step 5) ranking the domain initial selection list with a pre-trained ranking model, and extracting a number of top-ranked entries to obtain a domain initial list;
step 6) performing information completion on the domain initial list based on the open source knowledge base to obtain a domain detailed list;
step 7) inputting the domain detailed list into a pre-trained technical binary classification model for technical judgment, and further filtering by a rule matching method to obtain a domain technical list.
2. The domain-oriented evaluation prediction technology list generation method according to claim 1, wherein the step 1) specifically comprises:
step 1-1) performing noun phrase recognition on the massive scientific and technical information text, then performing entity linking with the Tagme algorithm to identify the normalized entity word set in the scientific and technical information text, and mapping the entity word set to the text;
step 1-2) matching and extracting from the massive scientific and technical information text according to a pre-accumulated technical word list to obtain a technical word set, and mapping the technical word set to the text;
step 1-3) identifying technical nouns in the scientific and technical information text with a pre-trained technical noun recognition model to obtain a technical noun set, and mapping the technical noun set to the text;
step 1-4) assigning different word frequency weights to the entity word set, the technical word set and the technical noun set according to their credibility to obtain the mapping corpus.
3. The domain-oriented evaluation prediction technology list generation method according to claim 2, wherein the step 4) specifically comprises:
step 4-1) for each technical noun $w$ in the domain high-frequency technical noun word list, counting its annual word frequencies over the last 10 years in the domain mapping corpus, $Count_w = [c_1, c_2, \ldots, c_i, \ldots, c_{10}]$, where $c_i$ is the annual word frequency of the $i$-th year; the sequence $Exist_w$ of whether the technical noun appears is calculated by the following formula:
$$Exist_w = [e_1, e_2, \ldots, e_i, \ldots, e_{10}], \quad e_i = \begin{cases} 1, & c_i > 0 \\ 0, & \text{otherwise} \end{cases}$$
wherein $e_i$ represents whether the technical noun appears in the $i$-th year: 1 if it appears, otherwise 0;
calculating the total word frequency $Count_{w\_base}$ of the technical noun in the base period by the following formula:
$$Count_{w\_base} = \sum_{i=1}^{3} c_i$$
calculating the total word frequency $Count_{w\_active}$ of the technical noun in the active period by the following formula:
$$Count_{w\_active} = \sum_{i=4}^{10} c_i$$
when the technical noun appears in $Exist_w$ more than 3 times and in $Count_w$ more than 7 times, and $Count_{w\_active}/Count_{w\_base} > 2$ and $Count_{w\_active}/\sum Count_w < 0.15$, the emerging value $Escore_w$ of the technical noun is calculated by the following formula; otherwise, the emerging value $Escore_w$ of the technical noun is negative infinity:
$$Escore_w = 2 \cdot APT + (RT + MYS)$$
wherein APT is the active period trend of the technical noun:
Figure FDA0002827666670000021
the recent trend RT of the technical noun is calculated by the following formula:
Figure FDA0002827666670000022
when $c_7 - c_4 = 0$, let $c_7 - c_4 = 1$; the medium-to-near-term rate of change MYS of the technical noun is calculated by the following formula:
Figure FDA0002827666670000023
step 4-2) calculating the annual growth sequence $Rate_w$ of the technical noun by the following formula:
$$Rate_w = [r_1, r_2, \ldots, r_i, \ldots, r_9], \quad r_i = \begin{cases} 1, & c_{i+1} - c_i > 0 \\ -1, & \text{otherwise} \end{cases}$$
wherein $r_i$ represents whether the word frequency grows from the $i$-th year to the $(i+1)$-th year: 1 if it grows, otherwise -1;
the maturity value $Maturity_w$ of the technical noun is calculated by the following formula:
$$Maturity_w = \begin{cases} \left[5 + \sum_i r_i / 3\right] \in \{6, 7, 8\}, & \sum_i r_i > 2 \\ \left[10 - \delta / \sum_i c_i\right] \in \{9, 10\}, & \sum_i r_i \le 2,\ \sum_i c_i \ge \delta \\ \left[0.5 + 5 \left(\sum_i c_i\right) / \delta\right] \in \{1, 2, \ldots, 5\}, & \sum_i r_i \le 2,\ \sum_i c_i < \delta \end{cases}$$
wherein $\delta$ is an empirical threshold whose value is one percent of the total number of texts in the domain over the last decade;
step 4-3) combining the emerging value $Escore_w$ and the maturity value $Maturity_w$ of the technical noun with the index values of the total word frequency, the word frequency of the last three years, the ratio of the last-three-year word frequency to the total word frequency, and the ratio of the total word frequency to the domain word frequency, to obtain the index values of the technical noun $w$:
$$Indicator_w = [Escore_w, Maturity_w, AllCount_w, \ldots]$$
and further obtaining the domain initial selection list corresponding to the domain high-frequency technical noun word list.
4. The domain-oriented evaluation prediction technology list generation method according to claim 3, wherein the step 6) is specifically:
based on an open source knowledge base, matching the technical nouns of the domain initial list against the semantic information of the knowledge base, extracting Chinese names, English names, explanation information and category information, and completing the information of the technical words to obtain the domain detailed list.
5. The method for generating a domain-oriented evaluation prediction technology list according to claim 2, further comprising a training step of a technical noun recognition model, specifically:
extracting a sentence set containing technical words in a technical word list from a massive technical information text according to a pre-accumulated technical word list;
performing sequence labeling on the sentence set with [B-tech, I-tech, O] labels, wherein B-tech represents the first character of a technical word, I-tech represents the other, non-first characters of a technical word, and O represents characters or punctuation marks that do not belong to a technical word;
and training by adopting a Bi-LSTM-CRF algorithm to obtain a technical noun recognition model.
6. The domain-oriented evaluation prediction technology list generation method according to claim 1, further comprising a training step of a domain classification model, specifically:
randomly sampling the existing mapping corpus;
having a domain expert extract keywords characteristic of the domain, and retrieving the randomly sampled mapping corpus with these keywords to obtain a training set, thereby constructing a domain training corpus;
having a domain expert manually label whether each information text in the domain training corpus is a domain information text, thereby constructing a classification training corpus;
and training by adopting a fasttext classification algorithm to obtain a field classification model.
7. The domain-oriented evaluation prediction technology list generation method according to claim 1, further comprising a training step of a ranking model, specifically:
randomly sampling an existing field primary selection list to obtain a training set;
having a domain expert score the ranking training set according to relevance and importance, replacing any emerging degree index of negative infinity with a large negative empirical value, obtaining the ranking result by weighted average, and constructing a ranking training corpus;
and training by adopting an SVMrank ordering algorithm to obtain an ordering model.
8. The domain-oriented evaluation prediction technology list generation method according to claim 6, further comprising a training step of the technical binary classification model, specifically:
constructing a noun list containing partial technical nouns based on the open source knowledge base;
labeling each record of the existing detailed list according to whether it is a technical noun, to serve as the training corpus;
and training by adopting a fasttext classification algorithm to obtain a technical two-classification model.
9. A domain-oriented evaluation prediction technology list generation system, the system comprising: a trained technical noun recognition model, a domain classification model, a ranking model, a technical binary classification model, a technical noun recognition and extraction module, a domain modeling module, an index evaluation module, a ranking module, a semantic completion module and a secondary cleaning module; wherein
the technical noun identification and extraction module is used for extracting and identifying technical nouns of massive scientific and technological information texts to obtain a mapping corpus;
the domain modeling module is used for classifying the mapping corpus by adopting a pre-trained domain classification model to obtain a domain mapping corpus;
the index evaluation module is used for carrying out word frequency aggregation statistics on the technical nouns in the domain mapping corpus, extracting a number of top-ranked technical nouns to obtain a domain high-frequency technical noun word list, and calculating an emerging degree index and a maturity index for each technical noun in the domain high-frequency technical noun word list to obtain a domain initial selection list;
the ranking module is used for ranking the domain initial selection list with a pre-trained ranking model and extracting a number of top-ranked entries to obtain a domain initial list;
the semantic completion module is used for completing the information of the domain initial list based on the open source knowledge base to obtain a domain detailed list;
and the secondary cleaning module is used for inputting the domain detailed list into a pre-trained technical binary classification model for technical judgment, and further filtering by a rule matching method to obtain a domain technical list.
CN202011434352.XA 2020-12-10 2020-12-10 Technical list generation method and system for field evaluation prediction Pending CN112463928A (en)

Publications (1)

Publication Number Publication Date
CN112463928A true CN112463928A (en) 2021-03-09

Family

ID=74801108

Country Status (1)

Country Link
CN (1) CN112463928A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20210622

Address after: No.26 Fucheng Road, Haidian District, Beijing 100142

Applicant after: MILITARY SCIENCE INFORMATION RESEARCH CENTER OF MILITARY ACADEMY OF THE CHINESE PLA

Address before: No.26 Fucheng Road, Haidian District, Beijing 100142

Applicant before: Mao Bin

Applicant before: MILITARY SCIENCE INFORMATION RESEARCH CENTER OF MILITARY ACADEMY OF THE CHINESE PLA
