CN103631859A - Intelligent review expert recommending method for science and technology projects - Google Patents


Info

Publication number
CN103631859A
CN103631859A (application CN201310509358.2A)
Authority
CN
China
Prior art keywords
word
science
feature
node
expert
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201310509358.2A
Other languages
Chinese (zh)
Other versions
CN103631859B (en)
Inventor
徐小良
吴仁克
林建海
陈秋
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou Dianzi University
Original Assignee
Hangzhou Dianzi University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hangzhou Dianzi University filed Critical Hangzhou Dianzi University
Priority to CN201310509358.2A priority Critical patent/CN103631859B/en
Publication of CN103631859A publication Critical patent/CN103631859A/en
Application granted granted Critical
Publication of CN103631859B publication Critical patent/CN103631859B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval of unstructured textual data
    • G06F16/31 Indexing; Data structures therefor; Storage structures
    • G06F16/316 Indexing structures
    • G06F16/328 Management therefor

Abstract

The invention provides an intelligent review expert recommendation method for science and technology projects. The method includes the following steps: (1) the main texts of the science and technology projects to be reviewed and of the expert information are segmented into substring sequences, the substring sequences are segmented into words with the Chinese Academy of Sciences ICTCLAS tool, and stop-word filtering is applied to the segmentation result to obtain a word set; (2) a term network of the project information is built and feature words are extracted on the basis of statistical characteristics and aggregation characteristics, while the expert information, being comparatively concise, uses the word set obtained in step (1) directly as its feature words; (3) a knowledge representation model is built on the basis of the fields and weights of the feature words, and the corresponding information index is built; (4) when experts are recommended for a group of projects, feature merging operations between fields and between projects are applied to the knowledge representation model; (5) the similarity between the experts and the science and technology project or project group to be reviewed is computed on the basis of semantics, a truncation threshold is applied, and the final list of recommended experts is generated. The method greatly alleviates the heavy workload of manual recommendation and the lack of scientific grounding in review decisions.

Description

Intelligent recommendation method of review experts for science and technology projects
Technical field
The invention belongs to the field of expert recommendation technology, and in particular relates to a network-service-based intelligent recommendation method of review experts for science and technology projects; it is an intelligent method for assisting funding decisions on science and technology projects.
Background art
With the rapid spread of science and technology project management systems across Chinese government departments, project review has developed from the earlier centralized conference model into the current online model, removing the regional restriction on review experts. Review experts appraise a project application according to their domain knowledge and the funding criteria of the funding agency, and the agency decides whether to fund the project according to the experts' appraisals.
At present, review experts for science and technology projects are mostly recommended according to the project manager's subjective judgment, and a project under review often needs several experts. Manual recommendation therefore inevitably suffers from low efficiency, heavy workload and a lack of scientific grounding, and the selected experts are not necessarily the most suitable ones. Research on the intelligent recommendation of review experts is thus crucial: it can effectively alleviate the mismatch between experts and the contents of the projects they review and greatly improve the public-service capability of project review.
Current intelligent recommendation techniques, such as collaborative filtering and content-based recommendation, are mostly applied to film or product recommendation websites and have rarely been studied or applied to review-expert databases for science and technology projects. Owing to the restrictions of this specific domain, recommending experts for science and technology projects differs from general recommendation: first, a project management system covers all trades and professions, so the domain knowledge involved is very complex; second, the recommendation concerns the funding of projects, so the requirements on the objectivity, fairness and accuracy of the recommendation are very high. In this respect China still lacks systematic methodological guidance and mature technical support. Because the information texts are semi-structured, the expert information can be matched against the information of the projects under review; the invention makes full use of structural features and word semantics to compute the similarity between project information and expert information. A high similarity indicates that the expert is familiar with the project, and a list of recommended experts is produced for the review. The invention also provides a decision support system (DSS) for recommending review experts for science and technology projects: review experts are assigned to projects whose domain knowledge matches theirs, which helps the decision-making user reach scientific decisions, improves decision-making level and quality, and makes the review more scientific and objective.
Summary of the invention
In view of the deficiencies of the prior art, the invention provides an intelligent recommendation method of review experts for science and technology projects.
The recommendation process for review experts of science and technology projects according to the invention comprises the following steps:
Step 1. The general terms and habitual words appearing in project and expert information are taken as a professional stop-word dictionary; punctuation marks and non-Chinese characters are taken as the cutting-mark library.
Step 2. The project information and the expert information are segmented: according to the cutting marks, fields of the project information such as the project name, main research contents and technical indicators are cut into substring sequences; likewise, fields of the review expert's information such as undertaken projects and achievements, awards, inventions, published papers and research directions are extracted and cut into substring sequences, one substring sequence corresponding to one field; the substring sequences are then segmented into words with the Chinese Academy of Sciences ICTCLAS tool.
Step 3. Feature word extraction for science and technology projects: stop-word filtering is applied to the segmentation result with a general stop-word dictionary and the professional stop-word dictionary; the general stop-word dictionary adopts the Harbin Institute of Technology stop-word list, and the segmentation result with stop words removed forms a word set.
The construction of the professional stop-word dictionary is a continuously improving self-learning process: word frequencies are accumulated during segmentation, and a word whose probability of occurrence in the texts exceeds a certain threshold is added to the stop-word dictionary.
Because the amount of project information is large, the semantic similarity between the words of the word set is computed, a term network is built from the semantic relations and co-occurrence relations of the words, and the aggregation feature value of each word in the network is computed; combined with the statistical features of the words, the key degree of each word is then computed and the project feature words are extracted. The feature words of a project thus combine the statistical feature information and the semantic feature information of the text, which makes the extraction more accurate.
The semantic similarity is computed as follows:
In the HowNet semantic dictionary, suppose word W1 has n concepts S11, S12, ..., S1n and word W2 has m concepts S21, S22, ..., S2m. The similarity SimSEM(W1, W2) of W1 and W2 equals the maximum similarity over all concept pairs:
SimSEM(W1, W2) = max_{i=1..n, j=1..m} Sim(S1i, S2j)
Notional words and function words have different description languages, so the similarity between their corresponding sememes or relational sememe descriptions has to be computed. A notional-word concept is described by its first basic sememe, its other basic sememes, its relational sememe description and its relational symbol description, whose similarities are denoted Sim1(p1, p2), Sim2(p1, p2), Sim3(p1, p2) and Sim4(p1, p2). The similarity of two such feature structures finally reduces to the similarity of basic sememes or of concrete words:
Sim(S1, S2) = Σ_{i=1..4} βi·Simi(S1, S2)
where the βi (1 ≤ i ≤ 4) are adjustable parameters satisfying β1 + β2 + β3 + β4 = 1 and β1 ≥ β2 ≥ β3 ≥ β4.
Let CW = {C1, C2, ..., Cm} be the word set obtained after the above processing. Its semantic similarity adjacency matrix Sm is defined by Sm[i][j] = Sim(Ci, Cj), where Sim(Ci, Cj) is the semantic similarity of words Ci and Cj, Sim(Ci, Ci) = 1 and Sim(Ci, Cj) = Sim(Cj, Ci).
The semantic similarity computation over the word set CW = {C1, C2, ..., Cm} therefore yields m·(m+1)/2 similarity values.
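As a minimal illustration of this step, the following Python sketch builds the symmetric similarity matrix Sm for a word set; the callable concept_sim stands in for the HowNet-based sememe similarity described above and is an assumption, not the patented code.
```python
# Minimal sketch, assuming concept_sim(s1, s2) implements the HowNet-based
# sememe similarity Sim(S1i, S2j); it is a stand-in, not the patented code.
from itertools import combinations
from typing import Callable, Dict, List, Sequence

def word_similarity(concepts1: Sequence[str], concepts2: Sequence[str],
                    concept_sim: Callable[[str, str], float]) -> float:
    # SimSEM(W1, W2) = maximum over all concept pairs (S1i, S2j)
    return max(concept_sim(s1, s2) for s1 in concepts1 for s2 in concepts2)

def similarity_matrix(words: List[str],
                      concepts: Dict[str, Sequence[str]],
                      concept_sim: Callable[[str, str], float]) -> List[List[float]]:
    m = len(words)
    S = [[1.0 if i == j else 0.0 for j in range(m)] for i in range(m)]  # Sim(Ci, Ci) = 1
    for i, j in combinations(range(m), 2):              # m*(m-1)/2 distinct word pairs
        s = word_similarity(concepts[words[i]], concepts[words[j]], concept_sim)
        S[i][j] = S[j][i] = s                           # Sim(Ci, Cj) = Sim(Cj, Ci)
    return S
```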
The co-occurrence relation of the words is computed as follows:
The word co-occurrence model is one of the important models in statistical natural language processing. According to this model, if two words frequently co-occur in the same window unit of a document (for example a sentence or a paragraph), the two words are related in meaning and to a certain extent express the semantic information of the text. A moving window of length 3 is used to compute the word co-occurrence degree over the word sequence, as shown in Fig. 1:
First, the word sequence is scanned, spaces and empty tokens are removed and identical words are merged, yielding the word set CW = {C1, C2, ..., Cm}, where m ≤ n and n is the length of the word sequence.
The word co-occurrence degree matrix Cm corresponding to the word set CW is defined by Cm[i][j] = Coo(Ci, Cj); initially Coo(Ci, Cj) = 0 (1 ≤ i, j ≤ m).
The co-occurrence degree is computed over the word sequence with the moving window, whose words are T(i-1) T(i) T(i+1) (1 < i < n):
1) If i = n-1, go to 4); if T(i-1) is a space or empty, the window slides to the next word, i++; otherwise go to 2).
2) If T(i) is Chinese, Coo(T(i-1), T(i))++ and go to 3); if T(i) is empty, go to 3); otherwise go to 1).
3) If T(i+1) is Chinese, Coo(T(i-1), T(i+1))++, i++, and go to 1); otherwise go to 1).
4) If T(n-2) is Chinese, go to 5); otherwise go to 7).
5) If T(n-1) is Chinese, Coo(T(n-2), T(n-1))++ and go to 6); if T(n-1) is a space, go to 6); otherwise end.
6) If T(n) is Chinese, Coo(T(n-2), T(n))++ and end; otherwise end.
7) If T(n-1) is Chinese and T(n) is also Chinese, Coo(T(n-1), T(n))++ and end; otherwise end.
After these steps the co-occurrence degree matrix Cm is obtained, and each element of Cm is normalized by dividing it by the maximum element of the matrix, i.e. by max{Coo(Ci, Cj) | 1 ≤ i, j ≤ m}.
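A simplified Python sketch of the window-3 co-occurrence counting follows; the Chinese/space tests are reduced to simple predicates and all Chinese pairs inside the window are counted, so it approximates rather than reproduces the step list above.
```python
# Simplified sketch of the moving-window (length 3) co-occurrence counting;
# counts are normalised by the matrix maximum, as described in the text.
import re
from collections import defaultdict

CJK = re.compile(r'[\u4e00-\u9fff]')

def is_chinese(tok: str) -> bool:
    return bool(tok) and bool(CJK.search(tok))

def cooccurrence(tokens):
    coo = defaultdict(float)
    n = len(tokens)
    for i in range(1, n - 1):                   # window T(i-1) T(i) T(i+1)
        a, b, c = tokens[i - 1], tokens[i], tokens[i + 1]
        if not is_chinese(a):
            continue
        if is_chinese(b):
            coo[(a, b)] += 1                    # adjacent pair
        if is_chinese(c):
            coo[(a, c)] += 1                    # pair bridging the middle token
    if coo:
        mx = max(coo.values())                  # normalise by max{Coo(Ci, Cj)}
        coo = {pair: v / mx for pair, v in coo.items()}
    return coo
```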
The term network is built as follows:
To build the weighted term network, its weight matrix Wm is obtained first. It is defined as a weighted combination of the co-occurrence degree matrix and the semantic similarity matrix, Wm = α·Cm + β·Sm, where α is 0.3 and β is 0.7; this strengthens the semantic relations between words and weakens their co-occurrence relations.
Wm serves as the adjacency matrix of the term network. The corresponding network graph is defined as G = {V, E}, where G is an undirected weighted graph, V denotes the vertex set of G, E denotes the edge set of G, and v_i denotes the i-th vertex (word) in V.
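A minimal sketch of forming the weight matrix follows; the combination Wm = α·Cm + β·Sm is a reconstruction inferred from the stated parameters, since the exact formula appears only as an image in the original publication.
```python
# Sketch of the weighted term-network adjacency matrix; Wm = alpha*Cm + beta*Sm
# is inferred from the surrounding text, not copied from the patent drawings.
import numpy as np

def term_network_weights(Cm: np.ndarray, Sm: np.ndarray,
                         alpha: float = 0.3, beta: float = 0.7) -> np.ndarray:
    W = alpha * Cm + beta * Sm      # emphasise semantic relations over co-occurrence
    np.fill_diagonal(W, 0.0)        # the undirected graph G = {V, E} has no self-loops
    return W
```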
The word aggregation feature value is computed as follows:
The important characteristics of a term network are its degree distribution, average shortest path, clustering degree and clustering coefficient. The degree of a node reflects how the node is connected to other nodes; the clustering degree and clustering coefficient of a node reflect how densely the nodes in its neighbourhood are interconnected, and together with the degree they reflect the importance of the node within its local range. The invention computes the aggregation feature value of a node from its weighted degree, its clustering coefficient and its betweenness, so that important words receive higher weights while words related to many important words also score highly.
In the semantic similarity network graph, the unordered pair (v_i, v_j) denotes the edge between nodes v_i and v_j. The weighted degree of node v_i is defined as
WD_i = ( Σ_{j=1..n} w_ij ) / n
where w_ij is the weight of the edge between v_i and v_j and n is the total number of nodes.
The unweighted degree of node v_i is D_i = |{(v_i, v_j) : (v_i, v_j) ∈ E, v_i, v_j ∈ V}|. The clustering degree T_i of node v_i is the number of edges actually existing between its neighbours: T_i = |{(v_j, v_k) : (v_i, v_j) ∈ E, (v_i, v_k) ∈ E, (v_j, v_k) ∈ E}|. The clustering coefficient C_i of node v_i is defined as
C_i = T_i / C(D_i, 2) = 2·T_i / ( D_i·(D_i - 1) )
In the semantic similarity network graph, the betweenness B_i of node v_i measures the probability that a shortest path between two other nodes w and x passes through v_i. The communication between two non-adjacent nodes depends on the nodes lying on the shortest paths connecting them, and such nodes potentially control the information flow between the pair; B_i reflects the connecting role of node v_i in its local environment and is defined as
B_i = Σ_{w,x ∈ G} d(w, x; v_i) / d(w, x)
where d(w, x) is the number of shortest paths between two nodes w and x, and d(w, x; v_i) is the number of those shortest paths that pass through v_i.
The aggregation feature value Z_i of node v_i is a weighted combination of its weighted degree, clustering coefficient and betweenness:
Z_i = a·WD_i + b·C_i / Σ_{j=1..n} C_j + c·B_i
where a + b + c = 1.
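A sketch of the aggregation score using networkx for the clustering coefficient and betweenness follows; the mixing weights a, b, c are illustrative values (only a + b + c = 1 is given above), and edge weights are ignored in the shortest-path counts of this sketch.
```python
# Sketch of Z_i = a*WD_i + b*C_i/sum(C_j) + c*B_i over the weighted term network.
import networkx as nx
import numpy as np

def aggregation_scores(W: np.ndarray, a: float = 0.4, b: float = 0.3,
                       c: float = 0.3) -> np.ndarray:
    n = W.shape[0]
    G = nx.from_numpy_array(W)                  # undirected weighted graph
    WD = W.sum(axis=1) / n                      # weighted degree WD_i
    clus = nx.clustering(G)                     # clustering coefficient C_i
    C = np.array([clus[i] for i in range(n)])
    btw = nx.betweenness_centrality(G)          # node betweenness B_i
    B = np.array([btw[i] for i in range(n)])
    C_norm = C / C.sum() if C.sum() > 0 else C
    return a * WD + b * C_norm + c * B
```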
The statistical feature of a word is computed as follows:
A nonlinear function is used to normalize the word frequency. The word frequency weight TF_i of word W_i in the text is defined as
TF_i = f(W_i) / Σ_{j=1..n} f(p_j)
where TF_i denotes the word frequency weight of W_i, p_j denotes a word of the text, and f is the word-frequency counting function.
In Chinese text the words able to mark the characteristics of a text are generally notional words such as nouns, verbs and adjectives, whereas function words such as interjections, prepositions and conjunctions contribute almost nothing to determining the text category and would introduce considerable noise if extracted as feature words. The part-of-speech weight pos_i of word W_i in the text is therefore defined by a table that assigns higher values to notional parts of speech and lower values to function words (the table is given only as an image in the original publication).
Longer words convey more concrete information, whereas the meaning expressed by shorter words is usually more abstract. In particular, the feature words of a document are mostly combinations of specialized academic terms; the longer they are, the more definite their meaning and the better they reflect the topic of the text. Increasing the weight of long words therefore helps to reflect more accurately the importance of a word in the document. The word-length weight len_i of word W_i in the text is defined as a function increasing with the length of the word (the definition is given only as an image in the original publication).
For each word in the word sequence, its statistical feature is
stats_i = A·TF_i + B·pos_i + C·len_i
where A + B + C = 1.
The key degree of word W_i is computed as follows:
For each node of the weighted term network, its key degree Imp_i is defined as
Imp_i = β·stats_i + (1 - β)·Z_i
where 0 < β < 1.
The computed key degrees are sorted in descending order, a threshold γ (0 < γ < 1) is set, and the first q values are taken; the corresponding words are used as the feature words of the science and technology project, since they fully reflect its topic and are the important words.
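A sketch of combining the statistical weight with the aggregation score and selecting the feature words follows; the part-of-speech and word-length tables and the reading of γ as a top fraction are illustrative assumptions, since the original definitions appear only as images.
```python
# Sketch of stats_i = A*TF_i + B*pos_i + C*len_i and Imp_i = beta*stats_i + (1-beta)*Z_i;
# pos_weight, len_weight and the top-fraction use of gamma are assumptions.
import numpy as np

def pos_weight(tag: str) -> float:
    return {"n": 1.0, "v": 0.8, "a": 0.6}.get(tag, 0.2)     # assumed table

def len_weight(word: str) -> float:
    return min(len(word), 4) / 4.0                          # assumed scaling

def key_degree(tf, tags, words, Z, A=0.5, B=0.3, C=0.2, beta=0.6):
    stats = (A * np.asarray(tf)
             + B * np.array([pos_weight(t) for t in tags])
             + C * np.array([len_weight(w) for w in words]))
    return beta * stats + (1.0 - beta) * np.asarray(Z)      # Imp_i

def select_feature_words(words, imp, gamma=0.3):
    order = np.argsort(-np.asarray(imp))                    # descending key degree
    q = max(1, int(gamma * len(words)))                     # keep the first q words
    return [words[i] for i in order[:q]]
```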
Step 4. Review-expert feature word extraction: the amount of expert information is small compared with project information, so the project technique, which builds a network and extracts feature words from statistical and semantic features, is not suitable for expert information. Stop-word filtering is performed directly with the general stop-word dictionary and the professional stop-word dictionary, and the feature word set of each expert is extracted. The general stop-word dictionary is again the Harbin Institute of Technology stop-word list, and the professional stop-word dictionary has to be maintained continuously.
Step 5. Per-field knowledge representation models of science and technology projects and review experts are built. By extending the vector space model and the matter-element knowledge set model, a text representation model PRO = (id, F, WF, T, V) is established according to the different field information of a project, where id denotes the identification field in the project library, F denotes the set of field categories of the project, WF is the set of field weights, T is the set of feature words, and V denotes the feature words of each field with their weights, V_i = {v_i1, f(v_i1), v_i2, f(v_i2), ..., v_in, f(v_in)}, where v_ij denotes the j-th feature word of the i-th field and f(v_ij) denotes the frequency of v_ij. Similarly, a knowledge representation model TM = (id, F, WF, T, V) is established according to the different field information of an expert, where id denotes the identification field in the expert library, F denotes the set of field categories of the expert, WF is the set of field weights, T is the set of feature words, and V denotes the feature words of each field with their weights, f(v_ij) being the frequency of occurrence of feature word v_ij in the corresponding field. (The matrix-style representations of the project and expert knowledge are given only as images in the original publication.)
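A data-structure sketch of the per-field model (id, F, WF, T, V) follows; the class and attribute names are illustrative, not taken from the patent.
```python
# Sketch of the knowledge representation model used for both projects (PRO)
# and experts (TM); names are illustrative.
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class FieldVector:
    weight: float                                           # WF: weight of this field
    words: Dict[str, float] = field(default_factory=dict)   # v_ij -> f(v_ij)

@dataclass
class KnowledgeModel:
    id: str                                                  # identification field in the library
    fields: Dict[str, FieldVector] = field(default_factory=dict)  # F -> (WF, V)

    def feature_words(self) -> List[str]:                    # T: all feature words
        return [w for fv in self.fields.values() for w in fv.words]
```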
Step 5. Construction of the review-expert information index library: after the expert knowledge representation models have been built, the information indexes are stored: first, the content information of a review expert is read from the expert database; a word semantic network is built on the segmentation result and the feature words of the expert are extracted; an index is built for the expert with Apache Lucene according to the knowledge representation model; the index is added to the index library of the corresponding category, and this is repeated until all review experts have been indexed.
Step 6. According to the number of projects, the recommendation mode is divided into expert recommendation for a single project under review and expert recommendation for a group of (several) projects under review. For group recommendation, the feature merging operations between fields and between projects are applied to the project knowledge representation models of step 5; for a single project, only the merging between fields is applied. At the same time, the merging between fields is applied to the expert knowledge representation models of step 5. Indexes are then built on the merged feature information with Apache Lucene according to the knowledge representation model. The project index is built at recommendation time.
In a project application management system, projects under review frequently need group recommendation. The feature merging operations defined below preserve the different field weights set in the knowledge representation models of step 5, so that the fields keep their different contributions to the similarity computation and hence to the recommendation.
The feature merging of projects under review and of review experts, denoted by the operator ⊕, is carried out as follows:
(1) Merging the features between the fields of one project under review or one review expert
Suppose the field feature word sets W'1 and W'2 are to be merged. The merge rule W'1 ⊕ W'2 is defined as
W'1 ⊕ W'2 = { ∀i, j: { word_1i, ( f(word_1i) + f(word_2j) ) / 2 } | word_1i = word_2j }
where word_1i and word_2j are feature words.
This definition is improved and extended by adding the field weights; the features between fields of a review expert or of a science and technology project are merged according to the rule
W'1 ⊕ W'2 = { ∀i, j: { word_1i, ( w1·f(word_1i) + w2·f(word_2j) ) / ( w1² + w2² ) } | word_1i = word_2j }
(2) Merging the features between the projects of a group of projects under review
This merging operation is applied only to the feature vectors of the projects under review, not to the expert feature vectors; an expert feature vector only undergoes the merging between fields. Let V(d1) and V(d2) be the vector models of two projects after their fields have been merged. For any t_1i ∈ V(d1) and t_2j ∈ V(d2) with t_1i identical to t_2j, the merge V(d1) ⊕ V(d2) is defined as
V(d1) ⊕ V(d2) = { ⟨ t_k, w_k(p) = ( w_i(d1) + w_j(d2) ) / 2 ⟩ }
where k = 1, ..., n, t_k is a feature term and w_k(p) is the weight of t_k.
The knowledge representation model of a group of science and technology projects is produced as follows (a sketch of both merge operations is given after this list):
a) The fields of each project are merged, giving the vector model V(d) of each project;
b) The merge strategy above is applied to the set of all project vector models, yielding the vector-space knowledge representation model of the group:
V(p) = { ⟨t_1, w_1(p)⟩, ⟨t_2, w_2(p)⟩, ..., ⟨t_n, w_n(p)⟩ }
where k = 1, ..., n, t_k is a feature word term of the project group and w_k(p) is the weight of t_k.
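The following Python sketch illustrates the two merge operations; how words present on only one side are treated is not spelled out in the patent, so carrying them over unchanged is an assumption.
```python
# Sketch of the field-to-field merge (weighted by the field weights w1, w2)
# and of the project-to-project merge inside a group; unmatched words are
# simply carried over, which is an assumption.
from typing import Dict

def merge_fields(v1: Dict[str, float], w1: float,
                 v2: Dict[str, float], w2: float) -> Dict[str, float]:
    out = dict(v1)
    norm = w1 * w1 + w2 * w2
    for word, f2 in v2.items():
        if word in out:
            out[word] = (w1 * out[word] + w2 * f2) / norm   # weighted merge rule
        else:
            out[word] = f2                                  # assumption: keep unmatched words
    return out

def merge_projects(v1: Dict[str, float], v2: Dict[str, float]) -> Dict[str, float]:
    out = dict(v1)
    for word, f2 in v2.items():
        out[word] = (out.get(word, f2) + f2) / 2.0          # average matched weights
    return out
```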
Step 7. After the merging between fields of step 6 has been applied to the knowledge representation models of the review experts and of the science and technology projects, suppose the information vector of a review expert is P = {s_1, f(s_1), s_2, f(s_2), ..., s_n, f(s_n)} and the information vector of a project (group) is Q = {t_1, f(t_1), t_2, f(t_2), ..., t_n, f(t_n)}. The semantic similarity between the vector of the project (group) under review and the vector of the review expert is computed with a maximum matching algorithm.
Step 8. A similarity truncation threshold is set, a recommendation index is generated according to the magnitude of the similarity, and the final list of recommended review experts is produced.
The beneficial effects of the invention are as follows:
Review experts for science and technology projects can be recommended more conveniently, intelligently and accurately; the workload of assigning review experts to projects in a project application management system can be greatly reduced, lowering management costs; a high domain match between review experts and the projects under review can be guaranteed, so that the review is objective, fair and scientific; and automatic, efficient and impartial decision support is provided, avoiding improper review problems such as personal-connection networks and the 'Matthew effect'.
Description of the drawings
Fig. 1 shows the moving window used for computing the word co-occurrence degree in the invention.
Fig. 2 is a schematic diagram of the bipartite-graph-based maximum matching algorithm of the invention.
Fig. 3 is a flow chart of the intelligent recommendation method of review experts for science and technology projects of the invention.
Fig. 4 is a flow chart of the extraction of feature words from project information and review-expert information in the invention.
Fig. 5 is a flow chart of the construction of the review-expert knowledge index library in the invention.
Embodiment
The invention is further described below with reference to the accompanying drawings. It should be emphasized that the following description is only exemplary and is not intended to limit the scope or application of the invention. The specific embodiments of the invention are described in further detail below; all other embodiments obtained by those of ordinary skill in the art on the basis of these embodiments without creative work belong to the protection scope of the invention.
As shown in Fig. 3, the main idea of the recommendation method of the invention is: (1) for the expert information and the information of the projects under review in the project application management system, the main texts are cut into substring sequences and segmented with the Chinese Academy of Sciences ICTCLAS tool, and stop-word filtering is applied to the segmentation result to obtain the word sets; (2) project information includes main research contents, technical indicators and other fields and its amount is large, so the invention builds a term network from the semantic relations and co-occurrence relations of the words, computes the node aggregation feature values of the network, computes the key degree of each word together with its statistical features, and thereby extracts the feature words of each project; (3) expert information is more concise than project information and its amount is small, so the word set obtained after filtering each expert's information is used directly as its feature words; (4) field weights are set according to the different importance of the fields of project and expert information, the knowledge representation models of projects and experts are built from the feature words obtained in (2) and (3), and the expert index library is built; (5) for group recommendation the merging operations between fields and between projects are applied to the project knowledge representation models, while for single-project recommendation only the merging between fields is applied; the expert knowledge representation models undergo the merging between fields at the same time; (6) taking into account the fuzzy semantic matching of words, the similarity between the expert information and the information of the projects under review is computed, a truncation threshold is set, and the final list of recommended experts is generated.
Step 1. The general terms and habitual words appearing in project and expert information are taken as a professional stop-word dictionary; punctuation marks and non-Chinese characters are taken as the cutting-mark library.
Step 2. The project information and the expert information are segmented: according to the cutting marks, fields of the project information such as the project name, main research contents and technical indicators are cut into substring sequences; likewise, fields of the review expert's information such as undertaken projects and achievements, awards, inventions, published papers and research directions are extracted and cut into substring sequences, one substring sequence corresponding to one field; the substring sequences are then segmented into words with the Chinese Academy of Sciences ICTCLAS tool.
Step 3. Feature word extraction for science and technology projects: stop-word filtering is applied to the segmentation result with a general stop-word dictionary and the professional stop-word dictionary; the general stop-word dictionary adopts the Harbin Institute of Technology stop-word list, and the segmentation result with stop words removed forms a word set (see Fig. 4).
The construction of the professional stop-word dictionary is a continuously improving self-learning process: word frequencies are accumulated during segmentation, and a word whose probability of occurrence in the texts exceeds a certain threshold is added to the stop-word dictionary.
Because the amount of project information is large, the semantic similarity between the words of the word set is computed, a term network is built from the semantic relations and co-occurrence relations of the words, and the aggregation feature value of each word in the network is computed; combined with the statistical features of the words, the key degree of each word is then computed and the project feature words are extracted. The feature words of a project thus combine the statistical feature information and the semantic feature information of the text, which makes the extraction more accurate.
The semantic similarity is computed as follows:
In the HowNet semantic dictionary, suppose word W1 has n concepts S11, S12, ..., S1n and word W2 has m concepts S21, S22, ..., S2m. The similarity SimSEM(W1, W2) of W1 and W2 equals the maximum similarity over all concept pairs:
SimSEM(W1, W2) = max_{i=1..n, j=1..m} Sim(S1i, S2j)
Notional words and function words have different description languages, so the similarity between their corresponding sememes or relational sememe descriptions has to be computed. A notional-word concept is described by its first basic sememe, its other basic sememes, its relational sememe description and its relational symbol description, whose similarities are denoted Sim1(p1, p2), Sim2(p1, p2), Sim3(p1, p2) and Sim4(p1, p2). The similarity of two such feature structures finally reduces to the similarity of basic sememes or of concrete words:
Sim(S1, S2) = Σ_{i=1..4} βi·Simi(S1, S2)
where the βi (1 ≤ i ≤ 4) are adjustable parameters satisfying β1 + β2 + β3 + β4 = 1 and β1 ≥ β2 ≥ β3 ≥ β4.
Let CW = {C1, C2, ..., Cm} be the word set obtained after the above processing. Its semantic similarity adjacency matrix Sm is defined by Sm[i][j] = Sim(Ci, Cj), where Sim(Ci, Cj) is the semantic similarity of words Ci and Cj, Sim(Ci, Ci) = 1 and Sim(Ci, Cj) = Sim(Cj, Ci).
The semantic similarity computation over the word set CW = {C1, C2, ..., Cm} therefore yields m·(m+1)/2 similarity values.
The co-occurrence relation of the words is computed as follows:
The word co-occurrence model is one of the important models in statistical natural language processing. According to this model, if two words frequently co-occur in the same window unit of a document (for example a sentence or a paragraph), the two words are related in meaning and to a certain extent express the semantic information of the text. A moving window of length 3 is used to compute the word co-occurrence degree over the word sequence, as shown in Fig. 1:
First, the word sequence is scanned, spaces and empty tokens are removed and identical words are merged, yielding the word set CW = {C1, C2, ..., Cm}, where m ≤ n and n is the length of the word sequence.
The word co-occurrence degree matrix Cm corresponding to the word set CW is defined by Cm[i][j] = Coo(Ci, Cj); initially Coo(Ci, Cj) = 0 (1 ≤ i, j ≤ m).
The co-occurrence degree is computed over the word sequence with the moving window, whose words are T(i-1) T(i) T(i+1) (1 < i < n):
1) If i = n-1, go to 4); if T(i-1) is a space or empty, the window slides to the next word, i++; otherwise go to 2).
2) If T(i) is Chinese, Coo(T(i-1), T(i))++ and go to 3); if T(i) is empty, go to 3); otherwise go to 1).
3) If T(i+1) is Chinese, Coo(T(i-1), T(i+1))++, i++, and go to 1); otherwise go to 1).
4) If T(n-2) is Chinese, go to 5); otherwise go to 7).
5) If T(n-1) is Chinese, Coo(T(n-2), T(n-1))++ and go to 6); if T(n-1) is a space, go to 6); otherwise end.
6) If T(n) is Chinese, Coo(T(n-2), T(n))++ and end; otherwise end.
7) If T(n-1) is Chinese and T(n) is also Chinese, Coo(T(n-1), T(n))++ and end; otherwise end.
After these steps the co-occurrence degree matrix Cm is obtained, and each element of Cm is normalized by dividing it by the maximum element of the matrix, i.e. by max{Coo(Ci, Cj) | 1 ≤ i, j ≤ m}.
The term network is built as follows:
To build the weighted term network, its weight matrix Wm is obtained first. It is defined as a weighted combination of the co-occurrence degree matrix and the semantic similarity matrix, Wm = α·Cm + β·Sm, where α is 0.3 and β is 0.7; this strengthens the semantic relations between words and weakens their co-occurrence relations.
Wm serves as the adjacency matrix of the term network. The corresponding network graph is defined as G = {V, E}, where G is an undirected weighted graph, V denotes the vertex set of G, E denotes the edge set of G, and v_i denotes the i-th vertex (word) in V.
The word aggregation feature value is computed as follows:
The important characteristics of a term network are its degree distribution, average shortest path, clustering degree and clustering coefficient. The degree of a node reflects how the node is connected to other nodes; the clustering degree and clustering coefficient of a node reflect how densely the nodes in its neighbourhood are interconnected, and together with the degree they reflect the importance of the node within its local range. The invention computes the aggregation feature value of a node from its weighted degree, its clustering coefficient and its betweenness, so that important words receive higher weights while words related to many important words also score highly.
In the semantic similarity network graph, the unordered pair (v_i, v_j) denotes the edge between nodes v_i and v_j. The weighted degree of node v_i is defined as
WD_i = ( Σ_{j=1..n} w_ij ) / n
where w_ij is the weight of the edge between v_i and v_j and n is the total number of nodes.
The unweighted degree of node v_i is D_i = |{(v_i, v_j) : (v_i, v_j) ∈ E, v_i, v_j ∈ V}|. The clustering degree T_i of node v_i is the number of edges actually existing between its neighbours: T_i = |{(v_j, v_k) : (v_i, v_j) ∈ E, (v_i, v_k) ∈ E, (v_j, v_k) ∈ E}|. The clustering coefficient C_i of node v_i is defined as
C_i = T_i / C(D_i, 2) = 2·T_i / ( D_i·(D_i - 1) )
In the semantic similarity network graph, the betweenness B_i of node v_i measures the probability that a shortest path between two other nodes w and x passes through v_i. The communication between two non-adjacent nodes depends on the nodes lying on the shortest paths connecting them, and such nodes potentially control the information flow between the pair; B_i reflects the connecting role of node v_i in its local environment and is defined as
B_i = Σ_{w,x ∈ G} d(w, x; v_i) / d(w, x)
where d(w, x) is the number of shortest paths between two nodes w and x, and d(w, x; v_i) is the number of those shortest paths that pass through v_i.
The aggregation feature value Z_i of node v_i is a weighted combination of its weighted degree, clustering coefficient and betweenness:
Z_i = a·WD_i + b·C_i / Σ_{j=1..n} C_j + c·B_i
where a + b + c = 1.
The statistical feature of a word is computed as follows:
A nonlinear function is used to normalize the word frequency. The word frequency weight TF_i of word W_i in the text is defined as
TF_i = f(W_i) / Σ_{j=1..n} f(p_j)
where TF_i denotes the word frequency weight of W_i, p_j denotes a word of the text, and f is the word-frequency counting function.
In Chinese text the words able to mark the characteristics of a text are generally notional words such as nouns, verbs and adjectives, whereas function words such as interjections, prepositions and conjunctions contribute almost nothing to determining the text category and would introduce considerable noise if extracted as feature words. The part-of-speech weight pos_i of word W_i in the text is therefore defined by a table that assigns higher values to notional parts of speech and lower values to function words (the table is given only as an image in the original publication).
Longer words convey more concrete information, whereas the meaning expressed by shorter words is usually more abstract. In particular, the feature words of a document are mostly combinations of specialized academic terms; the longer they are, the more definite their meaning and the better they reflect the topic of the text. Increasing the weight of long words therefore helps to reflect more accurately the importance of a word in the document. The word-length weight len_i of word W_i in the text is defined as a function increasing with the length of the word (the definition is given only as an image in the original publication).
For each word in the word sequence, its statistical feature is
stats_i = A·TF_i + B·pos_i + C·len_i
where A + B + C = 1.
The key degree of word W_i is computed as follows:
For each node of the weighted term network, its key degree Imp_i is defined as
Imp_i = β·stats_i + (1 - β)·Z_i
where 0 < β < 1.
The computed key degrees are sorted in descending order, a threshold γ (0 < γ < 1) is set, and the first q values are taken; the corresponding words are used as the feature words of the science and technology project, since they fully reflect its topic and are the important words.
Step 4. Review-expert feature word extraction: the amount of expert information is small compared with project information, so the project technique, which builds a network and extracts feature words from statistical and semantic features, is not suitable for expert information. Stop-word filtering is performed directly with the general stop-word dictionary and the professional stop-word dictionary, and the feature word set of each expert is extracted. The general stop-word dictionary is again the Harbin Institute of Technology stop-word list, and the professional stop-word dictionary has to be maintained continuously.
Step 5. Per-field knowledge representation models of science and technology projects and review experts are built. By extending the vector space model and the matter-element knowledge set model, a text representation model PRO = (id, F, WF, T, V) is established according to the different field information of a project, where id denotes the identification field in the project library, F denotes the set of field categories of the project, WF is the set of field weights, T is the set of feature words, and V denotes the feature words of each field with their weights, V_i = {v_i1, f(v_i1), v_i2, f(v_i2), ..., v_in, f(v_in)}, where v_ij denotes the j-th feature word of the i-th field and f(v_ij) denotes the frequency of v_ij. Similarly, a knowledge representation model TM = (id, F, WF, T, V) is established according to the different field information of an expert, where id denotes the identification field in the expert library, F denotes the set of field categories of the expert, WF is the set of field weights, T is the set of feature words, and V denotes the feature words of each field with their weights, f(v_ij) being the frequency of occurrence of feature word v_ij in the corresponding field. (The matrix-style representations of the project and expert knowledge are given only as images in the original publication.)
Step 5. Construction of the review-expert information index library: after the expert knowledge representation models have been built, the information indexes are stored: first, the content information of a review expert is read from the expert database; a word semantic network is built on the segmentation result and the feature words of the expert are extracted; an index is built for the expert with Apache Lucene according to the knowledge representation model; the index is added to the index library of the corresponding category, and this is repeated until all review experts have been indexed (see Fig. 5).
Step 6. According to the number of projects, the recommendation mode is divided into expert recommendation for a single project under review and expert recommendation for a group of (several) projects under review. For group recommendation, the feature merging operations between fields and between projects are applied to the project knowledge representation models of step 5; for a single project, only the merging between fields is applied. At the same time, the merging between fields is applied to the expert knowledge representation models of step 5. Indexes are then built on the merged feature information with Apache Lucene according to the knowledge representation model. The project index is built at recommendation time.
In a project application management system, projects under review frequently need group recommendation. The feature merging operations defined below preserve the different field weights set in the knowledge representation models of step 5, so that the fields keep their different contributions to the similarity computation and hence to the recommendation.
The feature merging of projects under review and of review experts, denoted by the operator ⊕, is carried out as follows:
(1) Merging the features between the fields of one project under review or one review expert
Suppose the field feature word sets W'1 and W'2 are to be merged. The merge rule W'1 ⊕ W'2 is defined as
W'1 ⊕ W'2 = { ∀i, j: { word_1i, ( f(word_1i) + f(word_2j) ) / 2 } | word_1i = word_2j }
where word_1i and word_2j are feature words.
This definition is improved and extended by adding the field weights; the features between fields of a review expert or of a science and technology project are merged according to the rule
W'1 ⊕ W'2 = { ∀i, j: { word_1i, ( w1·f(word_1i) + w2·f(word_2j) ) / ( w1² + w2² ) } | word_1i = word_2j }
(2) Merging the features between the projects of a group of projects under review
This merging operation is applied only to the feature vectors of the projects under review, not to the expert feature vectors; an expert feature vector only undergoes the merging between fields. Let V(d1) and V(d2) be the vector models of two projects after their fields have been merged. For any t_1i ∈ V(d1) and t_2j ∈ V(d2) with t_1i identical to t_2j, the merge V(d1) ⊕ V(d2) is defined as
V(d1) ⊕ V(d2) = { ⟨ t_k, w_k(p) = ( w_i(d1) + w_j(d2) ) / 2 ⟩ }
where k = 1, ..., n, t_k is a feature term and w_k(p) is the weight of t_k.
The knowledge representation model of a group of science and technology projects is produced as follows:
a) The fields of each project are merged, giving the vector model V(d) of each project;
b) The merge strategy above is applied to the set of all project vector models, yielding the vector-space knowledge representation model of the group:
V(p) = { ⟨t_1, w_1(p)⟩, ⟨t_2, w_2(p)⟩, ..., ⟨t_n, w_n(p)⟩ }
where k = 1, ..., n, t_k is a feature word term of the project group and w_k(p) is the weight of t_k.
Step 7. After the merging between fields of step 6 has been applied to the knowledge representation models of the review experts and of the science and technology projects, suppose the information vector of a review expert is P = {s_1, f(s_1), s_2, f(s_2), ..., s_n, f(s_n)} and the information vector of a project (group) is Q = {t_1, f(t_1), t_2, f(t_2), ..., t_n, f(t_n)}. The semantic similarity between the vector of the project (group) under review and the vector of the review expert is computed with a maximum matching algorithm.
The semantic similarity between the vector of a project (group) under review and the vector of a review expert is computed with a bipartite-graph maximum matching algorithm as follows:
Computing the semantic similarity with the maximum matching algorithm means obtaining the similarity of the two texts through a maximum matching on a bipartite graph. As shown in Fig. 2, each feature word of the project (group) vector is taken as a vertex of part X and each feature word of the expert vector as a vertex of part Y, so that the computation is equivalent to finding a maximum-weight matching of a complete bipartite graph; the thick lines in Fig. 2 are the edges of maximum semantic similarity between a feature word of part X and a feature word of part Y.
The semantic similarity itself is obtained from the HowNet-based similarity computation. The invention computes the semantic similarity between a project (group) under review and a review expert through the HowNet semantic dictionary and the maximum matching algorithm:
SimSEM(P, Q) = ( Σ_{k=1..p} f(s_i)·f(t_j)·SimSEM(s_i, t_j) ) / min(m, n)
where s_i and t_j are the two word nodes of an edge of maximum semantic similarity SimSEM(s_i, t_j) (a thick line in Fig. 2), m and n are respectively the number of feature words in the project vector representation and in the expert vector representation, and p is the number of edges of maximum semantic similarity (thick lines in Fig. 2).
The semantic similarity between a project (group) under review and the expert information defined above involves factors such as language, word semantics and word structure and represents the degree of match between the two; a large similarity indicates a high degree of match, i.e. the review expert is suitable for reviewing the project (group).
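A sketch of the bipartite maximum-matching similarity follows; scipy's linear_sum_assignment is used as the maximum-weight matching solver, which is a substitution for whatever matching routine the patented implementation uses, and word_sim stands in for the HowNet-based SimSEM between feature words.
```python
# Sketch of SimSEM(P, Q): the feature words of the expert vector P and of the
# project (group) vector Q form the two parts of a bipartite graph, a
# maximum-weight matching is computed, and the matched similarities are
# combined as in the formula above.
import numpy as np
from scipy.optimize import linear_sum_assignment

def expert_project_similarity(p_words, p_freq, q_words, q_freq, word_sim):
    m, n = len(p_words), len(q_words)
    S = np.array([[word_sim(a, b) for b in q_words] for a in p_words])
    rows, cols = linear_sum_assignment(S, maximize=True)    # maximum-weight matching
    score = sum(p_freq[i] * q_freq[j] * S[i, j] for i, j in zip(rows, cols))
    return score / min(m, n)
```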
Step 8. A similarity truncation threshold is set, a recommendation index is generated according to the magnitude of the similarity, and the final list of recommended review experts is produced.
The above is only the preferred embodiment of the invention. It should be understood that, in the field of intelligent machine recommendation of review experts for science and technology projects, improvements and variations can also be made without departing from the technical principle of the invention, and such improvements and variations shall also be regarded as falling within the protection scope of the invention.

Claims (3)

1. An intelligent recommendation method of review experts for science and technology projects, characterized in that the method comprises the following steps:
Step 1. The general terms and habitual words appearing in project and expert information are taken as a professional stop-word dictionary; punctuation marks and non-Chinese characters are taken as the cutting-mark library;
Step 2. The project information and the expert information are segmented: according to the cutting marks, the project name, main research contents and technical indicators of the project information are cut into substring sequences; likewise, the expert's undertaken projects and achievements, awards, inventions, published papers and research directions are extracted and cut into substring sequences, one substring sequence corresponding to one field; the substring sequences are segmented into words with the Chinese Academy of Sciences ICTCLAS tool;
Step 3. Feature word extraction for science and technology projects: stop-word filtering is applied to the segmentation result with a general stop-word dictionary and the professional stop-word dictionary; said general stop-word dictionary adopts the Harbin Institute of Technology stop-word list, and the segmentation result with stop words removed forms a word set;
The construction of the professional stop-word dictionary is a continuously improving self-learning process: word frequencies are accumulated during segmentation, and a word whose probability of occurrence in the texts exceeds a certain threshold is added to the stop-word dictionary;
Because the amount of project information is large, the semantic similarity between the words of the word set is computed, a term network is built from the semantic relations and co-occurrence relations of the words, and the aggregation feature value of each word in the network is computed; combined with the statistical features of the words, the key degree of each word is then computed and the project feature words are extracted; the feature words of a project thus combine the statistical feature information and the semantic feature information of the text, which makes the extraction more accurate;
Step 4. Review-expert feature word extraction: stop-word filtering is performed with the general stop-word dictionary and the professional stop-word dictionary, and the feature word set of each expert is extracted;
Step 5. Per-field knowledge representation models of science and technology projects and review experts are built: by extending the vector space model and the matter-element knowledge set model, a text representation model PRO = (id, F, WF, T, V) is established according to the different field information of a project, where id denotes the identification field in the project library, F denotes the set of field categories of the project, WF is the set of field weights, T is the set of feature words, and V denotes the feature words of each field with their weights, V_i = {v_i1, f(v_i1), v_i2, f(v_i2), ..., v_in, f(v_in)}, where v_ij denotes the j-th feature word of the i-th field and f(v_ij) denotes the frequency of v_ij; similarly, a knowledge representation model TM = (id, F, WF, T, V) is established according to the different field information of an expert, where id denotes the identification field in the expert library, F denotes the set of field categories of the expert, WF is the set of field weights, T is the set of feature words, and V denotes the feature words of each field with their weights, f(v_ij) being the frequency of occurrence of feature word v_ij in the corresponding field (the matrix-style representations of the project and expert knowledge are given only as images in the original publication);
Step 5. evaluation expert information index storehouse builds: after evaluating Expert Knowledge Expression model construction and completing, information index is put in storage: the content item information that first reads an evaluation expert from experts database; Based on word segmentation result, set up phrase semantic network and extract the Feature Words that evaluation expert comprises; According to Knowledge Representation Model and utilize Apache Lucene to set up index to it; The index establishing is added in corresponding index database by affiliated classification, until all evaluation expert's index warehouse-in;
Step 6: according to the number of project, the way of recommendation is divided into single pending trial project recommendation expert and grouping pending trial project recommendation expert; Grouping recommends expert to represent that to the pending trial project knowledge of step 5 model does the feature union operation between corresponding interfield and project, and single pending trial expert recommends only to do corresponding interfield feature union operation; Meanwhile, the evaluation expert's of step 5 Knowledge Representation Model is carried out to the merging of interfield feature; According to Knowledge Representation Model and utilize the characteristic information after Apache Lucene is combined to set up index; Wherein, science and technology item index construct carries out when carrying out project recommendation;
In science and technology item declaration management system, pending trial project needs grouping to recommend often, above-mentioned feature union operation not only guarantee not can removal process 5 in Knowledge Representation Model different field weight is set similarity is calculated and produced the contribution difference of recommending;
Step 7. merges through the interfield feature of the evaluation expert of step 6 and the Knowledge Representation Model of science and technology item, if suppose, evaluation expert's information vector is expressed as P={s 1, f (s 1), s 2, f (s 2) ..., s n, f (s n), science and technology item information vector is expressed as Q={t 1, f (t 1), t 2, f (t 2) ..., t n, f (t n), the semantic similarity based on maximum matching algorithm calculating pending trial science and technology item vector with evaluation expert;
Step 8. A similarity truncation threshold is set, a recommendation index is generated according to the similarity values, and the final list of recommended review experts is produced.
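As a rough illustration of steps 7 and 8 only, the following sketch computes a greedy maximum-matching similarity between an expert vector and a project vector and applies the truncation threshold; the greedy matching and the frequency weighting are assumptions, since the claim does not spell out the maximum matching algorithm:

```java
import java.util.Map;

public class SimilarityRanker {
    // Greedy sketch of a maximum-matching semantic similarity between an expert vector P
    // and a project vector Q, both given as feature word -> frequency maps.
    // wordSim is assumed to return the HowNet-style word similarity SimSEM(w1, w2) in [0, 1].
    public static double similarity(Map<String, Double> p, Map<String, Double> q,
                                    java.util.function.BiFunction<String, String, Double> wordSim) {
        double num = 0.0, den = 0.0;
        for (Map.Entry<String, Double> ep : p.entrySet()) {
            double best = 0.0;
            for (String tq : q.keySet()) {              // take the best-matching project word
                best = Math.max(best, wordSim.apply(ep.getKey(), tq));
            }
            num += ep.getValue() * best;                // weight by the expert word's frequency
            den += ep.getValue();
        }
        return den == 0.0 ? 0.0 : num / den;
    }

    // Threshold truncation: an expert enters the recommendation list only if similarity >= gamma.
    public static boolean recommend(double sim, double gamma) {
        return sim >= gamma;
    }
}
```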
2. The intelligent review expert recommendation method for science and technology projects according to claim 1, characterized in that the semantic similarity computation described in step 3 is as follows:
In the HowNet semantic dictionary, for two words W_1 and W_2, suppose W_1 has n concepts S_11, S_12, ..., S_1n and W_2 has m concepts S_21, S_22, ..., S_2m; the similarity SimSEM(W_1, W_2) of words W_1 and W_2 equals the maximum of the similarities of their concepts:
SimSEM(W_1, W_2) = max_{i=1..n, j=1..m} Sim(S_1i, S_2j)
Content words and function words have different description languages, so the similarity must be computed between their corresponding sememes or relational sememes; a content-word concept is described by its first basic sememe, its other basic sememes, its relational sememe descriptions and its relational symbol descriptions, whose similarities are denoted Sim1(p_1, p_2), Sim2(p_1, p_2), Sim3(p_1, p_2) and Sim4(p_1, p_2) respectively; the similarity of two feature structures ultimately reduces to the similarity of basic sememes or of concrete words;
Sim(S_1, S_2) = Σ_{i=1}^{4} β_i · Sim_i(S_1, S_2)
β_i (1 ≤ i ≤ 4) are adjustable parameters with β_1 + β_2 + β_3 + β_4 = 1 and β_1 ≥ β_2 ≥ β_3 ≥ β_4;
If CW = {C1, C2, ..., Cm} is the word set obtained after processing, its corresponding semantic similarity adjacency matrix S_m is defined as:
S_m = ( Sim(C_i, C_j) )_{m×m}, 1 ≤ i, j ≤ m
wherein Sim(C_i, C_j) is the semantic similarity between word C_i and word C_j, Sim(C_i, C_i) = 1, and Sim(C_i, C_j) = Sim(C_j, C_i);
For the word set CW = {C1, C2, ..., Cm}, the semantic similarity computation yields m·(m+1)/2 pairwise similarity values between words;
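A compact sketch of building the semantic similarity adjacency matrix S_m; the HowNet-based word similarity function is assumed to be supplied:

```java
public class SimilarityMatrix {
    // Build the m x m semantic similarity adjacency matrix S_m for the word set CW.
    // wordSim(w1, w2) is assumed to implement SimSEM over the HowNet sememe hierarchy.
    public static double[][] build(String[] cw,
                                   java.util.function.BiFunction<String, String, Double> wordSim) {
        int m = cw.length;
        double[][] s = new double[m][m];
        for (int i = 0; i < m; i++) {
            s[i][i] = 1.0;                        // Sim(Ci, Ci) = 1
            for (int j = i + 1; j < m; j++) {     // only the distinct pairs need to be evaluated
                double v = wordSim.apply(cw[i], cw[j]);
                s[i][j] = v;
                s[j][i] = v;                      // symmetry: Sim(Ci, Cj) = Sim(Cj, Ci)
            }
        }
        return s;
    }
}
```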
The co-occurrence relation of words is computed as follows:
The word co-occurrence model is one of the important models in statistics-based natural language processing; according to this model, if two words frequently co-occur within the same window unit of a document, the two words are related in meaning and express, to some extent, the semantic information of the text; a sliding window is used to compute the word co-occurrence degree over the term sequence:
First, word extraction is performed on the term sequence: spaces and null values are removed and identical words are merged, yielding the word set CW = {C1, C2, ..., Cm}, where m ≤ n;
The word co-occurrence degree matrix C_m corresponding to the word set CW is defined as:
C_m = ( Coo(C_i, C_j) )_{m×m}, 1 ≤ i, j ≤ m
Initially, every element of C_m is Coo(C_i, C_j) = 0 (1 ≤ i, j ≤ m);
The word co-occurrence degree over the term sequence is computed with a sliding window; the words in the sliding window are T_{i-1} T_i T_{i+1} (1 < i < n):
1) If i = n-1, go to 4); if T_{i-1} is a space or null, slide the window to the next word (i++); otherwise go to 2);
2) If T_i is Chinese, Coo(T_{i-1}, T_i)++ and go to 3); if T_i is null, go to 3); otherwise go to 1);
3) If T_{i+1} is Chinese, Coo(T_{i-1}, T_{i+1})++, i++, and go to 1); otherwise go to 1);
4) If T_{n-2} is Chinese, go to 5); otherwise go to 7);
5) If T_{n-1} is Chinese, Coo(T_{n-2}, T_{n-1})++ and go to 6); if T_{n-1} is a space, go to 6); otherwise end;
6) If T_n is Chinese, Coo(T_{n-2}, T_n)++ and end; otherwise end;
7) If T_{n-1} is Chinese and T_n is also Chinese, Coo(T_{n-1}, T_n)++ and end; otherwise end;
Through the above steps, the word co-occurrence degree matrix C_m is obtained, and each element of C_m is normalized, i.e., each element is divided by the maximum value over all elements of the matrix, max{Coo(C_i, C_j) | 1 ≤ i, j ≤ m};
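A simplified sketch of the sliding-window co-occurrence count followed by the maximum-value normalisation; it treats the window simply as the pairs (T_{i-1}, T_i) and (T_{i-1}, T_{i+1}) over Chinese words and omits the end-of-sequence cases 4)-7):

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class CooccurrenceMatrix {
    private static boolean isChinese(String w) {
        return w != null && !w.isEmpty() && w.codePoints().allMatch(
                cp -> Character.UnicodeScript.of(cp) == Character.UnicodeScript.HAN);
    }

    // Simplified sliding-window count over the term sequence, aligned with the word set CW,
    // then normalised by the maximum element of the matrix.
    public static double[][] build(List<String> terms, List<String> cw) {
        Map<String, Integer> idx = new HashMap<>();
        for (int k = 0; k < cw.size(); k++) idx.put(cw.get(k), k);
        double[][] coo = new double[cw.size()][cw.size()];
        for (int i = 1; i + 1 < terms.size(); i++) {
            String a = terms.get(i - 1);
            if (!isChinese(a) || !idx.containsKey(a)) continue;   // skip spaces, nulls, unknown anchors
            bump(coo, idx, a, terms.get(i));                      // Coo(T_{i-1}, T_i)++
            bump(coo, idx, a, terms.get(i + 1));                  // Coo(T_{i-1}, T_{i+1})++
        }
        double max = 0.0;
        for (double[] row : coo) for (double v : row) max = Math.max(max, v);
        if (max > 0) for (double[] row : coo) for (int j = 0; j < row.length; j++) row[j] /= max;
        return coo;
    }

    private static void bump(double[][] coo, Map<String, Integer> idx, String a, String b) {
        if (isChinese(b) && idx.containsKey(b)) coo[idx.get(a)][idx.get(b)] += 1.0;
    }
}
```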
The term network is constructed as follows:
When building the weighted term network, the weight matrix of the term network is obtained first; the weight matrix W_m is defined as:
(formula image FDA0000401292360000051: each element w_ij of W_m combines the co-occurrence degree Coo(C_i, C_j), weighted by α, with the semantic similarity Sim(C_i, C_j), weighted by β)
wherein α is 0.3 and β is 0.7, which strengthens the semantic relation between words and weakens the co-occurrence relation between words;
W_m serves as the adjacency matrix of the input term network; its corresponding network graph is defined as G = {V, E}, where G is an undirected weighted graph, V denotes the vertex set of G, E denotes the edge set of G, and v_i denotes the i-th vertex (word) in V;
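Assuming the element-wise combination indicated above (α weighting the co-occurrence degree, β weighting the semantic similarity), the weight matrix W_m of the term graph could be formed as follows; `coo` and `sim` are the two m×m matrices built earlier:

```java
public class TermNetwork {
    // Combine the normalised co-occurrence matrix and the semantic similarity matrix into
    // the weight matrix Wm of the undirected weighted term graph G = {V, E}.
    public static double[][] weightMatrix(double[][] coo, double[][] sim,
                                          double alpha, double beta) {   // e.g. alpha = 0.3, beta = 0.7
        int m = sim.length;
        double[][] w = new double[m][m];
        for (int i = 0; i < m; i++)
            for (int j = 0; j < m; j++)
                w[i][j] = alpha * coo[i][j] + beta * sim[i][j];
        return w;
    }
}
```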
The computation of the word aggregation characteristic value is as follows:
The key characteristics of a term network include its degree distribution, average shortest path, clustering degree and clustering coefficient; the degree of a node reflects how that node is associated with other nodes; the clustering degree and clustering coefficient of a node reflect how densely the nodes in its local neighbourhood are interconnected; the degree and clustering coefficient of a node reflect the importance of that node in its local range; measuring the aggregation characteristic value of a node by its weighted degree, clustering coefficient and node betweenness allows important words to receive higher weights while ensuring that words related to many important words also score highly;
In the semantic similarity network graph, the unordered pair (v_i, v_j) denotes the edge between nodes v_i and v_j; the weighted degree of node v_i is defined as:
WD_i = ( Σ_{j=1}^{n} w_ij ) / n
wherein w_ij is the weight of the edge between nodes v_i and v_j, and n is the total number of nodes;
In the semantic similarity network graph, the unordered pair (v_i, v_j) denotes the edge between nodes v_i and v_j; the unweighted degree D_i of node v_i is D_i = |{(v_i, v_j): (v_i, v_j) ∈ E, v_i, v_j ∈ V}|; the clustering degree T_i of node v_i is the number of edges actually existing between its neighbour nodes: T_i = |{(v_j, v_k): (v_i, v_j) ∈ E, (v_i, v_k) ∈ E, (v_j, v_k) ∈ E, v_j, v_k ∈ V}|; the clustering coefficient C_i of node v_i is defined as:
C_i = T_i / C(D_i, 2) = 2·T_i / ( D_i·(D_i − 1) )
In the semantic similarity network graph, the node betweenness Betweenness of node v_i is the probability that a shortest path between two nodes w and x passes through node v_i; the interaction between two non-adjacent nodes depends on the nodes on the shortest paths connecting them, and these nodes potentially control the information flow between the two nodes; B_i reflects the connecting role of node v_i in its local environment; the node betweenness Betweenness is defined as:
B_i = Σ_{w, x ∈ G} d(w, x; v_i) / d(w, x)
d(w, x) denotes the number of shortest paths between any two nodes w and x, and d(w, x; v_i) denotes the number of shortest paths between w and x that pass through v_i;
The aggregation characteristic value of node v_i is measured comprehensively by weighting its weighted degree, clustering coefficient and betweenness Betweenness; the aggregation characteristic value Z_i of node v_i is defined as:
Z_i = a·WD_i + b·C_i / ( Σ_{j=1}^{n} C_j ) + c·B_i
Wherein, a+b+c=1;
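A sketch of the aggregation characteristic value Z_i; the betweenness values are assumed to be pre-computed (for example with Brandes' algorithm), and an edge is assumed to exist wherever the weight exceeds a small threshold eps, since the weight matrix itself is dense:

```java
public class AggregationFeature {
    // Aggregation characteristic Z_i = a*WD_i + b*C_i/sum(C_j) + c*B_i for every node.
    // w is the weight matrix of the term graph; an edge (i, j) is assumed where w[i][j] > eps;
    // betweenness[] is assumed to be supplied by a separate shortest-path computation.
    public static double[] compute(double[][] w, double[] betweenness,
                                   double a, double b, double c, double eps) {
        int n = w.length;
        double[] wd = new double[n], cc = new double[n], z = new double[n];
        double ccSum = 0.0;
        for (int i = 0; i < n; i++) {
            // weighted degree WD_i = sum_j w_ij / n
            for (int j = 0; j < n; j++) if (j != i) wd[i] += w[i][j];
            wd[i] /= n;
            // clustering coefficient C_i = 2*T_i / (D_i*(D_i-1))
            java.util.List<Integer> nb = new java.util.ArrayList<>();
            for (int j = 0; j < n; j++) if (j != i && w[i][j] > eps) nb.add(j);
            int d = nb.size(), t = 0;
            for (int x = 0; x < d; x++)
                for (int y = x + 1; y < d; y++)
                    if (w[nb.get(x)][nb.get(y)] > eps) t++;
            cc[i] = d > 1 ? 2.0 * t / (d * (d - 1.0)) : 0.0;
            ccSum += cc[i];
        }
        for (int i = 0; i < n; i++)
            z[i] = a * wd[i] + b * (ccSum > 0 ? cc[i] / ccSum : 0.0) + c * betweenness[i];
        return z;
    }
}
```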
The computation of the statistical characteristic value of a word is as follows:
A nonlinear function is adopted to normalise the word frequency; the word-frequency weight TF_i of word W_i in the text is defined as:
TF_i = f(W_i) / Σ_{j=1}^{n} f(p_j)
wherein TF_i denotes the word-frequency weight of word W_i, p_j denotes a word in the text, and f is the word-frequency counting function;
The part-of-speech weight pos_i of word W_i in the text is defined as:
(formula image FDA0000401292360000071: definition of the part-of-speech weight pos_i)
The longer a word is, the more concrete the information it reflects; conversely, shorter words usually express more abstract meanings; in particular, the feature words of a document are mostly combinations of specialized academic vocabulary: the longer such a word is, the clearer its meaning and the better it reflects the text topic; increasing the weight of long words helps segment the vocabulary and thus reflects more accurately the importance of a word in the document;
The word-length weight len_i of word W_i in the text is defined as:
(formula image FDA0000401292360000072: definition of the word-length weight len_i)
For each word in the term sequence, its statistical characteristic value is
stats_i = A·TF_i + B·pos_i + C·len_i
Wherein, A+B+C=1;
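A sketch of the statistical characteristic value stats_i; because the part-of-speech and word-length weights are defined only in the formula images above, they are abstracted here as supplied functions:

```java
import java.util.Map;
import java.util.function.ToDoubleFunction;

public class StatisticalFeature {
    // stats_i = A*TF_i + B*pos_i + C*len_i with A + B + C = 1.
    // posWeight and lenWeight stand in for the part-of-speech and word-length weights,
    // whose exact definitions appear only as formula images in the claim.
    public static double stats(String word, Map<String, Integer> freq, int totalFreq,
                               ToDoubleFunction<String> posWeight,
                               ToDoubleFunction<String> lenWeight,
                               double A, double B, double C) {
        double tf = totalFreq > 0 ? freq.getOrDefault(word, 0) / (double) totalFreq : 0.0;
        return A * tf + B * posWeight.applyAsDouble(word) + C * lenWeight.applyAsDouble(word);
    }
}
```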
The computation of the key degree of word W_i is as follows:
For each node in the weighted term network, its key degree value Imp_i is defined as:
Imp_i = β·stats_i + (1 − β)·Z_i
Wherein, 0 < β < 1;
Through this computation, the key degree values are obtained and sorted in descending order; a threshold γ (0 < γ < 1) is set and the top q values are taken out; the corresponding words serve as the feature words of the science and technology project; these words fully reflect the topic and are the important words.
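A sketch of the key degree computation and the selection of the top q feature words:

```java
import java.util.*;
import java.util.stream.Collectors;

public class KeyDegree {
    // Imp_i = beta*stats_i + (1-beta)*Z_i; sort descending and keep the top q words as feature words.
    public static List<String> selectFeatureWords(List<String> words, double[] stats, double[] z,
                                                  double beta, int q) {
        Map<String, Double> imp = new HashMap<>();
        for (int i = 0; i < words.size(); i++)
            imp.put(words.get(i), beta * stats[i] + (1 - beta) * z[i]);
        return imp.entrySet().stream()
                  .sorted((a, b) -> Double.compare(b.getValue(), a.getValue()))  // descending key degree
                  .limit(q)
                  .map(Map.Entry::getKey)
                  .collect(Collectors.toList());
    }
}
```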
3. The intelligent review expert recommendation method for science and technology projects according to claim 1, characterized in that the feature merging described in step 6 is performed through the logical ⊕ (merge) operation as follows:
(1) Inter-field feature merging for a pending project or a review expert
Suppose the field feature word sets W′_1 and W′_2 are to be merged; the merge rule W′_1 ⊕ W′_2 is defined as:
W′_1 ⊕ W′_2 = { ∀ i, j, { word_1i, ( f(word_1i) + f(word_2j) ) / 2 } | word_1i = word_2j }
wherein word_1i and word_2j are feature words;
Adding the field weights improves and extends the above definition; the inter-field features of review experts and science and technology projects are merged with the rule:
W′_1 ⊕ W′_2 = { ∀ i, j, { word_1i, ( w_1·f(word_1i) + w_2·f(word_2j) ) / ( w_1^2 + w_2^2 ) } | word_1i = word_2j }
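A sketch of the weighted inter-field merge rule as literally written above, i.e. only words present in both field sets are kept and their frequencies are combined with the field weights w_1 and w_2:

```java
import java.util.HashMap;
import java.util.Map;

public class FieldMerge {
    // Merge two field feature-word sets: a word is retained when it occurs in both sets
    // (word_1i = word_2j), and its frequencies are combined with the field weights.
    public static Map<String, Double> merge(Map<String, Double> f1, double w1,
                                            Map<String, Double> f2, double w2) {
        Map<String, Double> merged = new HashMap<>();
        for (Map.Entry<String, Double> e : f1.entrySet()) {
            Double other = f2.get(e.getKey());
            if (other != null) {
                merged.put(e.getKey(),
                           (w1 * e.getValue() + w2 * other) / (w1 * w1 + w2 * w2));
            }
        }
        return merged;
    }
}
```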
(2) Inter-project feature merging for a group of pending projects
This merging operation applies only to the feature vectors of the pending science and technology projects, not to the review experts' feature vectors; an expert's feature vector only needs the inter-field feature merge; let V(d_1) and V(d_2) be the vector models of two science and technology projects after inter-field feature merging; for any t_1i ∈ V(d_1) and t_2j ∈ V(d_2), if t_1i and t_2j are identical they are merged;
V(d_1) ⊕ V(d_2) is defined as:
V(d_1) ⊕ V(d_2) = { <t_k, w_k(p) = ( w_i(d_1) + w_j(d_2) ) / 2 > }
wherein k = 1, ..., n, t_k is a feature entry, and w_k(p) is the weight of t_k;
The basic process of producing the knowledge representation model is as follows:
a). Merge the inter-field features of each science and technology project to obtain the vector model V(d) of each project;
b). Apply the merging strategy ⊕ to the set of all science and technology project vector models; by the above method, the vector-space-based knowledge representation model of the science and technology projects is established:
V(p) = { <t_1, w_1(p)>, <t_2, w_2(p)>, ..., <t_n, w_n(p)> }
wherein k = 1, ..., n, t_k is a feature word entry of the project group, and w_k(p) is the weight of t_k.
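A sketch of the inter-project merge and the group model construction; it keeps every feature entry and averages the weights of entries that appear in both projects, which is one plausible reading of the rule above rather than the claimed implementation:

```java
import java.util.HashMap;
import java.util.Map;

public class ProjectGroupMerge {
    // Merge the vector models of two projects after their own inter-field merges:
    // identical feature entries t_k receive the averaged weight (w_i(d1) + w_j(d2)) / 2,
    // entries unique to one project keep their original weight.
    public static Map<String, Double> merge(Map<String, Double> v1, Map<String, Double> v2) {
        Map<String, Double> vp = new HashMap<>(v1);
        for (Map.Entry<String, Double> e : v2.entrySet())
            vp.merge(e.getKey(), e.getValue(), (a, b) -> (a + b) / 2.0);
        return vp;
    }

    // Folding this pairwise merge over all project vector models yields the group model V(p).
    public static Map<String, Double> groupModel(Iterable<Map<String, Double>> projects) {
        Map<String, Double> vp = new HashMap<>();
        for (Map<String, Double> v : projects) vp = merge(vp, v);
        return vp;
    }
}
```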
CN201310509358.2A 2013-10-24 2013-10-24 Intelligent review expert recommending method for science and technology projects Active CN103631859B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201310509358.2A CN103631859B (en) 2013-10-24 2013-10-24 Intelligent review expert recommending method for science and technology projects

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201310509358.2A CN103631859B (en) 2013-10-24 2013-10-24 Intelligent review expert recommending method for science and technology projects

Publications (2)

Publication Number Publication Date
CN103631859A true CN103631859A (en) 2014-03-12
CN103631859B CN103631859B (en) 2017-01-11

Family

ID=50212901

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201310509358.2A Active CN103631859B (en) 2013-10-24 2013-10-24 Intelligent review expert recommending method for science and technology projects

Country Status (1)

Country Link
CN (1) CN103631859B (en)

Cited By (43)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104361102A (en) * 2014-11-24 2015-02-18 清华大学 Expert recommendation method and system based on group matching
CN105786960A (en) * 2015-01-14 2016-07-20 通用电气公司 Method, System, And User Interface For Expert Search Based On Case Resolution Logs
CN105912581A (en) * 2016-03-31 2016-08-31 比美特医护在线(北京)科技有限公司 Information processing method and device
CN103823896B (en) * 2014-03-13 2017-02-15 蚌埠医学院 Subject characteristic value algorithm and subject characteristic value algorithm-based project evaluation expert recommendation algorithm
CN107194672A (en) * 2016-11-09 2017-09-22 北京理工大学 It is a kind of to merge academic speciality and the evaluation distribution method of community network
CN107229738A (en) * 2017-06-18 2017-10-03 杭州电子科技大学 A kind of scientific paper search ordering method based on document scores model and the degree of correlation
CN107609006A (en) * 2017-07-24 2018-01-19 华中师范大学 A kind of chess game optimization method based on local chronicle research
CN107656920A (en) * 2017-09-14 2018-02-02 杭州电子科技大学 A kind of skilled personnel based on patent recommend method
CN107784087A (en) * 2017-10-09 2018-03-09 东软集团股份有限公司 A kind of hot word determines method, apparatus and equipment
CN107807978A (en) * 2017-10-26 2018-03-16 北京航空航天大学 A kind of code review person based on collaborative filtering recommends method
CN108229684A (en) * 2018-01-26 2018-06-29 中国科学技术信息研究所 Build the method, apparatus and terminal device of expertise vector model
CN108399491A (en) * 2018-02-02 2018-08-14 浙江工业大学 A kind of employee's diversity ranking method based on network
CN108427667A (en) * 2017-02-15 2018-08-21 北京国双科技有限公司 A kind of segmentation method and device of legal documents
CN108549730A (en) * 2018-06-01 2018-09-18 云南电网有限责任公司电力科学研究院 A kind of search method and device of expert info
CN108804633A (en) * 2018-06-01 2018-11-13 腾讯科技(深圳)有限公司 The content recommendation method of Behavior-based control Semantic knowledge network
CN108846056A (en) * 2018-06-01 2018-11-20 云南电网有限责任公司电力科学研究院 A kind of scientific and technological achievement evaluation expert recommended method and device
CN108873706A (en) * 2018-07-30 2018-11-23 中国石油化工股份有限公司 Evaluation of trap intelligent expert recommended method based on deep neural network
CN108920556A (en) * 2018-06-20 2018-11-30 华东师范大学 Recommendation expert method based on subject knowledge map
CN109308315A (en) * 2018-10-19 2019-02-05 南京理工大学 A kind of collaborative recommendation method based on specialist field similarity and incidence relation
CN109857872A (en) * 2019-02-18 2019-06-07 浪潮软件集团有限公司 The information recommendation method and device of knowledge based map
CN109992642A (en) * 2019-03-29 2019-07-09 华南理工大学 A kind of automatic method of selecting of single task expert and system based on scientific and technological entry
CN110046225A (en) * 2019-04-16 2019-07-23 广东省科技基础条件平台中心 A kind of science and technology item material integrity evaluating decision model training method
CN110443574A (en) * 2019-07-25 2019-11-12 昆明理工大学 Entry convolutional neural networks evaluation expert's recommended method
CN110442618A (en) * 2019-07-25 2019-11-12 昆明理工大学 Merge convolutional neural networks evaluation expert's recommended method of expert info incidence relation
CN111143690A (en) * 2019-12-31 2020-05-12 中国电子科技集团公司信息科学研究院 Expert recommendation method and system based on associated expert database
CN111598526A (en) * 2020-04-21 2020-08-28 奇计(江苏)科技服务有限公司 Intelligent comparison and review method for describing scientific and technological innovation content
CN111666420A (en) * 2020-05-29 2020-09-15 华东师范大学 Method for intensively extracting experts based on subject knowledge graph
CN111782797A (en) * 2020-07-13 2020-10-16 贵州省科技信息中心 Automatic matching method for scientific and technological project review experts and storage medium
CN111951141A (en) * 2020-07-09 2020-11-17 广东港鑫科技有限公司 Double-random supervision method and system based on big data intelligent analysis and terminal equipment
CN112100370A (en) * 2020-08-10 2020-12-18 淮阴工学院 Picture examination expert combined recommendation method based on text convolution and similarity algorithm
CN112182327A (en) * 2019-07-05 2021-01-05 北京猎户星空科技有限公司 Data processing method, device, equipment and medium
CN112287679A (en) * 2020-10-16 2021-01-29 国网江西省电力有限公司电力科学研究院 Structured extraction method and system for text information in scientific and technological project review
CN112381381A (en) * 2020-11-12 2021-02-19 深圳供电局有限公司 Expert's device is recommended to intelligence
CN112417870A (en) * 2020-12-10 2021-02-26 北京中电普华信息技术有限公司 Expert information screening method and system
CN112948527A (en) * 2021-02-23 2021-06-11 云南大学 Improved TextRank keyword extraction method and device
CN113516094A (en) * 2021-07-28 2021-10-19 中国科学院计算技术研究所 System and method for matching document with review experts
CN113554210A (en) * 2021-05-17 2021-10-26 南京工程学院 Comment scoring and declaration prediction system and method for fund project declaration
CN113569575A (en) * 2021-08-10 2021-10-29 云南电网有限责任公司电力科学研究院 Evaluation expert recommendation method based on pictograph-semantic dual-feature space mapping
CN113643008A (en) * 2021-10-15 2021-11-12 中国铁道科学研究院集团有限公司科学技术信息研究所 Acceptance expert matching method, device, equipment and readable storage medium
CN114186002A (en) * 2021-12-14 2022-03-15 智博天宫(苏州)人工智能产业研究院有限公司 Scientific and technological achievement data processing and analyzing method and system
CN115033772A (en) * 2022-06-20 2022-09-09 浙江大学 Creative excitation method and device based on semantic network
CN115577696A (en) * 2022-11-15 2023-01-06 四川省公路规划勘察设计研究院有限公司 Project similarity evaluation and analysis method based on WBS tree
CN117034273A (en) * 2023-08-28 2023-11-10 山东省计算中心(国家超级计算济南中心) Android malicious software detection method and system based on graph rolling network

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101075942A (en) * 2007-06-22 2007-11-21 清华大学 Method and system for processing social network expert information based on expert value progation algorithm
CN102495860A (en) * 2011-11-22 2012-06-13 北京大学 Expert recommendation method based on language model
CN102855241A (en) * 2011-06-28 2013-01-02 上海迈辉信息技术有限公司 Multi-index expert suggestion system and realization method thereof
CN102880657A (en) * 2012-08-31 2013-01-16 电子科技大学 Expert recommending method based on searcher

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101075942A (en) * 2007-06-22 2007-11-21 清华大学 Method and system for processing social network expert information based on expert value progation algorithm
CN102855241A (en) * 2011-06-28 2013-01-02 上海迈辉信息技术有限公司 Multi-index expert suggestion system and realization method thereof
CN102495860A (en) * 2011-11-22 2012-06-13 北京大学 Expert recommendation method based on language model
CN102880657A (en) * 2012-08-31 2013-01-16 电子科技大学 Expert recommending method based on searcher

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
胡斌 (Hu Bin): "Research and Implementation of a Review Expert Recommendation System for Science and Technology Projects", China Master's Theses Full-text Database (Information Science and Technology), no. 7, 15 July 2013 (2013-07-15) *

Cited By (63)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103823896B (en) * 2014-03-13 2017-02-15 蚌埠医学院 Subject characteristic value algorithm and subject characteristic value algorithm-based project evaluation expert recommendation algorithm
CN104361102B (en) * 2014-11-24 2018-05-11 清华大学 A kind of expert recommendation method and system based on group matches
CN104361102A (en) * 2014-11-24 2015-02-18 清华大学 Expert recommendation method and system based on group matching
CN105786960A (en) * 2015-01-14 2016-07-20 通用电气公司 Method, System, And User Interface For Expert Search Based On Case Resolution Logs
CN105912581A (en) * 2016-03-31 2016-08-31 比美特医护在线(北京)科技有限公司 Information processing method and device
CN107194672A (en) * 2016-11-09 2017-09-22 北京理工大学 It is a kind of to merge academic speciality and the evaluation distribution method of community network
CN108427667A (en) * 2017-02-15 2018-08-21 北京国双科技有限公司 A kind of segmentation method and device of legal documents
CN108427667B (en) * 2017-02-15 2021-08-10 北京国双科技有限公司 Legal document segmentation method and device
CN107229738A (en) * 2017-06-18 2017-10-03 杭州电子科技大学 A kind of scientific paper search ordering method based on document scores model and the degree of correlation
CN107229738B (en) * 2017-06-18 2020-04-03 杭州电子科技大学 Academic paper search ordering method based on document scoring model and relevancy
CN107609006A (en) * 2017-07-24 2018-01-19 华中师范大学 A kind of chess game optimization method based on local chronicle research
CN107609006B (en) * 2017-07-24 2021-01-29 华中师范大学 Search optimization method based on local log research
CN107656920B (en) * 2017-09-14 2020-12-18 杭州电子科技大学 Scientific and technological talent recommendation method based on patents
CN107656920A (en) * 2017-09-14 2018-02-02 杭州电子科技大学 A kind of skilled personnel based on patent recommend method
CN107784087B (en) * 2017-10-09 2020-11-06 东软集团股份有限公司 Hot word determination method, device and equipment
CN107784087A (en) * 2017-10-09 2018-03-09 东软集团股份有限公司 A kind of hot word determines method, apparatus and equipment
CN107807978A (en) * 2017-10-26 2018-03-16 北京航空航天大学 A kind of code review person based on collaborative filtering recommends method
CN107807978B (en) * 2017-10-26 2021-07-06 北京航空航天大学 Code reviewer recommendation method based on collaborative filtering
CN108229684A (en) * 2018-01-26 2018-06-29 中国科学技术信息研究所 Build the method, apparatus and terminal device of expertise vector model
CN108229684B (en) * 2018-01-26 2022-04-15 中国科学技术信息研究所 Method and device for constructing expert knowledge vector model and terminal equipment
CN108399491A (en) * 2018-02-02 2018-08-14 浙江工业大学 A kind of employee's diversity ranking method based on network
CN108804633B (en) * 2018-06-01 2021-10-08 腾讯科技(深圳)有限公司 Content recommendation method based on behavior semantic knowledge network
CN108846056B (en) * 2018-06-01 2021-04-23 云南电网有限责任公司电力科学研究院 Scientific and technological achievement review expert recommendation method and device
CN108846056A (en) * 2018-06-01 2018-11-20 云南电网有限责任公司电力科学研究院 A kind of scientific and technological achievement evaluation expert recommended method and device
CN108804633A (en) * 2018-06-01 2018-11-13 腾讯科技(深圳)有限公司 The content recommendation method of Behavior-based control Semantic knowledge network
CN108549730A (en) * 2018-06-01 2018-09-18 云南电网有限责任公司电力科学研究院 A kind of search method and device of expert info
CN108920556A (en) * 2018-06-20 2018-11-30 华东师范大学 Recommendation expert method based on subject knowledge map
CN108920556B (en) * 2018-06-20 2021-11-19 华东师范大学 Expert recommending method based on discipline knowledge graph
CN108873706B (en) * 2018-07-30 2022-04-15 中国石油化工股份有限公司 Trap evaluation intelligent expert recommendation method based on deep neural network
CN108873706A (en) * 2018-07-30 2018-11-23 中国石油化工股份有限公司 Evaluation of trap intelligent expert recommended method based on deep neural network
CN109308315A (en) * 2018-10-19 2019-02-05 南京理工大学 A kind of collaborative recommendation method based on specialist field similarity and incidence relation
CN109308315B (en) * 2018-10-19 2022-09-16 南京理工大学 Collaborative recommendation method based on similarity and incidence relation of expert fields
CN109857872A (en) * 2019-02-18 2019-06-07 浪潮软件集团有限公司 The information recommendation method and device of knowledge based map
CN109992642B (en) * 2019-03-29 2022-11-18 华南理工大学 Single task expert automatic selection method and system based on scientific and technological entries
CN109992642A (en) * 2019-03-29 2019-07-09 华南理工大学 A kind of automatic method of selecting of single task expert and system based on scientific and technological entry
CN110046225A (en) * 2019-04-16 2019-07-23 广东省科技基础条件平台中心 A kind of science and technology item material integrity evaluating decision model training method
CN112182327A (en) * 2019-07-05 2021-01-05 北京猎户星空科技有限公司 Data processing method, device, equipment and medium
CN110442618B (en) * 2019-07-25 2023-04-18 昆明理工大学 Convolutional neural network review expert recommendation method fusing expert information association relation
CN110442618A (en) * 2019-07-25 2019-11-12 昆明理工大学 Merge convolutional neural networks evaluation expert's recommended method of expert info incidence relation
CN110443574A (en) * 2019-07-25 2019-11-12 昆明理工大学 Entry convolutional neural networks evaluation expert's recommended method
CN111143690A (en) * 2019-12-31 2020-05-12 中国电子科技集团公司信息科学研究院 Expert recommendation method and system based on associated expert database
CN111598526A (en) * 2020-04-21 2020-08-28 奇计(江苏)科技服务有限公司 Intelligent comparison and review method for describing scientific and technological innovation content
CN111666420A (en) * 2020-05-29 2020-09-15 华东师范大学 Method for intensively extracting experts based on subject knowledge graph
CN111951141A (en) * 2020-07-09 2020-11-17 广东港鑫科技有限公司 Double-random supervision method and system based on big data intelligent analysis and terminal equipment
CN111782797A (en) * 2020-07-13 2020-10-16 贵州省科技信息中心 Automatic matching method for scientific and technological project review experts and storage medium
CN112100370B (en) * 2020-08-10 2023-07-25 淮阴工学院 Picture-trial expert combination recommendation method based on text volume and similarity algorithm
CN112100370A (en) * 2020-08-10 2020-12-18 淮阴工学院 Picture examination expert combined recommendation method based on text convolution and similarity algorithm
CN112287679A (en) * 2020-10-16 2021-01-29 国网江西省电力有限公司电力科学研究院 Structured extraction method and system for text information in scientific and technological project review
CN112381381A (en) * 2020-11-12 2021-02-19 深圳供电局有限公司 Expert's device is recommended to intelligence
CN112381381B (en) * 2020-11-12 2023-11-17 深圳供电局有限公司 Expert's device is recommended to intelligence
CN112417870A (en) * 2020-12-10 2021-02-26 北京中电普华信息技术有限公司 Expert information screening method and system
CN112948527A (en) * 2021-02-23 2021-06-11 云南大学 Improved TextRank keyword extraction method and device
CN112948527B (en) * 2021-02-23 2023-06-16 云南大学 Improved TextRank keyword extraction method and device
CN113554210A (en) * 2021-05-17 2021-10-26 南京工程学院 Comment scoring and declaration prediction system and method for fund project declaration
CN113516094A (en) * 2021-07-28 2021-10-19 中国科学院计算技术研究所 System and method for matching document with review experts
CN113516094B (en) * 2021-07-28 2024-03-08 中国科学院计算技术研究所 System and method for matching and evaluating expert for document
CN113569575A (en) * 2021-08-10 2021-10-29 云南电网有限责任公司电力科学研究院 Evaluation expert recommendation method based on pictograph-semantic dual-feature space mapping
CN113569575B (en) * 2021-08-10 2024-02-09 云南电网有限责任公司电力科学研究院 Evaluation expert recommendation method based on pictographic-semantic dual-feature space mapping
CN113643008A (en) * 2021-10-15 2021-11-12 中国铁道科学研究院集团有限公司科学技术信息研究所 Acceptance expert matching method, device, equipment and readable storage medium
CN114186002A (en) * 2021-12-14 2022-03-15 智博天宫(苏州)人工智能产业研究院有限公司 Scientific and technological achievement data processing and analyzing method and system
CN115033772A (en) * 2022-06-20 2022-09-09 浙江大学 Creative excitation method and device based on semantic network
CN115577696A (en) * 2022-11-15 2023-01-06 四川省公路规划勘察设计研究院有限公司 Project similarity evaluation and analysis method based on WBS tree
CN117034273A (en) * 2023-08-28 2023-11-10 山东省计算中心(国家超级计算济南中心) Android malicious software detection method and system based on graph rolling network

Also Published As

Publication number Publication date
CN103631859B (en) 2017-01-11

Similar Documents

Publication Publication Date Title
CN103631859B (en) Intelligent review expert recommending method for science and technology projects
Saad et al. Twitter sentiment analysis based on ordinal regression
CN108182279B (en) Object classification method, device and computer equipment based on text feature
CN108363790A (en) For the method, apparatus, equipment and storage medium to being assessed
CN106599029A (en) Chinese short text clustering method
CN109582764A (en) Interaction attention sentiment analysis method based on interdependent syntax
CN108388651A (en) A kind of file classification method based on the kernel of graph and convolutional neural networks
CN108563703A (en) A kind of determination method of charge, device and computer equipment, storage medium
CN107832457A (en) Power transmission and transforming equipment defect dictionary method for building up and system based on TextRank algorithm
CN103942340A (en) Microblog user interest recognizing method based on text mining
CN106997341B (en) A kind of innovation scheme matching process, device, server and system
CN107122455A (en) A kind of network user&#39;s enhancing method for expressing based on microblogging
CN106250438A (en) Based on random walk model zero quotes article recommends method and system
CN103559199B (en) Method for abstracting web page information and device
EP3392783A1 (en) Similar word aggregation method and apparatus
CN106886576B (en) It is a kind of based on the short text keyword extracting method presorted and system
CN108038205A (en) For the viewpoint analysis prototype system of Chinese microblogging
CN103631858A (en) Science and technology project similarity calculation method
CN109918648B (en) Rumor depth detection method based on dynamic sliding window feature score
Gao et al. Text classification research based on improved Word2vec and CNN
CN105608075A (en) Related knowledge point acquisition method and system
CN107015965A (en) A kind of Chinese text sentiment analysis device and method
CN105869058A (en) Method for user portrait extraction based on multilayer latent variable model
CN108536781A (en) A kind of method for digging and system of social networks mood focus
Chamekh et al. Sentiment analysis based on deep learning in e-commerce

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant
EE01 Entry into force of recordation of patent licensing contract

Application publication date: 20140312

Assignee: Hangzhou eddy current technology Co., Ltd

Assignor: Hangzhou Electronic Science and Technology Univ

Contract record no.: X2020330000008

Denomination of invention: Intelligent review expert recommending method for science and technology projects

Granted publication date: 20170111

License type: Common License

Record date: 20200117

EE01 Entry into force of recordation of patent licensing contract