An intelligent recommendation method for science and technology project review experts
Technical field
The invention belongs to the technical field of expert recommendation, and in particular relates to a network-service-based intelligent recommendation method for science and technology project review experts. It is an intelligent method for assisting funding decisions on science and technology projects.
Background art
As project management systems have been rapidly adopted by government departments across China, the review of science and technology projects has evolved from the traditional centralized-conference model to the current online review model, removing the geographic restrictions on review experts. Review experts appraise project application documents according to their domain knowledge and the funding criteria of the funding agency, and the agency decides whether to fund a project based on the experts' appraisals.
At present, review experts for science and technology projects are mostly assigned according to the subjective judgment of project managers, and each pending project generally requires evaluation by several experts. Manual assignment therefore suffers from low efficiency, heavy workload, and a lack of scientific rigor, and the selected experts are often not the most suitable. Research on intelligently recommending review experts for science and technology projects is thus crucial: it can effectively alleviate mismatches between experts and the content of the projects under review, and greatly improve the public-service capability of project review.
Existing intelligent recommendation techniques, such as collaborative filtering and content-based recommendation, are mostly applied to video and e-commerce recommendation websites; there is little research on or application to databases of science and technology project review experts. Owing to domain-specific constraints, recommending experts for science and technology projects differs from general recommendation: first, a project management system covers all trades and professions, so the domain knowledge involved is extremely complex; second, the recommendation of review experts bears on the funding of projects, so the requirements for objectivity, fairness, and accuracy are very high. China currently lacks systematic methodological guidance and mature technical support in this respect. Both expert information and pending project information are semi-structured texts whose contents can be matched. The present invention makes full use of structural features and word semantics to compute the similarity between project and expert information: a high similarity indicates that the expert is familiar with the project, and a recommended expert list is generated for reviewing it. The invention also provides a Decision Support System (DSS) for recommending review experts for science and technology projects, which assigns each project to review experts with matching domain knowledge for scientific evaluation, helping decision-making users improve the level and quality of their decisions and making reviews more scientific and objective.
Summary of the invention
In view of the deficiencies of the prior art, the present invention provides an intelligent recommendation method for science and technology project review experts.
The review expert recommendation process of the present invention for science and technology projects comprises the following steps:
Step 1. Build a domain stop-word dictionary from the general terms and common words in the science and technology project and expert information; use punctuation marks and non-Chinese characters as the cutting-mark library.
Step 2. Segment the project and expert information. According to the cutting marks, the project title, main research content, technical specifications, and other fields of the project information are cut into substring sequences; likewise, according to the cutting marks in the review expert information, the expert's basic information, awards, inventions, publications, undertaken projects and achievements, research directions, and other fields are cut into substring sequences, one substring sequence per information field. The substring sequences are then segmented into words using the Chinese Academy of Sciences ICTCLAS segmenter.
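The cutting of Step 2 can be sketched as follows. This is a minimal illustration, assuming the cutting-mark library of Step 1 consists of everything outside the CJK Unified Ideographs range; the subsequent ICTCLAS word segmentation is an external tool and is omitted here.

```python
import re

def cut_into_substrings(field_text):
    """Cut one field's text into a substring sequence at punctuation and
    non-Chinese characters (the cutting-mark library of Step 1)."""
    # Anything outside the CJK Unified Ideographs block acts as a cutting mark.
    parts = re.split(r"[^\u4e00-\u9fff]+", field_text)
    return [p for p in parts if p]

# Illustrative field text (a made-up project title with punctuation and digits).
subs = cut_into_substrings("基于网络的专家推荐方法(2015年,第3版)")
```

Each field of a project or expert record would be passed through this function separately, yielding one substring sequence per field as the step requires.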
Step 3. Extract the feature words of the science and technology project. Filter stop words out of the segmentation result using both a general stop-word dictionary and the domain stop-word dictionary; the general stop-word dictionary adopts the Harbin Institute of Technology stop-word list. The segmentation result with stop words removed forms a word set.
Building the domain stop-word dictionary is a continuously improving self-learning process: word frequencies are accumulated while segmenting the information, and whenever the probability of a word occurring in the texts exceeds a given threshold, the word is added to the stop-word dictionary.
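The self-learning rule above can be sketched as follows. Interpreting "probability of occurring in the texts" as document frequency, and the 0.6 threshold, are illustrative assumptions not fixed by the source.

```python
from collections import Counter

def learn_stop_words(documents, threshold=0.6):
    """Self-learning domain stop-word dictionary: a word whose fraction of
    documents containing it exceeds `threshold` is treated as a stop word.
    (Threshold value and document-probability reading are assumptions.)"""
    doc_count = Counter()
    for words in documents:
        doc_count.update(set(words))   # count documents, not raw tokens
    n = len(documents)
    return {w for w, c in doc_count.items() if c / n > threshold}

docs = [["研究", "方法", "网络"], ["研究", "系统"], ["研究", "网络", "评审"]]
stops = learn_stop_words(docs, threshold=0.6)
```

A word like "研究" ("research") that appears in nearly every record carries no discriminative power, which is exactly why the step routes it into the stop-word dictionary.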
The amount of information in a science and technology project is relatively large. Semantic similarity is computed between the words in the word set; a word network is built from the semantic relations and co-occurrence relations of the words, and an aggregation feature value is computed for each word in the network. This is then combined with the statistical feature value of each word to compute the word's criticality and extract the project's feature words. Feature-word extraction for projects thus combines the statistical and semantic feature information of the text, extracting feature words more accurately.
The semantic similarity computation proceeds as follows:
In the HowNet semantic dictionary, suppose word W1 has n concepts S11, S12, ..., S1n and word W2 has m concepts S21, S22, ..., S2m. The similarity SimSEM(W1, W2) of words W1 and W2 equals the maximum of the similarities over all concept pairs:
SimSEM(W1, W2) = max{ Sim(S1i, S2j) | 1 ≤ i ≤ n, 1 ≤ j ≤ m }
Content words and function words have different description languages, so the similarity between their corresponding sememes or relational sememes must be computed. A content-word concept is described by the first basic sememe, the other basic sememes, the relational-sememe descriptions, and the relational-symbol descriptions; the corresponding similarities are denoted Sim1(p1, p2), Sim2(p1, p2), Sim3(p1, p2), and Sim4(p1, p2), respectively. The similarity computation of two feature structures ultimately reduces to the similarity of basic sememes or of concrete words. The overall concept similarity combines the four parts as
Sim(p1, p2) = Σ(i = 1..4) βi · Π(j = 1..i) Simj(p1, p2)
where βi (1 ≤ i ≤ 4) are adjustable parameters satisfying β1 + β2 + β3 + β4 = 1 and β1 ≥ β2 ≥ β3 ≥ β4.
Let CW = {C1, C2, ..., Cm} be the word set obtained after processing. Its semantic similarity adjacency matrix Sm has elements Sim(Ci, Cj), where Sim(Ci, Cj) is the semantic similarity of words Ci and Cj, Sim(Ci, Ci) = 1, and Sim(Ci, Cj) = Sim(Cj, Ci). Since the matrix is symmetric, only m × (1 + m)/2 pairwise similarity values need to be computed for the word set CW.
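Building the symmetric matrix Sm can be sketched as follows. The pairwise function `toy_sim` is a placeholder standing in for the HowNet-based SimSEM above; only m(m+1)/2 pairs are evaluated, mirroring the pair count stated in the text.

```python
def similarity_matrix(words, sim):
    """Build the symmetric semantic-similarity adjacency matrix Sm.
    `sim` is any pairwise similarity function; the diagonal is fixed at 1
    and only the upper triangle (m*(m+1)/2 pairs) is evaluated."""
    m = len(words)
    S = [[0.0] * m for _ in range(m)]
    for i in range(m):
        for j in range(i, m):
            s = 1.0 if i == j else sim(words[i], words[j])
            S[i][j] = S[j][i] = s   # enforce Sim(Ci,Cj) = Sim(Cj,Ci)
    return S

# Toy similarity standing in for HowNet SimSEM (an assumption for illustration).
toy_sim = lambda a, b: 0.8 if a[0] == b[0] else 0.1
S = similarity_matrix(["网络", "网站", "评审"], toy_sim)
```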
The co-occurrence relations of the words are computed as follows:
The word co-occurrence model is one of the important statistical models in natural language processing. According to this model, if two words frequently co-occur in the same window unit of a document (e.g., the same sentence or paragraph), they are related in meaning and together express, to some extent, the semantic information of the text. A sliding window of length 3 is moved over the word sequence to compute word co-occurrence degrees, as shown in Fig. 1.
First, words are extracted from the word sequence: spaces and nulls are removed and identical words are merged, giving the word set CW = {C1, C2, ..., Cm}, where m ≤ n.
The word co-occurrence degree matrix Cm corresponding to the word set CW is defined by its elements Coo(Ci, Cj); initially, Coo(Ci, Cj) = 0 (1 ≤ i, j ≤ m).
The sliding window moves over the word sequence T1 T2 ... Tn; the words in the window are Ti-1 Ti Ti+1 (1 < i < n):
1) If i = n-1, go to 4). If Ti-1 is a space or null, slide the window to the next word (i++); otherwise, go to 2).
2) If Ti is Chinese, then Coo(Ti-1, Ti)++ and go to 3); if Ti is null, go to 3); otherwise, go to 1).
3) If Ti+1 is Chinese, then Coo(Ti-1, Ti+1)++, i++, and go to 1); otherwise, go to 1).
4) If Tn-2 is Chinese, go to 5); otherwise, go to 7).
5) If Tn-1 is Chinese, Coo(Tn-2, Tn-1)++ and go to 6); if Tn-1 is a space, go to 6); otherwise, end.
6) If Tn is Chinese, Coo(Tn-2, Tn)++ and end; otherwise, end.
7) If Tn-1 is Chinese and Tn is also Chinese, then Coo(Tn-1, Tn)++ and end; otherwise, end.
The preceding computation yields the word co-occurrence degree matrix Cm. Each element of Cm is then normalized by dividing it by the maximum element of the matrix, max{Coo(Ci, Cj) | 1 ≤ i, j ≤ m}.
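The sliding-window counting and normalization can be sketched as follows. This is a simplification: the patent's stepwise procedure with its space/null cases is collapsed into counting every token pair that falls within the same length-3 window, followed by the same division by the maximum entry.

```python
from collections import defaultdict

def cooccurrence(tokens, window=3):
    """Simplified length-3 sliding-window co-occurrence degree: each pair
    of distinct tokens within a window is incremented, then the counts
    are normalized by the maximum count (as in the patent's final step)."""
    coo = defaultdict(int)
    for i in range(len(tokens)):
        for j in range(i + 1, min(i + window, len(tokens))):
            pair = tuple(sorted((tokens[i], tokens[j])))  # undirected pair
            if pair[0] != pair[1]:
                coo[pair] += 1
    peak = max(coo.values(), default=1)
    return {p: c / peak for p, c in coo.items()}

coo = cooccurrence(["网", "络", "专", "家", "网", "络"])
```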
The word network is built as follows:
To build the weighted word network, the weight matrix Wm of the network is obtained first. The weight matrix is defined as Wm = α·Cm + β·Sm, where α is 0.3 and β is 0.7, strengthening the semantic relations between words and weakening their co-occurrence relations. Wm serves as the adjacency matrix of the input word network; the corresponding network is defined as G = {V, E}, where the graph G is an undirected weighted graph, V is the vertex set of G, E is the edge set of G, and vi is the i-th vertex (word) in V.
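The combination of the two matrices can be sketched as follows. The exact combining formula is not spelled out in the source; the linear form W = α·C + β·S is an assumption implied by "strengthen the semantic relations, weaken the co-occurrence relations" with α = 0.3 and β = 0.7.

```python
def weight_matrix(Cm, Sm, alpha=0.3, beta=0.7):
    """Weighted adjacency matrix of the word network, assuming the linear
    combination W = alpha*C + beta*S (the formula itself is an assumption;
    alpha=0.3 and beta=0.7 are the values stated in the text)."""
    m = len(Sm)
    return [[alpha * Cm[i][j] + beta * Sm[i][j] for j in range(m)]
            for i in range(m)]

W = weight_matrix([[0.0, 1.0], [1.0, 0.0]],   # co-occurrence matrix Cm
                  [[1.0, 0.5], [0.5, 1.0]])   # semantic matrix Sm
```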
The aggregation feature value of a word is computed as follows:
The key characteristics of the word network are its degree distribution, average shortest path length, clustering degree, and clustering coefficient. The degree of a node reflects how the node is associated with other nodes; its clustering degree and clustering coefficient reflect how densely the nodes in its neighborhood are interconnected; the degree and clustering coefficient together reflect the node's importance within its local range. The present invention computes a node's aggregation feature value from its weighted degree, clustering coefficient, and node betweenness, so that important words receive higher weights while words related to many important words also receive high scores.
In the semantic similarity network, the unordered pair (vi, vj) denotes the edge between nodes vi and vj. The weighted degree of node vi is defined as the sum of the weights of its incident edges, WDi = Σj wij, where wij is the weight of the edge between vi and vj and n is the total number of nodes.
The unweighted degree of node vi is Di = |{(vi, vj) : (vi, vj) ∈ E, vi, vj ∈ V}|. The clustering degree Ti of node vi is the number of actual edges existing among its neighbor nodes: Ti = |{(vj, vk) : (vi, vj) ∈ E, (vi, vk) ∈ E, (vj, vk) ∈ E, vi, vj, vk ∈ V}|. The clustering coefficient Ci of node vi is then defined as Ci = 2Ti / (Di(Di − 1)).
In the semantic similarity network, the betweenness Bi of node vi is the probability that the shortest path between two nodes w and x passes through vi. The interaction between two non-adjacent nodes depends on the nodes on the shortest paths connecting them; such nodes potentially play the role of controlling the information flow between the pair. Bi reflects the connecting degree of vi in its local environment, and the node betweenness is defined as
Bi = Σ(w ≠ x ≠ vi) dvi(w, x) / d(w, x)
where d(w, x) is the number of shortest paths between any two nodes w and x in the weighted semantic similarity network, and dvi(w, x) is the number of those shortest paths that pass through vi (vi ∈ G).
The weighted degree, clustering coefficient, and betweenness of node vi are combined with weights to comprehensively measure the node's aggregation feature value. The aggregation feature value Zi of node vi is defined as
Zi = a·WDi + b·Ci + c·Bi
where a + b + c = 1.
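The combination Zi = a·WDi + b·Ci + c·Bi can be sketched as follows. The source constrains a + b + c = 1 but does not fix the individual values; 0.4/0.3/0.3, the toy network, and the clustering/betweenness inputs are illustrative assumptions.

```python
def weighted_degree(W, i):
    """Weighted degree WD_i of node i: sum of its incident edge weights."""
    return sum(W[i][j] for j in range(len(W)) if j != i)

def aggregation_value(wdeg, clustering, betweenness, a=0.4, b=0.3, c=0.3):
    """Aggregation feature value Z_i = a*WD_i + b*C_i + c*B_i, a+b+c = 1.
    The split 0.4/0.3/0.3 is an illustrative assumption."""
    assert abs(a + b + c - 1.0) < 1e-9
    return a * wdeg + b * clustering + c * betweenness

# 3-node toy word network (weights from a combined Wm matrix).
W = [[0.0, 0.6, 0.2],
     [0.6, 0.0, 0.4],
     [0.2, 0.4, 0.0]]
# Clustering coefficient 1.0 and betweenness 0.0 supplied for illustration.
z0 = aggregation_value(weighted_degree(W, 0), 1.0, 0.0)
```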
The statistical feature value of a word is computed as follows:
A nonlinear function is used to normalize the word frequency. The word-frequency weight TFi of word Wi in the text is defined in terms of the word-frequency statistics function f, where pj denotes a word in the text.
In Chinese text, the features that identify a text are usually content words such as nouns, verbs, and adjectives, whereas function words such as interjections, prepositions, and conjunctions are meaningless for determining the text category and introduce great interference if extracted as feature words. The part-of-speech weight posi of word Wi in the text is defined accordingly.
Longer words reflect more concrete information; conversely, the meaning of shorter words is more abstract. In particular, the feature words in documents are mostly specialized academic compound terms whose greater length makes their meaning clearer and better reflects the text topic. Increasing the weight of long words counteracts the splitting of such terms and reflects a word's importance in a document more accurately. The word-length weight leni of word Wi in the text is defined accordingly.
For each word in the word sequence, the statistical feature value is
statsi = A·TFi + B·posi + C·leni
where A + B + C = 1.
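The statistical feature value can be sketched as follows. The source does not give the nonlinear TF normalization or the exact part-of-speech and length formulas, so the logarithmic TF, the linear length ratio, and the weights A, B, C are all illustrative assumptions.

```python
import math

def stats_value(tf, pos_weight, length, max_length, A=0.5, B=0.3, C=0.2):
    """Statistical feature value stats_i = A*TF_i + B*pos_i + C*len_i with
    A+B+C = 1. The log-based TF normalization and the length ratio are
    stand-ins for the unspecified formulas in the source."""
    tf_w = math.log(1 + tf)          # assumed nonlinear word-frequency weight
    len_w = length / max_length      # assumed word-length weight in [0, 1]
    return A * tf_w + B * pos_weight + C * len_w

# Example chosen so that log(1 + tf) = 1 exactly.
s = stats_value(tf=math.e - 1, pos_weight=1.0, length=2, max_length=4)
```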
The criticality of word Wi is computed as follows:
For each node in the weighted word network, its criticality value Impi is defined as
Impi = β·statsi + (1 − β)·Zi
where 0 < β < 1.
The criticality values are computed and sorted in descending order; a threshold γ (0 < γ < 1) is set and the top q values are taken. The corresponding words serve as the feature words of the science and technology project: they fully reflect the topic and are the more important words.
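The ranking and selection step can be sketched as follows. The source sets a threshold γ in (0, 1) and takes the top q values without relating the two explicitly; interpreting q as the fraction γ of the word set is an assumption.

```python
def select_feature_words(words, imp, gamma=0.5):
    """Rank words by criticality Imp_i and keep the top fraction.
    Interpreting q = gamma * len(words) is an assumption; the source
    only states that a threshold gamma in (0,1) and top-q are used."""
    ranked = sorted(words, key=lambda w: imp[w], reverse=True)
    q = max(1, int(gamma * len(words)))
    return ranked[:q]

imp = {"网络": 0.9, "专家": 0.7, "方法": 0.4, "情况": 0.1}
top = select_feature_words(list(imp), imp, gamma=0.5)
```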
Step 4. Extract the feature words of the review experts. The amount of information per expert is small compared with project information, so the project technique of building a network and extracting features from statistical and semantic features is not suitable for expert information. Instead, stop words are filtered directly with the general and domain stop-word dictionaries to extract each expert's feature word set. The general stop-word dictionary again adopts the Harbin Institute of Technology stop-word list, while the domain stop-word dictionary requires continuous manual maintenance.
Step 5. Build the per-field knowledge representation models of the science and technology projects and the review experts by extending the vector space model and the matter-element knowledge-set model. From the field information of a project, the text representation model PRO = (id, F, WF, T, V) is established, where id is the identification field in the project library; F is the set of field categories in the project; WF is the set of field weights; T is the feature words; and V is the set of words and their weights per field, i.e., Vi = {vi1, f(vi1), vi2, f(vi2), ..., vin, f(vin)}, where vij is the j-th feature word of the i-th field and f(vij) is the frequency of the keyword vij. The knowledge representation of the project information follows this model.
Similarly, the knowledge representation model TM = (id, F, WF, T, V) is established from the field information of an expert, where id is the identification field in the expert library; F is the set of field categories of the review expert; WF is the set of field weights; T is the feature words; and V is the set of feature words and their weights per field, Vi = {vi1, f(vi1), vi2, f(vi2), ..., vin, f(vin)}, where vij is the j-th feature word of the i-th field and f(vij) is the frequency of occurrence of vij in the corresponding field. The knowledge representation of the review expert information follows this model.
Building the review expert information index library: after the expert knowledge representation models are built, the information is indexed and stored. First, the content of one review expert is read from the expert library; a phrase semantic network is built from the segmentation result and the expert's feature words are extracted; an index is built on the knowledge representation model using Apache Lucene; and the established index is added to the corresponding index library by category, until all review experts have been indexed and stored.
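The indexing idea can be sketched as a plain inverted index. Apache Lucene itself is a Java library; this Python stand-in only illustrates the word-to-expert mapping that the index library provides, not Lucene's actual API.

```python
from collections import defaultdict

def build_expert_index(experts):
    """Stand-in for the Lucene index of Step 5: an inverted index mapping
    each feature word to the set of expert ids containing it."""
    index = defaultdict(set)
    for expert_id, feature_words in experts.items():
        for w in feature_words:
            index[w].add(expert_id)
    return index

idx = build_expert_index({"E1": {"网络", "推荐"}, "E2": {"网络", "化学"}})
```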
Step 6. According to the number of projects, recommendation is divided into recommending experts for a single pending project and for a group of (multiple) pending projects. For group recommendation, feature merge operations are performed both across the fields of each pending project's knowledge representation model from Step 5 and between the projects; for single-project recommendation, only the cross-field feature merge is performed. Meanwhile, the knowledge representation models of the review experts undergo the cross-field feature merge. Indexes are then built on the merged feature information with Apache Lucene according to the knowledge representation model; the project index is built at the time of recommendation.
In a project application management system, pending projects often need to be recommended in groups. The feature merge operations above must preserve the differing contributions to the similarity computation made by the different field weights set in Step 5.
The feature merging of pending projects and review experts through the logical merge operation ⊕ proceeds as follows:
(1) Cross-field feature merge within a pending project or a review expert
Suppose field feature word sets W'1 and W'2 are merged; the merge rule W'1 ⊕ W'2 is defined over their feature words word1i and word2j. Adding field weights improves and extends this definition, giving the cross-field merge rule for review experts and science and technology projects.
(2) Inter-project feature merge within a group of pending projects
This merge operates only on the feature vectors of the pending science and technology projects, not on the expert feature vectors; expert feature vectors need only the cross-field merge. Let V(d1) and V(d2) be the vector models of two projects after their respective cross-field merges. For any t1j ∈ V(d1) and t2j ∈ V(d2), if t1j and t2j are identical, they are merged, where tk (k = 1, ..., n) are the feature terms and wk(p) is the weight of tk.
The basic process producing the knowledge representation model of a science and technology project group is as follows:
a) Merge the cross-field features of each project to obtain its vector model V(d);
b) Apply the merge strategy to the set of all project vector models. Through this method, a vector-space-based knowledge representation model is established for the project group:
V(p) = {<t1, w1(p)>, <t2, w2(p)>, ..., <tn, wn(p)>}
where k = 1, ..., n, tk is a feature term of the project group, and wk(p) is the weight of tk.
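The inter-project merge of weighted term vectors can be sketched as follows. The source states that identical terms are merged but leaves the combined-weight formula unspecified; summing the weights here is an assumption.

```python
def merge_vectors(v1, v2):
    """Inter-project merge sketch: identical feature terms are merged and,
    as an assumption, their weights are summed (the source does not give
    the combined-weight formula)."""
    merged = dict(v1)
    for term, w in v2.items():
        merged[term] = merged.get(term, 0.0) + w
    return merged

# Two project vectors after their cross-field merges (illustrative values).
group = merge_vectors({"网络": 0.5, "推荐": 0.3}, {"网络": 0.2, "化学": 0.4})
```

Folding the whole project set through this function left to right yields the group model V(p).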
Step 7. After the cross-field merges of the review expert models and the project knowledge representation models in Step 6, suppose the review expert information vector is P = {s1, f(s1), s2, f(s2), ..., sn, f(sn)} and the project (group) information vector is Q = {t1, f(t1), t2, f(t2), ..., tn, f(tn)}. The semantic similarity between the pending project (group) vector and each review expert is computed based on a maximum matching algorithm.
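The maximum-matching similarity of Step 7 (cf. the bipartite-graph matching of Fig. 2) can be sketched as follows. For clarity this uses brute force over permutations rather than an efficient matching algorithm, and the averaging normalization is an assumption; it is adequate only for the small term sets of this sketch.

```python
from itertools import permutations

def matching_similarity(p_terms, q_terms, sim):
    """Semantic similarity of two term lists via maximum bipartite
    matching: each term of the shorter list is paired one-to-one with a
    term of the longer list so that the total pairwise similarity is
    maximized, then averaged (normalization is an assumption)."""
    short, long_ = sorted((p_terms, q_terms), key=len)
    best = 0.0
    for perm in permutations(long_, len(short)):
        total = sum(sim(a, b) for a, b in zip(short, perm))
        best = max(best, total)
    return best / max(len(short), 1)

# Exact-match similarity standing in for the semantic SimSEM of Step 3.
sim = lambda a, b: 1.0 if a == b else 0.0
score = matching_similarity(["网络", "推荐"], ["推荐", "化学", "网络"], sim)
```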
Step 8. A similarity cutoff is set; recommendation indices are produced according to the similarity values, and the final recommended review expert list is generated.
The beneficial effects of the present invention are as follows:
Review experts for science and technology projects can be recommended quickly, conveniently, intelligently, and accurately. The workload of scientific staff in distributing projects to review experts within a project application management system is significantly reduced, lowering management costs. A high domain matching degree between review experts and pending projects is ensured, so that the experts' evaluations achieve objectivity, fairness, and scientific rigor. Automatic, efficient, and impartial decision support is provided, avoiding improper review problems in project approval such as favoritism networks and the "Matthew effect".
Brief description of the drawings
Fig. 1 shows the sliding window used for the word co-occurrence degree computation in the present invention.
Fig. 2 is a schematic diagram of the bipartite-graph-based maximum matching algorithm in the present invention.
Fig. 3 is a flow chart of the intelligent recommendation method for science and technology project review experts in the present invention.
Fig. 4 is a flow chart of the feature-word extraction for project and review expert information in the present invention.
Fig. 5 is a flow chart of building the review expert knowledge index library in the present invention.
Detailed description of the embodiments
The invention is further described below with reference to the accompanying drawings. It should be emphasized that the following description is merely exemplary and is not intended to limit the scope of the invention or its applications. Based on the embodiments of the invention, all other embodiments obtained by those of ordinary skill in the art without creative work fall within the protection scope of the invention.
As shown in Fig. 3, the main idea of the recommendation method of the present invention is: (1) the expert information and pending project information in the project application management system are cut into substring sequences, segmented with the Chinese Academy of Sciences ICTCLAS segmenter, and filtered for stop words to obtain word sets; (2) project information, which includes the main research content, technical specifications, and other fields, carries a relatively large amount of information, so the invention builds a word network from the semantic relations and co-occurrence relations of words, computes the node aggregation feature values of the network, weights them with the statistical feature values to obtain word criticality, and extracts the feature words of each project; (3) expert information is simpler and smaller than project information, so the filtered word set of each expert is used directly as its feature words; (4) field weights are set according to the differing importance of project and expert fields, knowledge representation models are built for projects and experts from the feature words of (2) and (3), and the expert index library is built; (5) for group recommendation, the pending-project knowledge representation models undergo feature merges both across fields and between projects, whereas single-project recommendation needs only the cross-field merge; the expert knowledge representation models undergo the cross-field merge as well; (6) considering the semantic fuzzy-matching property of words, the similarity between expert information and pending project information is computed, and the final recommended expert list is produced by a similarity threshold cutoff.
Step 1. disables dictionary using the general term in science and technology item and expert info and usual word as specialty;Punctuate is accorded with
Number, non-Chinese character is as cutting signature library.
Step 2. carries out participle to science and technology item information, expert info: according to cutting labelling in science and technology item information, by item
The information such as mesh title, main research, technical specification are cut into substring sequence;According to cutting labelling in evaluation expert's information,
Project that extraction expert info, prize-winning situation, invention situation, the situation that publishes thesis, problem undertook and performance, research side
To etc. information be cut into substring sequence, that is one field information of a sub-string sequence;Utilize Chinese Academy of Sciences ICTCLAS antithetical phrase string sequence
Carry out participle.
Step 3. science and technology item feature word extracts: utilizes and general disables dictionary and specialty disables dictionary and stops participle
Word is filtered, and the general dictionary that disables uses Harbin Institute of Technology to disable vocabulary, using the word segmentation result removing stop words as a word collection
Close, see Fig. 4.
It is the most perfect process of a self study that specialty disables the structure of dictionary, constantly adds up during information participle
The word frequency of word, the probability that word occurs at text is more than certain threshold value, it is brought into and disable dictionary.
Science and technology item quantity of information is relatively big, and set of words is carried out Semantic Similarity Measurement between word, according to the semantic pass of word
The cooccurrence relation of system and word builds term network, calculates the word aggregation characteristic value in network;Statistics then in conjunction with word is special
Value indicative, the criticality calculating word extracts science and technology item feature word;The feature word of science and technology item extracts comprehensively exactly
The statistical nature information of text and semantic feature information, more accurately extract feature word.
Described Semantic Similarity Measurement process is as follows:
In knowing net semantic dictionary, if for two word W1And W2, W1There is a n concept: S11, S12 ..., S1n, W2
There is a m concept: S21, S22 ..., S2m.Word W1And W2Similarity SimSEM (W1, W2) equal to the similarity of each concept
Maximum:
Notional word and function word have different description language, need to calculate the syntax justice of its correspondence is former or relation adopted former between
Similarity.Notional word concept includes that the first basic meaning is former, other basic meanings are former, the adopted former description of relation, relational symbol describe, similarity
It is designated as Sim1 (p respectively1,p2)、Sim2(p1,p2)、Sim3(p1,p2)、Sim4(p1,p2).The Similarity Measure of two feature structures
Finally revert to the Similarity Measure of the former or concrete word of basic meaning.
βi(1≤i≤4) are adjustable parameters, and have: β1+β2+β3+β4=1, β1≥β2≥β3≥β4。
If CW={C1, C2 ..., Cm} be process after the set of words that obtains, the semantic similarity adjacency matrix of its correspondence
SmIt is defined as:
Wherein, Sim (C1,C2) it is word C1With word C2Semantic similarity, Sim (Ci,Ci) it is 1, Sim (Ci,Cj)=Sim
(Cj,Ci)。
Set of words CW={C1, C2 ..., Cm} is calculated m × (1+m)/2 word through semantic similarity
Between the value of similarity.
It is as follows that the cooccurrence relation of described word calculates process:
Word co-occurrence model is one of important models of natural language processing research field based on statistical method.According to word altogether
Existing model, if two frequent co-occurrences of word are at the same window unit (such as a word, a paragragh etc.) of document, the two word exists
Being to be mutually related in meaning, they express the semantic information of the text to a certain extent.Utilize sliding window (sliding window
A length of 3) word in sequence of terms is carried out word co-occurrence degree calculate, sliding window as shown in Figure 1:
First, sequence of terms is carried out word extraction, i.e. remove space, null and merge identical word, obtain word
Set CW={C1, C2 ..., Cm}, wherein m≤n.
Word co-occurrence degree Matrix C m corresponding to set of words CW is defined as:
When Cm is initial, Coo (Ci, Cj) is 01 (1≤i, j≤m).
By sliding window, sequence of terms being carried out word co-occurrence degree to calculate, the word in sliding window is Ti-1TiTi+1(1<i
< n):
1) if i=n-1,4 are turned);If Ti-1Being space or null, sliding window slides to next word, i++;Otherwise, 2 are turned).
2) if TiFor Chinese, then Coo (Ti-1,Ti) ++, turn 3);If TiFor null, turn 3);Otherwise turn 1).
3) if TiChinese, then Coo (Ti-1,Ti+1) ++, i++, turn 1);Otherwise, 1 is turned).
4) if Tn-2It is Chinese, turns 5);Otherwise, 7 are turned)
5) if Tn-1It is Chinese, Coo (Tn-2,Tn-1) ++, turn 6);If Tn-1It is space, turns 6);Otherwise terminate.
6) if TnIt is Chinese, Coo (Tn-2,Tn) ++, terminate;Otherwise terminate.
7) if Tn-1It is Chinese, and TnAlso be Chinese, then Coo (Tn-1,Tn) ++, terminate;Otherwise terminate.
Through the calculating of previous step, obtain word co-occurrence degree Matrix C m, and each element of Cm is normalized
Processing, namely each element is divided by the maximum of all elements in matrix, i.e. max{Coo (Ci,Cj)|1≤i,j≤m}。
Described term network is as follows:
When building cum rights term network, first having to obtain the weight matrix of term network, definition weight matrix Wm is:
Wherein, α is 0.3, and β is 0.7, the semantic relation between strengthening word, weakens the cooccurrence relation between word.
WmAs the adjacency matrix that the term network of input is corresponding, then the network of its correspondence is defined as: G={V, E};Its
Middle figure G is undirected weighted graph, and V represents the vertex set in figure G, and E represents the limit collection in G, viRepresent i-th summit (word) in V.
The calculation of the word aggregation characteristic value is as follows:
Important characteristics of a term network include the degree distribution, the average shortest path length, the clustering degree and the clustering coefficient. The degree of a node reflects how the node is associated with the other nodes. The clustering degree and clustering coefficient of a node reflect the interconnection density among the nodes in its local neighborhood, and together with the degree they reflect the node's importance within that neighborhood. The present invention computes the aggregation characteristic value of a node from its weighted degree, clustering coefficient and node betweenness, so that important words receive higher weights while words related to many important words also receive high scores.
In the semantic similarity network, the unordered pair (v_i, v_j) denotes the edge between nodes v_i and v_j. The weighted degree of node v_i is then defined as the sum of the weights of its incident edges:
D_i^w = Σ_{j=1}^{n} w_ij
where w_ij is the weight of the edge between nodes v_i and v_j and n is the total number of nodes.
In the semantic similarity network, the unordered pair (v_i, v_j) denotes the edge between nodes v_i and v_j. The unweighted degree D_i of node v_i is D_i = |{(v_i, v_j) : (v_i, v_j) ∈ E, v_i, v_j ∈ V}|. The clustering degree of node v_i is the number of edges that actually exist between its neighbor nodes: T_i = |{(v_j, v_k) : (v_i, v_j) ∈ E, (v_i, v_k) ∈ E, (v_j, v_k) ∈ E, v_j, v_k ∈ V}|. The clustering coefficient C_i of node v_i is then defined as:
C_i = 2T_i / (D_i(D_i − 1))
In the semantic similarity network, the node betweenness is the probability that a shortest path between two nodes w and x passes through node v_i. The interaction between two non-adjacent nodes depends on the nodes lying on the shortest paths connecting them; these nodes potentially play the role of controlling the information flow between the nodes. B_i reflects the connectivity of node v_i in its local environment. The node betweenness is then defined as:
B_i = Σ_{w ≠ x} d_{v_i}(w, x) / d(w, x)
where d(w, x) denotes the number of shortest paths between any two nodes w and x in the weighted semantic similarity network, and d_{v_i}(w, x) denotes the number of shortest paths between w and x that pass through v_i (v_i ∈ G).
The weighted degree, clustering coefficient and betweenness of node v_i are combined to weigh the aggregation characteristic value of the node. The aggregation characteristic value Z_i of node v_i is defined as:
Z_i = a·D_i^w + b·C_i + c·B_i
where a + b + c = 1.
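Under the definitions above, the aggregation value Z_i can be sketched in pure Python for a small graph. Two simplifications are assumed: shortest paths are counted by hop count rather than edge weight, and the coefficients a, b, c are illustrative values, not the patent's.

```python
from collections import defaultdict, deque
from itertools import combinations

def aggregation_values(edges, a=0.4, b=0.3, c=0.3):
    """edges: {(u, v): weight} of the undirected weighted term network.
    Returns Z_i = a*Dw_i + b*C_i + c*B_i for every node; shortest paths
    are taken over hop count (not edge weight) for simplicity."""
    adj = defaultdict(set)
    dw = defaultdict(float)                      # weighted degree D_i^w
    for (u, v), w in edges.items():
        adj[u].add(v); adj[v].add(u)
        dw[u] += w; dw[v] += w
    nodes = sorted(adj)

    def clustering(i):                           # C_i = 2*T_i / (D_i*(D_i-1))
        nb, d = adj[i], len(adj[i])
        if d < 2:
            return 0.0
        t = sum(1 for x, y in combinations(nb, 2) if y in adj[x])
        return 2.0 * t / (d * (d - 1))

    def bfs(s):                                  # hop distances + path counts
        dist, sigma, q = {s: 0}, defaultdict(float), deque([s])
        sigma[s] = 1.0
        while q:
            u = q.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    q.append(v)
                if dist[v] == dist[u] + 1:
                    sigma[v] += sigma[u]
        return dist, sigma

    info = {s: bfs(s) for s in nodes}
    B = defaultdict(float)                       # betweenness B_i
    for s, t in combinations(nodes, 2):
        dist_s, sig_s = info[s]
        dist_t, sig_t = info[t]
        if t not in dist_s:
            continue                             # disconnected pair
        for v in nodes:
            if v in (s, t) or v not in dist_s or v not in dist_t:
                continue
            if dist_s[v] + dist_t[v] == dist_s[t]:   # v lies on a shortest path
                B[v] += sig_s[v] * sig_t[v] / sig_s[t]
    return {v: a * dw[v] + b * clustering(v) + c * B[v] for v in nodes}
```

On the path a - b - c, the middle node b gets a higher Z value than the endpoints, since it has both the larger weighted degree and all the betweenness.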
The calculation of the statistical characteristic value of a word is as follows:
A nonlinear function is used to normalize the word frequency. The term frequency weight TF_i of word W_i in the text is defined on this basis, where TF_i denotes the term frequency weight of word W_i, p_j denotes a word in the text, and f is the word frequency counting function.
The features that identify a Chinese text are usually content words, such as nouns, verbs and adjectives, while function words such as interjections, prepositions and conjunctions are meaningless for determining the text category and, if extracted as feature words, introduce great interference. The part-of-speech weight pos_i of word W_i in the text is defined on this basis.
The longer a word is, the more concrete the information it can reflect; conversely, the meaning of a shorter word is more abstract. In particular, the feature words in documents are mostly specialized academic compound terms; the longer they are, the more definite their meaning and the better they reflect the text topic. Increasing the weight of long words helps to avoid splitting such vocabulary and thus reflects the importance of a word in a document more accurately.
The word length weight len_i of word W_i in the text is defined on this basis. For each word in the word sequence, its statistical characteristic value is
stats_i = A·TF_i + B·pos_i + C·len_i
where A + B + C = 1.
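A minimal sketch of stats_i follows. The exact nonlinear TF function, part-of-speech table and coefficients A, B, C are not published in the text, so a logarithmic TF normalization, an illustrative POS weight table and a capped length weight are assumed here.

```python
import math

# Illustrative values; the exact POS table and A, B, C are not given in the text.
POS_WEIGHT = {"n": 1.0, "v": 0.8, "a": 0.6}   # noun / verb / adjective
A, B, C = 0.5, 0.3, 0.2                       # must satisfy A + B + C = 1

def stats(word, pos, freq, max_freq, max_len=6):
    """stats_i = A*TF_i + B*pos_i + C*len_i with an assumed log TF
    normalization and a length weight capped at max_len characters."""
    tf = math.log(1 + freq) / math.log(1 + max_freq)  # nonlinear TF in [0, 1]
    p = POS_WEIGHT.get(pos, 0.0)                      # function words get 0
    ln = min(len(word), max_len) / max_len            # longer words weigh more
    return A * tf + B * p + C * ln
```

With these choices, a long content word outranks a short function word of equal frequency, as the text intends.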
The calculation of the keyness of word W_i is as follows:
For each node in the weighted term network, its keyness value Imp_i is defined as:
Imp_i = β·stats_i + (1 − β)·Z_i
where 0 < β < 1.
The keyness values obtained by this calculation are sorted in descending order; a threshold γ (0 < γ < 1) is set and the top q values are taken. These words then serve as the feature words of the science and technology project: they fully reflect the topic and are the more important words.
Step 4. Extraction of the evaluation experts' feature words: the amount of information about an evaluation expert is small compared with that of a science and technology project, so the feature word extraction technique used for projects, which builds a network and relies on statistical and semantic features, is not suitable for expert information. Instead, stop words are filtered directly according to a general stop-word dictionary and a specialized stop-word dictionary, and the feature word set of each expert is extracted. The general stop-word dictionary is the Harbin Institute of Technology stop-word list, while the specialized stop-word dictionary needs to be maintained continuously by personnel.
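For the experts, step 4 reduces to stop-word filtering; a sketch with stand-in dictionaries follows (the real system uses the HIT stop-word list as the general dictionary and a manually maintained specialty dictionary).

```python
# Stand-in dictionaries, for illustration only.
GENERAL_STOPWORDS = {"的", "了", "和", "是"}
DOMAIN_STOPWORDS = {"教授", "研究员"}

def expert_feature_words(tokens):
    """Filter a segmented expert profile against both stop-word
    dictionaries; whatever survives is the expert's feature word set."""
    banned = GENERAL_STOPWORDS | DOMAIN_STOPWORDS
    return {t for t in tokens if t not in banned}
```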
Step 5. Building the field-partitioned knowledge representation models of science and technology projects and evaluation experts: the vector space model and the matter-element knowledge set model are extended, and the text representation model PRO = (id, F, WF, T, V) is established according to the different field information in a science and technology project, where id denotes the identification field in the project library; F denotes the set of field categories in the project; WF is the field weight; T is the feature word; and V denotes the words corresponding to a field and their weights, i.e. V_i = {v_i1, f(v_i1), v_i2, f(v_i2), ..., v_in, f(v_in)}, where v_ij denotes the j-th feature word of the i-th field and f(v_ij) denotes the frequency of the keyword v_ij. The knowledge representation of science and technology project information is as follows:
Similarly, the knowledge representation model TM = (id, F, WF, T, V) is established according to the different field information of an expert, where id denotes the identification field in the expert database; F denotes the set of field categories of the evaluation expert; WF is the set of field weights; T is the feature word; and V denotes the feature words corresponding to a field and their weights, i.e. V_i = {v_i1, f(v_i1), v_i2, f(v_i2), ..., v_in, f(v_in)}, where v_ij denotes the j-th feature word of the i-th field and f(v_ij) denotes the frequency of occurrence of the feature word v_ij in the corresponding field. The knowledge representation of evaluation expert information is:
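One plausible way to hold the (id, F, WF, T, V) tuple in code is sketched below; the type names are hypothetical, introduced only for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class FieldKnowledge:
    """One field of PRO/TM: its category name (an element of F), its
    weight WF, and V_i as a map feature word v_ij -> frequency f(v_ij)."""
    name: str
    weight: float
    terms: Dict[str, float]

@dataclass
class KnowledgeModel:
    """PRO = (id, F, WF, T, V) for a project, or TM for an expert."""
    id: str
    fields: List[FieldKnowledge] = field(default_factory=list)

    def feature_words(self):
        """All feature words T across the model's fields."""
        return {t for fld in self.fields for t in fld.terms}
```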
Building the evaluation expert information index library: after the expert knowledge representation model has been built, the information is indexed and stored. First, the content field information of one evaluation expert is read from the expert database; the word semantic network is built on the basis of the word segmentation result and the feature words contained in the expert information are extracted; an index is built for them with Apache Lucene according to the knowledge representation model; the established index is added by category to the corresponding index library, until all evaluation experts have been indexed and stored, see Fig. 5.
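In place of Apache Lucene (which the actual system uses), a minimal per-category inverted index conveys the warehousing idea of the step above.

```python
from collections import defaultdict

class ExpertIndex:
    """Minimal stand-in for the Lucene index libraries: one inverted
    index per field category, mapping feature word -> expert ids."""
    def __init__(self):
        self.by_category = defaultdict(lambda: defaultdict(set))

    def add(self, category, expert_id, feature_words):
        """Warehouse one expert's feature words under a category."""
        for w in feature_words:
            self.by_category[category][w].add(expert_id)

    def lookup(self, category, word):
        """Return the ids of the experts indexed under word in category."""
        return set(self.by_category[category].get(word, set()))
```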
Step 6: According to the number of projects, the recommendation mode is divided into recommending experts for a single pending project and recommending experts for a group of (multiple) pending projects. Group recommendation performs feature merging both between the corresponding fields and between the projects of the pending project knowledge representation models of step 5, while single-project recommendation performs only the inter-field feature merging. Meanwhile, inter-field feature merging is performed on the knowledge representation models of the evaluation experts of step 5. An index is then built on the merged feature information with Apache Lucene according to the knowledge representation model; the science and technology project index is constructed at the time the project recommendation is carried out.
In a science and technology project application management system, pending projects often need to be recommended in groups. The feature merging described above ensures that the different field weights set by the knowledge representation model in step 5 make different contributions to the similarity computation and hence to the recommendation.
The feature merging of pending projects and evaluation experts through the logical OR (set union) operation proceeds as follows:
(1) Inter-field feature merging for a pending project or an evaluation expert
Assume the field feature word sets W'_1 and W'_2 are to be merged; the merge rule for W'_1 and W'_2 is their union, where word1_i and word2_j are feature words.
Adding the field weights improves and extends the above definition; in the merge rule for the inter-field feature merging of evaluation experts and science and technology projects, each feature word is weighted by the weight of the field it comes from.
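A sketch of the weighted inter-field union follows, under the assumption that a word appearing in both fields accumulates its field-weighted frequencies; the patent's exact rule is given as an image and is not reproduced here.

```python
def merge_fields(v1, wf1, v2, wf2):
    """Union of two field feature-word maps (word -> frequency f(v_ij)).
    Each frequency is scaled by its field weight WF, and a word present
    in both fields accumulates both contributions.  The weighting scheme
    is an assumption based on the text, not the patent's exact rule."""
    merged = {w: wf1 * f for w, f in v1.items()}
    for w, f in v2.items():
        merged[w] = merged.get(w, 0.0) + wf2 * f
    return merged
```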
(2) Inter-project feature merging for a group of pending projects
This merging operation applies only to the feature vectors of the pending science and technology projects, not to the expert feature vectors; an expert feature vector only needs the inter-field feature merging. Let V(d_1) and V(d_2) be the vector models of two science and technology projects after inter-field feature merging; for any t_{1j} ∈ V(d_1) and t_{2j} ∈ V(d_2), if t_{1j} and t_{2j} are identical, they are merged, where k = 1, ..., n, t_k is a feature term and w_k(p) is the weight of t_k.
The basic process of generating the knowledge model representation of a science and technology project group is as follows:
a) Merge the inter-field features of each science and technology project to obtain the vector model V(d) of the project;
b) Apply the merging strategy to the set of all science and technology project vector models.
By the above method, a vector-space-based knowledge representation model is established for the science and technology projects:
V(p) = {<t_1, w_1(p)>, <t_2, w_2(p)>, ..., <t_n, w_n(p)>}
where k = 1, ..., n, t_k is a feature term of the project group and w_k(p) is the weight of t_k.
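Steps a)-b) above can be sketched as follows, assuming that identical terms are merged by summing their weights (the patent's merge formula is an image and is not reproduced here).

```python
def merge_project_vectors(vectors):
    """Merge the inter-field-merged vectors V(d1), V(d2), ... of a
    project group into V(p): identical terms t_k are merged, here by
    summing their weights w_k (an assumption; the formula image is
    omitted from the text)."""
    group = {}
    for vec in vectors:
        for t, w in vec.items():
            group[t] = group.get(t, 0.0) + w
    return group
```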
Step 7. After the inter-field feature merging of the evaluation experts and of the science and technology project knowledge representation models in step 6, suppose the evaluation expert information vector is expressed as P = {s_1, f(s_1), s_2, f(s_2), ..., s_n, f(s_n)} and the science and technology project (group) information vector is expressed as Q = {t_1, f(t_1), t_2, f(t_2), ..., t_n, f(t_n)}; the semantic similarity between the pending project (group) vector and the evaluation expert is computed based on a maximum matching algorithm.
The semantic similarity computation between the pending science and technology project (group) vector and the evaluation expert vector based on the bipartite-graph maximum matching algorithm is as follows:
Computing semantic similarity based on the maximum matching algorithm means computing the similarity of two texts with a bipartite-graph-based maximum matching algorithm. As shown in Fig. 2, the bipartite-graph-based maximum matching algorithm computes the similarity of the feature items; its principle is to take each feature word of the science and technology project (group) vector as a vertex of part X and each feature word of the evaluation expert vector as a vertex of part Y, which is equivalent to finding the maximum weight matching of a complete bipartite graph. The thick lines in Fig. 2 are the maximum semantic similarities between the X-part feature words and certain Y-part feature words.
The semantic similarity itself is obtained by the HowNet-based similarity computation. The present invention computes the semantic similarity between the pending project (group) and the evaluation expert through the HowNet semantic dictionary and the maximum matching algorithm; in the computation formula, s_i and t_j are the two word nodes corresponding to an edge of maximum semantic similarity SimSEM(s_i, t_j) (a thick line in Fig. 2), m and n are respectively the number of feature words in the science and technology project vector representation and in the evaluation expert vector representation, and p is the number of edges (thick lines in Fig. 2) of maximum semantic similarity.
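A brute-force sketch of the bipartite maximum-weight matching similarity follows. Two things are assumptions: the word similarity function is a toy stand-in for HowNet's SimSEM, and the final scaling of the matched-edge sum by 2p/(m+n) is one plausible reading of the omitted formula, not its confirmed form.

```python
from itertools import permutations

def word_sim(a, b):
    """Toy stand-in for the HowNet-based SimSEM(s_i, t_j):
    exact match scores 1.0, otherwise shared-character overlap."""
    if a == b:
        return 1.0
    common = set(a) & set(b)
    return len(common) / max(len(set(a)), len(set(b)))

def match_similarity(proj_words, expert_words):
    """Maximum-weight matching of the complete bipartite graph whose X
    part is the project (group) feature words and whose Y part is the
    expert feature words, found by brute force (adequate for small
    vectors), then scaled by 2p/(m+n) -- an assumed normalization."""
    xs, ys = list(proj_words), list(expert_words)
    if len(xs) > len(ys):                 # ensure |X| <= |Y| for permutations
        xs, ys = ys, xs
    best = 0.0
    for perm in permutations(ys, len(xs)):
        best = max(best, sum(word_sim(x, y) for x, y in zip(xs, perm)))
    m, n = len(proj_words), len(expert_words)
    return 2.0 * best / (m + n) if (m + n) else 0.0
```

For vectors of realistic size, the brute-force search would be replaced by a polynomial-time assignment algorithm such as Kuhn-Munkres.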
The above semantic similarity between the pending project (group) and the evaluation expert information involves many factors such as language, word semantics and word structure, and represents the degree of matching between the two: the larger the similarity, the higher the matching degree, and the more suitable the evaluation expert is for evaluating the project (group).
Step 8. A similarity cutoff is set, a recommendation index is produced according to the magnitude of the similarity, and the final list of recommended evaluation experts is produced.
The above is only the preferred embodiment of the present invention. It should be noted that, for intelligent machine recommendation technology in the field of science and technology project evaluation experts, several improvements and variations can also be made without departing from the technical principles of the present invention, and these improvements and variations should likewise be regarded as falling within the protection scope of the present invention.