CN113220899A

CN113220899A - Intellectual property identity identification method based on an academic talent information knowledge graph

Info

Publication number: CN113220899A
Application number: CN202110506792.XA
Authority: CN
Inventors: 郑中华; 胡淦; 王文仲
Original assignee: Shanghai Boyi Information Technology Co ltd
Current assignee: Shanghai Boyi Information Technology Co ltd
Priority date: 2021-05-10
Filing date: 2021-05-10
Publication date: 2021-08-06

Abstract

The invention discloses an intellectual property identity identification method based on an academic talent information knowledge graph. The method targets intellectual property information from across the web, such as invention patents, papers and software copyrights, and identifies the authors of that information, thereby supporting the construction of an academic talent information knowledge base for the big data industry. The invention helps to discover high-quality talent early and to reserve it effectively.

Description

Intellectual property identity identification method based on an academic talent information knowledge graph

Technical Field

The invention relates to an intellectual property identity identification method based on an academic talent information knowledge graph. The method mainly targets intellectual property information from across the web, such as invention patents, papers and software copyrights, and identifies the authors of that information. It thereby completes the construction of an academic talent information knowledge base in the big data industry and helps to discover high-quality talent early and reserve it effectively.

Background

The traditional entity linking method mainly comprises three processes: mention identification, candidate entity generation and candidate entity ordering. Mentions are mostly obtained with entity recognition techniques. Candidate entity generation generally extracts information from a knowledge base and constructs a dictionary associated with each entity; candidates are then generated by simple string matching against the dictionary, which can produce a large number of candidate entities. The knowledge base is usually Wikipedia, so the data is quite limited. Because only dictionary matching is used, too many candidate entities are produced, which wastes resources, introduces more distractors and lowers accuracy.

The mainstream candidate entity ordering methods are based on similarity comparison: the context similarity between the entity mention extracted from the text and each candidate entity retrieved from the knowledge base is computed, and the candidate with the highest similarity is selected as the link target. Most similarity computations use machine learning methods built on manually defined rules. For example, context features (keywords) and page structure features (such as page redirects and anchor text) of candidate entities on Wikipedia pages are added; or entity popularity is used to assist disambiguation; or category relatedness between entities (link relationships, co-occurrence probability, etc.) is taken into account. Methods based on manually defined rules are severely limited and cannot capture all rule information. Moreover, the context information of candidate entities drawn from Wikipedia or other encyclopedias is incomplete and disordered, which hinders accurate identification.

In addition, entity linking methods based on deep learning are becoming more popular. Compared with traditional methods, deep learning methods do not require relevant features to be defined manually. For example, document representations of entities are learned with a deep neural network (DNN) and category representations are obtained with a CNN; or the mention, the entity and the context are embedded, features are extracted with a CNN, and the similarity between mention and entity is computed for linking; or a pre-trained BERT language model analyzes the context of the entity mention and the related information of the candidate entities, improving entity linking through better semantic analysis. However, deep learning methods require large amounts of data and place high demands on machine performance, especially in big data scenarios. They also depend entirely on the corpus, so relying on deep learning alone may give worse results when the corpus is biased.

Disclosure of Invention

The invention aims to provide an intellectual property identity identification method based on an academic talent information knowledge graph.

To solve the above technical problems, the invention adopts the following technical scheme: an intellectual property identity identification method based on an academic talent information knowledge graph, comprising the following steps:

(1) a crawler acquires talent information data comprising names, resumes and intellectual property information, and a talent information knowledge graph is built in neo4j from this information; the knowledge graph is composed of triples of the form E = <sub, rela, obj> and specifically comprises attribute information of the entities and the relationships among the entities;

(2) mention identification: for intellectual property information M to be linked, the structured formal feature information and the text information of the intellectual property are obtained directly from M based on regular-expression rules, giving M = (M_1, M_2, …, M_n), where each M_i is an obtained mention;

(3) candidate entity generation: the intellectual property information corresponding to entities in the knowledge graph is divided into 4 subject categories, namely liberal arts, science, agriculture and medicine, and a subject classification model based on word2vec and TextCNN is constructed to determine the category;

the process of candidate entity generation is as follows:

firstly, for intellectual property information M to be linked, fuzzy queries against the knowledge graph are used to match on the mentions, giving a set of possible entities D = (D_1, D_2, …, D_n);

secondly, the category of the intellectual property information M is obtained with the trained subject classification model: M_type = TextCNN(M);

finally, each entity in the set D is input into the trained subject classification model to obtain its category, and the final candidate entities are H = (H_1, H_2, …, H_k), {Hi ∈ D, Hi_type = M_type, i = 1, 2, …, k}, where k is the final number of candidate entities after category filtering;

(4) candidate entity ordering: entities are ordered on two aspects, the formal feature F_formal and the semantic feature F_semantic; for each candidate entity Hi, all of its information can be obtained from the knowledge graph: in addition to the structured formal feature information (Gi_1, Gi_2, …, Gi_n) obtained directly from the graph, it includes the semantic feature Gi_(n+1) based on the intellectual property content, that is, all the information of the candidate entity Hi is Gi = (Gi_1, Gi_2, …, Gi_n, Gi_(n+1));

The specific candidate entity ordering process is as follows:

the weight of each feature is determined with the AHP (Analytic Hierarchy Process) method as Wi = (Wi_1, Wi_2, …, Wi_n, Wi_(n+1));

solving the formal feature F_formal: for each candidate entity Hi, all of its information is obtained from the knowledge graph, the text content of the intellectual property information is removed, and only the structured formal feature information Gi' = (Gi_1, Gi_2, …, Gi_n) is retained; for the intellectual property information M to be linked, the formal feature is determined by computing the matching degree between M and Gi', denoted S_k, where M_j is the structured formal feature information of the intellectual property information M to be linked; that is, F_formal = S_k;

solving the semantic feature F_semantic: the value of the semantic feature F_semantic between the intellectual property information M to be linked and the candidate entity Hi is measured by similarity, which is computed with a character-based CBA (char + Bi-LSTM + attention) network; specifically, the semantic feature Gi_(n+1) of the candidate entity Hi and the text information M_n of the intellectual property information M to be linked each pass through a Bi-LSTM + attention layer, after which a softmax layer gives the similarity probabilities [y1, y2]; the similarity probability y1 is taken as the value of F_semantic;

ordering: the final similarity between the candidate entity Hi and the intellectual property information M to be linked is Fi = F_semantic + F_formal, and the final linked entity is the entity corresponding to max(Fi).

Preferably, the intellectual property information includes information of articles and patents; the attribute information of the entity comprises basic information of a person, graduate colleges, professions, work units and intellectual property information; the relationships among the entities comprise cooperative relationships and alumni relationships.
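As a minimal illustration of the triple form E = <sub, rela, obj> and of the relationships listed above, the following sketch stores a few triples in plain Python and derives alumni and cooperation pairs from them. The people, institutions and relation labels (`GRADUATED_FROM`, `WORKS_AT`, `AUTHOR_OF`) are hypothetical, and neo4j itself is not used here.

```python
from collections import namedtuple

# One knowledge-graph triple: subject, relation, object.
Triple = namedtuple("Triple", ["sub", "rela", "obj"])

# Hypothetical sample data (not taken from the source).
triples = [
    Triple("Zhang San", "GRADUATED_FROM", "University A"),
    Triple("Li Si", "GRADUATED_FROM", "University A"),
    Triple("Zhang San", "WORKS_AT", "Institute B"),
    Triple("Zhang San", "AUTHOR_OF", "Paper P1"),
    Triple("Li Si", "AUTHOR_OF", "Paper P1"),
]


def objects_of(sub, rela):
    """Return all objects linked to `sub` by relation `rela`."""
    return [t.obj for t in triples if t.sub == sub and t.rela == rela]


def derive_pairs(rela):
    """Pair up subjects that share the same object under `rela`.

    A shared school yields alumni pairs; a shared work yields
    cooperation pairs.
    """
    by_obj = {}
    for t in triples:
        if t.rela == rela:
            by_obj.setdefault(t.obj, []).append(t.sub)
    pairs = set()
    for subs in by_obj.values():
        for a in subs:
            for b in subs:
                if a < b:  # keep each unordered pair once
                    pairs.add((a, b))
    return pairs


alumni = derive_pairs("GRADUATED_FROM")
cooperators = derive_pairs("AUTHOR_OF")
```

In a graph database the same pairs would typically be materialized as explicit relationship edges between person nodes; the derivation above only shows where such edges could come from.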

Preferably, the structured form characteristic information includes author information, organization information, partner information, and periodical information.
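The rule-based extraction of these structured fields in step (2) could look roughly like the sketch below. The record layout (`Authors:`, `Organization:`, `Journal:`, `Abstract:` lines) and the convention that the first listed author is the author and the rest are partners are assumptions made for illustration; real patent or paper records would need their own patterns.

```python
import re

# Hypothetical field patterns for an assumed "Key: value" record layout.
FIELD_PATTERNS = {
    "authors": re.compile(r"^Authors?:[ \t]*(.+)$", re.MULTILINE),
    "organization": re.compile(r"^Organization:[ \t]*(.+)$", re.MULTILINE),
    "journal": re.compile(r"^Journal:[ \t]*(.+)$", re.MULTILINE),
    "text": re.compile(r"^Abstract:[ \t]*(.+)$", re.MULTILINE),
}


def extract_mention(record):
    """Extract structured formal features and text content from one record."""
    fields = {}
    for name, pattern in FIELD_PATTERNS.items():
        match = pattern.search(record)
        fields[name] = match.group(1).strip() if match else None

    # Split the author list; the first name is treated as the author and
    # the remaining names as partners (an assumption of this sketch).
    names = [n for n in re.split(r"\s*[;,]\s*", fields.pop("authors") or "") if n]
    fields["author"] = names[0] if names else None
    fields["partners"] = names[1:]
    return fields


record = (
    "Title: A Study of X\n"
    "Authors: Zhang San; Li Si\n"
    "Organization: University A\n"
    "Journal: Journal of Y\n"
    "Abstract: This paper studies X."
)
M = extract_mention(record)
```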

Preferably, in the step (3), the construction of the TextCNN subject classification model based on word2vec includes the following specific steps:

firstly, text data of intellectual property are collected and labeled; the labels are the 4 categories liberal arts (0), science (1), agriculture (2) and medicine (3);

secondly, word segmentation and stop-word removal are performed;

thirdly, a word2vec model is trained and used for embedding, giving n×k-dimensional vectors, and the category labels are converted into one-hot encoding;

finally, the embedded vectors and the labels are input into a TextCNN network for training to obtain the subject classification model, whose structure comprises convolution and pooling layers, a concatenation layer, a flatten layer, a dropout layer, a fully connected layer and a softmax layer.
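To make the layer sequence above concrete, here is a small pure-Python forward pass over an n×k embedding matrix: convolution with several window heights, max-over-time pooling, concatenation, a fully connected layer and softmax, plus the one-hot label encoding. It is only a structural sketch: the weights are random and untrained, the window heights and filter counts are assumed values, dropout is omitted because it is inactive at inference time, and random numbers stand in for word2vec embeddings.

```python
import math
import random

random.seed(0)
N_TOKENS, EMB_DIM, N_CLASSES = 6, 4, 4  # n, k, and the 4 subject categories
KERNEL_HEIGHTS = (2, 3)                 # assumed convolution window sizes
FILTERS_PER_HEIGHT = 2                  # assumed number of filters per size


def rand_matrix(rows, cols):
    return [[random.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


def conv_and_pool(x, kernel):
    """Slide one h x k kernel over the n x k input, apply ReLU, max-pool over time."""
    h = len(kernel)
    feats = []
    for start in range(len(x) - h + 1):
        s = sum(x[start + r][c] * kernel[r][c]
                for r in range(h) for c in range(len(kernel[0])))
        feats.append(max(0.0, s))
    return max(feats)


def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    total = sum(e)
    return [v / total for v in e]


def one_hot(label, n_classes=N_CLASSES):
    """Convert a class index (0..3) to a one-hot label vector."""
    return [1 if i == label else 0 for i in range(n_classes)]


# Untrained, randomly initialised parameters.
kernels = [rand_matrix(h, EMB_DIM)
           for h in KERNEL_HEIGHTS for _ in range(FILTERS_PER_HEIGHT)]
dense = rand_matrix(N_CLASSES, len(kernels))


def textcnn_forward(x):
    # Convolution + pooling per filter; the list itself is the
    # concatenated (already flat) feature vector.
    pooled = [conv_and_pool(x, k) for k in kernels]
    # Fully connected layer followed by softmax.
    logits = [sum(w * p for w, p in zip(row, pooled)) for row in dense]
    return softmax(logits)


x = rand_matrix(N_TOKENS, EMB_DIM)  # stand-in for word2vec embeddings
probs = textcnn_forward(x)
predicted_class = probs.index(max(probs))
```

A trained model would learn `kernels` and `dense` from the labeled data, typically with a deep learning framework rather than hand-written loops.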

Preferably, in the step (4), the AHP solving process first constructs an application example through an expert system, obtains an importance judgment matrix of each feature information through example analysis, and then calculates the weight by using SPSSAU software.
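The weight calculation from a judgment matrix can be sketched as follows. The source computes the weights with SPSSAU; this sketch instead uses the row geometric-mean approximation of the AHP priority vector, and the 3×3 judgment matrix (author vs. organization vs. text content) is a hypothetical example rather than the matrix used in the invention.

```python
import math


def ahp_weights(matrix):
    """Approximate AHP priority weights with the row geometric-mean method."""
    n = len(matrix)
    geo_means = [math.prod(row) ** (1.0 / n) for row in matrix]
    total = sum(geo_means)
    return [g / total for g in geo_means]


def consistency_index(matrix, weights):
    """Consistency index CI = (lambda_max - n) / (n - 1).

    lambda_max is estimated as the mean of (A w)_i / w_i.
    """
    n = len(matrix)
    aw = [sum(matrix[i][j] * weights[j] for j in range(n)) for i in range(n)]
    lambda_max = sum(aw[i] / weights[i] for i in range(n)) / n
    return (lambda_max - n) / (n - 1)


# Hypothetical pairwise importance judgments (reciprocal matrix):
# entry [i][j] says how much more important feature i is than feature j.
judgment = [
    [1.0, 3.0, 5.0],
    [1 / 3, 1.0, 3.0],
    [1 / 5, 1 / 3, 1.0],
]
weights = ahp_weights(judgment)
ci = consistency_index(judgment, weights)
```

A consistency index close to zero indicates that the pairwise judgments are nearly consistent with one another; a large value suggests the judgment matrix should be revised before the weights are used.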

Preferably, the character-based CBA network similarity solution in step (4) includes the following specific steps:

firstly, data are collected and labeled in the form (s1, s2, label), where label is 0 if s1 is similar to s2 and 1 otherwise;

secondly, character-based word2vec vector representations are obtained;

thirdly, the data are input into the CBA network for training to obtain a trained CBA model;

finally, the candidate entity Hi to be identified and the intellectual property information M to be linked are input into the trained CBA model, and the output [y1, y2] of the softmax layer is obtained, where y1 is the similarity.
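The way y1 is read off the network can be illustrated with the simplified sketch below. It covers only character-level tokenization, attention pooling and the two-way softmax; the Bi-LSTM is deliberately omitted, with random per-character vectors standing in for its outputs. All parameters are untrained, and combining the two encoded texts by concatenation is an assumption, since the source does not state how the two branches are merged.

```python
import math
import random

random.seed(1)
DIM = 4  # assumed size of each per-character vector


def char_tokens(text):
    """Character-level tokens, so no word segmentation is needed."""
    return [ch for ch in text if not ch.isspace()]


def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]


def attention_pool(states, query):
    """Weight each state by softmax(state . query) and sum the states."""
    scores = [sum(a * b for a, b in zip(h, query)) for h in states]
    alphas = softmax(scores)
    return [sum(alpha * h[d] for alpha, h in zip(alphas, states))
            for d in range(DIM)]


# Hypothetical untrained parameters. Random per-character vectors stand in
# for the Bi-LSTM hidden states of a trained model.
char_vectors = {}
query = [random.uniform(-1, 1) for _ in range(DIM)]
out_weights = [[random.uniform(-1, 1) for _ in range(2 * DIM)] for _ in range(2)]


def encode(text):
    states = [char_vectors.setdefault(ch, [random.uniform(-1, 1) for _ in range(DIM)])
              for ch in char_tokens(text)]
    return attention_pool(states, query)


def similarity(s1, s2):
    """Return y1, the softmax probability of label 0 ("similar")."""
    features = encode(s1) + encode(s2)  # concatenation is an assumption
    logits = [sum(w * f for w, f in zip(row, features)) for row in out_weights]
    y1, y2 = softmax(logits)
    return y1


y1 = similarity("knowledge graph", "entity linking")
```

Because label 0 marks similar pairs in the training data, the first softmax output is the probability of "similar", which is why y1 rather than y2 serves as the similarity value.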

The invention has the beneficial effects that:

1. Traditional entity linking relies directly on features extracted from the context text of entities in sources such as Wikipedia, which cannot effectively express the degree of association between a mention and a candidate entity.

2. Traditional candidate entity generation matches directly without any screening, which increases the workload and the proportion of distractors; it also ignores the fact that, when semantic information is matched directly, most prediction errors are caused by type errors.

3. Weighing the strengths and weaknesses of similarity based on manually defined rules and of similarity based on deep learning, the invention combines knowledge-graph-based formal features with intellectual-property-based semantic features in the candidate entity ordering process, and orders candidate entities by similarity using the statistics-based AHP method.

4. In solving the intellectual-property-based semantic features, the traditional cosine similarity based on word2vec has low accuracy; an improved Siamese network, CBA, is therefore adopted, which is character-based, adds an attention mechanism, and takes the network's similarity probability as the similarity value, improving accuracy.

Drawings

The present invention will be described in further detail with reference to the accompanying drawings and specific embodiments.

Fig. 1 is a schematic diagram of a TextCNN network structure according to an embodiment of the present invention.

Fig. 2 is a schematic diagram of a CBA network structure according to an embodiment of the present invention.

Detailed Description

The intellectual property identity identification method based on an academic talent information knowledge graph comprises the following steps:

1. A crawler acquires talent information data, including names, resumes and intellectual property information, where the intellectual property information includes articles, patents and the like. A talent information knowledge graph is built in neo4j from this information. The knowledge graph is composed of triples of the form E = <sub, rela, obj> and specifically comprises attribute information of the entities (talents) and the relationships among the entities. The attribute information of an entity comprises a person's basic information, the college or university graduated from, major, work unit and intellectual property information; the relationships among entities comprise cooperative relationships and alumni relationships.

2. Mention identification: because the format of intellectual property information is standardized, the structured formal feature information and the intellectual property content information of the intellectual property information M to be linked can be obtained directly based on regular-expression rules, giving M = (M_1, M_2, …, M_n); the structured formal feature information comprises author information, organization information, partner information and periodical information.

3. Candidate entity generation: conventional methods obtain candidate entities by directly querying features in the graph, or by entity matching with similarity measures such as edit distance. Previous results show, however, that in more than half of the samples the model predicted wrongly, the error was a type mismatch. This embodiment therefore provides a type-based candidate entity generation method that yields fewer but more precise candidates, improving accuracy and reducing the computational workload.

Specifically, to prevent omissions caused by overly fine type division, this embodiment divides the intellectual property information corresponding to the entities in the knowledge graph into 4 subject categories, namely liberal arts, science, agriculture and medicine, and constructs a subject classification model based on word2vec and TextCNN.

Candidate entities are first obtained according to the mentions and then further screened by subject category to reduce the amount of computation. The specific steps are as follows:

firstly, for intellectual property information M to be linked, fuzzy queries against the knowledge graph are used to match on the mentions, giving a set of possible entities D = (D_1, D_2, …, D_n); this query can be issued directly as a neo4j query statement, so it is fast;

finally, each entity in the set D is input into the trained subject classification model to obtain its category, and the final candidate entities are H = (H_1, H_2, …, H_k), {Hi ∈ D, Hi_type = M_type, i = 1, 2, …, k}, where k is the final number of candidate entities after category filtering.
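The two-stage generation described here, a fuzzy match followed by a category filter, can be sketched in a few lines. In this illustration a substring test stands in for the neo4j fuzzy query and a keyword rule stands in for the trained TextCNN classifier; both stand-ins, the entity records and the keyword lists are hypothetical.

```python
def fuzzy_match(mention, entities):
    """Stand-in for the graph's fuzzy query: substring match on the name."""
    return [e for e in entities if mention in e["name"]]


def classify(text):
    """Placeholder for the trained subject classifier (keyword rule only)."""
    keywords = {
        "agriculture": ("crop", "soil"),
        "medicine": ("clinical", "drug"),
    }
    for category, words in keywords.items():
        if any(w in text for w in words):
            return category
    return "science"


def generate_candidates(mention, text, entities):
    """Return H: entities in D whose category equals the category of M."""
    D = fuzzy_match(mention, entities)   # possible entity set D
    m_type = classify(text)              # M_type
    return [e for e in D if classify(e["ip_text"]) == m_type]


# Hypothetical entities, two of which share the same name.
entities = [
    {"id": 1, "name": "Zhang San", "ip_text": "crop yield and soil study"},
    {"id": 2, "name": "Zhang San", "ip_text": "clinical trial of a drug"},
    {"id": 3, "name": "Li Si", "ip_text": "crop irrigation"},
]
H = generate_candidates("Zhang San", "soil and crop rotation", entities)
```

Here the name alone matches two people, and the category filter removes the one whose work belongs to a different subject, which is the kind of type error the embodiment aims to avoid.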

The construction of the TextCNN subject classification model based on word2vec comprises the following specific steps:

secondly, word segmentation and stop-word removal are performed;

Fig. 1 shows a specific structure of the TextCNN network.

4. Candidate entity ordering: most traditional candidate entity ordering methods are based on similarity, for example ordering by manually defined features between the mention and the target entity, or directly computing the similarity between the context information and the user information with a deep learning method. Weighing the strengths and weaknesses of the two approaches, this embodiment combines them, fuses the structural information of the knowledge graph, and orders entities on two aspects: the formal feature F_formal and the semantic feature F_semantic.

For each candidate entity Hi, all of its information can be obtained from the knowledge graph: in addition to the structured formal feature information (Gi_1, Gi_2, …, Gi_n) obtained directly from the graph, it includes the semantic feature Gi_(n+1) based on the intellectual property content, that is, all the information of the candidate entity Hi is Gi = (Gi_1, Gi_2, …, Gi_n, Gi_(n+1)).

The specific candidate entity ordering process is as follows:

1) The weight of each feature is determined with the AHP (Analytic Hierarchy Process) method as Wi = (Wi_1, Wi_2, …, Wi_n, Wi_(n+1)). Specifically, an application example can first be constructed, an importance judgment matrix of the feature information is obtained through analysis of the example, and the weights are then calculated with SPSSAU software.

2) Solving the formal feature F_formal: for each candidate entity Hi, all of its information is obtained from the knowledge graph, the text content of the intellectual property information is removed, and only the structured formal feature information Gi' = (Gi_1, Gi_2, …, Gi_n) is retained. For the intellectual property information M to be linked, the formal feature is determined by computing the matching degree between M and Gi', denoted S_k, where M_j is the structured formal feature information of the intellectual property information M to be linked; that is, F_formal = S_k.

3) Solving the semantic feature F_semantic: the value of the semantic feature F_semantic between the intellectual property information M to be linked and the candidate entity Hi is measured by similarity, which is computed with a character-based CBA (char + Bi-LSTM + attention) network. Specifically, the semantic feature Gi_(n+1) of the candidate entity Hi and the text information M_n of the intellectual property information M to be linked are input into the CBA network to obtain the similarity probabilities [y1, y2]; the similarity probability y1 is taken as the value of F_semantic.

As shown in Fig. 2, the character-based CBA network construction process includes the following steps:

secondly, character-based word2vec vector representations are obtained;

finally, the data and the labels are input into the CBA network for training to obtain a trained CBA model.

4) Ordering: the final similarity between the candidate entity Hi and the intellectual property information M to be linked is Fi = F_semantic + F_formal, and the final linked entity is the entity corresponding to max(Fi).
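The ordering steps above can be tied together in a short sketch. The matching-degree formula is not reproduced in this text, so the formal score below is an assumption: a weighted sum of exact-match indicators over the structured fields. The field weights, the candidate records and the semantic scores (which would come from the CBA network as y1) are hypothetical values; only the final rule, Fi = F_semantic + F_formal followed by max(Fi), is taken from the description.

```python
def formal_score(mention, candidate, weights):
    """Assumed matching degree: sum of weights of fields that match exactly."""
    return sum(
        w for field, w in weights.items()
        if mention.get(field) is not None
        and mention.get(field) == candidate.get(field)
    )


def link(mention, candidates, weights, semantic_scores):
    """Return the id of the candidate with the largest Fi = F_semantic + F_formal."""
    def final_score(c):
        return semantic_scores[c["id"]] + formal_score(mention, c, weights)
    return max(candidates, key=final_score)["id"]


# Hypothetical AHP weights for the structured fields.
weights = {"author": 0.4, "organization": 0.3, "journal": 0.1}

mention = {"author": "Zhang San", "organization": "University A",
           "journal": "Journal of Y"}
candidates = [
    {"id": 1, "author": "Zhang San", "organization": "University A",
     "journal": "Journal of Z"},
    {"id": 2, "author": "Zhang San", "organization": "Institute B",
     "journal": "Journal of Y"},
]
# Stand-ins for the y1 outputs of a trained CBA model.
semantic_scores = {1: 0.6, 2: 0.7}

linked_id = link(mention, candidates, weights, semantic_scores)
```

In this example candidate 2 has the higher semantic score, but candidate 1 matches on the more heavily weighted organization field and wins overall, showing how the formal and semantic features complement each other.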

The embodiment has the following technical characteristics:

1. The embodiment draws on the data accumulated in the big data industry and provides an entity linking model that fuses the structural information of the knowledge graph, improving the convenience, richness and effectiveness of the data.

2. The type-based candidate entity generation method reduces the computational workload and the number of distractors, and improves accuracy.

3. In the candidate entity ordering process, knowledge-graph-based formal features and intellectual-property-based semantic features are combined, and candidate entities are ordered by similarity using the statistics-based AHP method.

4. An improved Siamese network, CBA, is adopted: it is character-based, adds an attention mechanism, and takes the network's similarity probability as the similarity value. This replaces the traditional cosine similarity computed directly on word2vec, avoids interference from word segmentation, strengthens semantic relevance and improves accuracy.

The above-described embodiments of the present invention do not limit the scope of the present invention. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present invention should be included in the scope of the claims of the present invention.

Claims

1. An intellectual property identity identification method based on an academic talent information knowledge graph, comprising the following steps:

(2) mention identification: for intellectual property information M to be linked, the structured formal feature information and the text information of the intellectual property are obtained directly from M based on regular-expression rules, giving M = (M_1, M_2, …, M_n), where each M_i is an obtained mention;

the process of candidate entity generation is as follows:

firstly, for intellectual property information M to be linked, fuzzy queries against the knowledge graph are used to match on the mentions, giving a set of possible entities D = (D_1, D_2, …, D_n);

finally, each entity in the set D is input into the trained subject classification model to obtain its category, and the final candidate entities are H = (H_1, H_2, …, H_k), {Hi ∈ D, Hi_type = M_type, i = 1, 2, …, k}, where k is the final number of candidate entities after category filtering;

(4) candidate entity ordering: entities are ordered on two aspects, the formal feature F_formal and the semantic feature F_semantic; for each candidate entity Hi, all of its information can be obtained from the knowledge graph: in addition to the structured formal feature information (Gi_1, Gi_2, …, Gi_n) obtained directly from the graph, it includes the semantic feature Gi_(n+1) based on the intellectual property content, that is, all the information of the candidate entity Hi is Gi = (Gi_1, Gi_2, …, Gi_n, Gi_(n+1));

The specific candidate entity ordering process is as follows:

the weight of each feature is determined with the AHP (Analytic Hierarchy Process) method as Wi = (Wi_1, Wi_2, …, Wi_n, Wi_(n+1));

solving the formal feature F_formal: for each candidate entity Hi, all of its information is obtained from the knowledge graph, the text content of the intellectual property information is removed, and only the structured formal feature information Gi' = (Gi_1, Gi_2, …, Gi_n) is retained; for the intellectual property information M to be linked, the formal feature is determined by computing the matching degree between M and Gi';

solving the semantic feature F_semantic: the value of the semantic feature F_semantic between the intellectual property information M to be linked and the candidate entity Hi is measured by similarity, which is computed with a character-based CBA (char + Bi-LSTM + attention) network; specifically, the semantic feature Gi_(n+1) of the candidate entity Hi and the text information M_n of the intellectual property information M to be linked each pass through a Bi-LSTM + attention layer, after which a softmax layer gives the similarity probabilities [y1, y2]; the similarity probability y1 is taken as the value of F_semantic;

ordering: the final similarity between the candidate entity Hi and the intellectual property information M to be linked is Fi = F_semantic + F_formal, and the final linked entity is the entity corresponding to max(Fi).

2. The intellectual property identity recognition method of claim 1, wherein: the intellectual property information comprises information of articles and patents; the attribute information of the entity comprises basic information of a person, graduate colleges, professions, work units and intellectual property information; the relationships among the entities comprise cooperative relationships and alumni relationships.

3. The intellectual property identity recognition method of claim 1, wherein: the structural form characteristic information comprises author information, organization information, partner information and periodical information.

4. The intellectual property identity recognition method of claim 1, wherein: in the step (3), the construction of the TextCNN subject classification model based on word2vec comprises the following specific steps:

secondly, word segmentation and stop-word removal are performed;

5. The intellectual property identity recognition method of claim 1, wherein: in the step (4), the AHP solving process firstly constructs an application example through an expert system, obtains an importance judgment matrix of each characteristic information through example analysis, and then calculates the weight by utilizing SPSSAU software.

6. The intellectual property identity recognition method of claim 1, wherein: the character-based CBA network similarity solving method in the step (4) comprises the following specific steps:

firstly, data are collected and labeled in the form (s1, s2, label), where label is 0 if s1 is similar to s2 and 1 otherwise;

secondly, character-based word2vec vector representations are obtained;