CN113569575A - Evaluation expert recommendation method based on pictograph-semantic dual-feature space mapping - Google Patents

Info

Publication number
CN113569575A
Authority
CN
China
Prior art keywords
entity
semantic
pictographic
mapping
expert
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110913345.6A
Other languages
Chinese (zh)
Other versions
CN113569575B (en)
Inventor
杨政
尹春林
朱华
苏蒙
潘侃
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Electric Power Research Institute of Yunnan Power Grid Co Ltd
Original Assignee
Electric Power Research Institute of Yunnan Power Grid Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Electric Power Research Institute of Yunnan Power Grid Co Ltd
Priority to CN202110913345.6A
Publication of CN113569575A
Application granted
Publication of CN113569575B
Legal status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295 Named entity recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/335 Filtering based on additional data, e.g. user or group profiles
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/194 Calculation of difference between files
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/044 Recurrent networks, e.g. Hopfield networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y04 INFORMATION OR COMMUNICATION TECHNOLOGIES HAVING AN IMPACT ON OTHER TECHNOLOGY AREAS
    • Y04S SYSTEMS INTEGRATING TECHNOLOGIES RELATED TO POWER NETWORK OPERATION, COMMUNICATION OR INFORMATION TECHNOLOGIES FOR IMPROVING THE ELECTRICAL POWER GENERATION, TRANSMISSION, DISTRIBUTION, MANAGEMENT OR USAGE, i.e. SMART GRIDS
    • Y04S10/00 Systems supporting electrical power generation, transmission or distribution
    • Y04S10/50 Systems or methods supporting the power network operation or management, involving a certain degree of interaction with the load-side end user applications

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Biomedical Technology (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Evolutionary Computation (AREA)
  • Biophysics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The application relates to the technical field of expert recommendation and provides a review expert recommendation method based on pictographic-semantic dual-feature space mapping. It proposes an entity matching strategy based on semantic-pictographic dual-feature space mapping that matches projects and experts effectively and accurately in an automated way, thereby reducing the labor cost of review work, improving the reliability of review results, and raising overall review efficiency.

Description

Evaluation expert recommendation method based on pictograph-semantic dual-feature space mapping
Technical Field
The application relates to the technical field of expert recommendation, in particular to a review expert recommendation method based on pictograph-semantic dual-feature space mapping.
Background
As the national power grid vigorously promotes theoretical innovation in areas such as ultra-high-voltage AC and DC grids, smart grids, and the third industrial revolution, the number of applications for innovative electric power science and technology projects has grown substantially and continues to increase.
Under these circumstances, reviewing electric power science and technology project applications is a difficult and burdensome task, and every stage from the formal review to the content quality review must be completed with high quality and high efficiency. The most important link in the review process is the expert's review of the content quality of the application form. An accurate review result can be obtained only when the technologies and fields mastered by the review expert match the content of the application form, so the reliability of the review result is directly tied to the degree of matching.
However, at present most matching of review experts to project applications is done manually, either by random assignment or by recommendation from people with deep expertise. Because manual operation is inevitably subjective, the current way of matching review experts to project application forms makes the labor cost of review work too high, the reliability of review results poor, and the overall review efficiency low.
Disclosure of Invention
To overcome the shortcomings of the prior art, the present application aims to provide a review expert recommendation method based on pictograph-semantic dual-feature space mapping, so as to solve at least one of the following technical problems: the excessive labor cost of review work, the poor reliability of review results, and the low overall review efficiency.
To achieve the above object, the present application provides a review expert recommendation method based on pictograph-semantic dual-feature space mapping, which specifically includes:
Acquiring abstract information of an electric power science and technology project application form.
Performing named entity recognition on the abstract information of the electric power project application form to obtain electric power project entities, where the electric power project entities include usage-method entities and related-field entities.
Crawling the personal homepage information of electric power experts and the abstract information of their published papers.
Performing named entity recognition on the personal homepage information of the electric power experts and the abstract information of the published papers to obtain electric power expert entities, where the electric power expert entities include adept-technology entities and research-direction entities.
Performing pictographic mapping on the usage-method entities to obtain pictographic usage-method entities, and performing pictographic mapping on the related-field entities to obtain pictographic related-field entities.
Performing pictographic mapping on the adept-technology entities to obtain pictographic adept-technology entities, and performing pictographic mapping on the research-direction entities to obtain pictographic research-direction entities.
Performing semantic mapping on the pictographic usage-method entities to obtain usage-method feature vectors, and performing semantic mapping on the pictographic related-field entities to obtain related-field feature vectors.
Performing semantic mapping on the pictographic adept-technology entities to obtain adept-technology feature vectors, and performing semantic mapping on the pictographic research-direction entities to obtain research-direction feature vectors.
Calculating a comprehensive matching score from the usage-method feature vectors, the related-field feature vectors, the adept-technology feature vectors, and the research-direction feature vectors.
Determining the review expert according to the ranking of all comprehensive matching scores.
Further, named entity recognition is performed using a RoBERTa pre-trained model and a BiLSTM + CRF model.
Further, the specific method for performing named entity recognition using the RoBERTa pre-trained model and the BiLSTM + CRF model is as follows:
Acquiring text information.
Segmenting the text information to obtain a word set.
Performing vector mapping on the word set using the RoBERTa pre-trained model to obtain a word vector set.
Training the BiLSTM + CRF model on the word vector set to obtain the named entities of the text information.
Further, the specific method for obtaining the research-direction entities and the adept-technology entities is as follows:
Acquiring keywords of the review expert information.
Crawling the experts' personal homepage information and the abstract information of their published papers according to the keywords.
Performing named entity recognition on the expert personal homepage information using the RoBERTa pre-trained model and the BiLSTM + CRF model to obtain the research-direction entities.
Performing named entity recognition on the abstract information of the published papers using the RoBERTa pre-trained model and the BiLSTM + CRF model to obtain the adept-technology entities.
Further, the specific method for calculating the comprehensive matching score is as follows:
Performing a Euclidean distance calculation on the related-field feature vector and the research-direction feature vector in the pictographic feature space to obtain a first pictographic matching score.
Performing a cosine similarity calculation on the related-field feature vector and the research-direction feature vector in the semantic feature space to obtain a first semantic matching score.
Performing a Euclidean distance calculation on the usage-method feature vector and the research-method feature vector in the pictographic feature space to obtain a second pictographic matching score.
Performing a cosine similarity calculation on the usage-method feature vector and the research-method feature vector in the semantic feature space to obtain a second semantic matching score.
Summing the first pictographic matching score and the first semantic matching score to obtain a research-direction matching score.
Summing the second pictographic matching score and the second semantic matching score to obtain a research-method matching score.
Performing a weighted summation of the research-direction matching score and the research-method matching score to obtain the comprehensive matching score.
Further, the Euclidean distance calculation is performed using the following formula:
[Formula image: BDA0003204690640000021]
where D is the entity similarity score at the field (direction) level, F is the set of related-field entities, and R is the set of research-direction entities. The symbols shown in images BDA0003204690640000022 to BDA0003204690640000025 denote, in order, the pictographic-space embedding of an entity in set F, the pictographic-space embedding of an entity in set R, the semantic-space embedding of an entity in set F, and the semantic-space embedding of an entity in set R.
Further, the cosine similarity calculation is performed using the following formula:
[Formula image: BDA0003204690640000026]
where T is the entity similarity score at the method (technology) level, O is the set of usage-method entities, and L is the set of adept-technology entities. The symbols shown in images BDA0003204690640000031 to BDA0003204690640000034 denote, in order, the pictographic-space embedding of an entity in set O, the pictographic-space embedding of an entity in set L, the semantic-space embedding of an entity in set O, and the semantic-space embedding of an entity in set L.
Further, the comprehensive matching score is calculated as follows:
score = k × D + (1 - k) × T
where score is the comprehensive matching score and k is the weight.
Further, a greedy algorithm is used to determine the value of k.
Further, the value of k is set to 0.3.
The method first uses a RoBERTa pre-trained model to produce a hierarchical representation of the text, and then uses a BiLSTM + CRF model to recognize the named entities in the electric power project text and the electric power expert text. The named entities are then mapped into feature vectors through the pictographic-semantic dual feature space, Euclidean distance and cosine similarity are computed on the resulting feature vectors to obtain the individual matching scores, and these matching scores are combined by weighted summation into a comprehensive matching score. Finally, the expert with the highest comprehensive matching score is selected as the review expert for the electric power project text. The application thus provides an entity matching strategy between project texts and domain experts based on semantic-pictographic dual-feature space mapping, which matches projects and domain experts effectively and accurately in an automated way, thereby reducing the labor cost of review work, improving the reliability of review results, and raising overall review efficiency.
Drawings
In order to more clearly explain the technical solution of the present application, the drawings needed to be used in the embodiments will be briefly described below, and it is obvious to those skilled in the art that other drawings can be obtained according to the drawings without creative efforts.
FIG. 1 is a schematic flowchart of the review expert recommendation method based on pictograph-semantic dual-feature space mapping according to an embodiment of the present application;
FIG. 2 is a schematic flowchart of the method for calculating the comprehensive matching score according to an embodiment of the present application;
FIG. 3(a) is a schematic diagram of the recognition result for the related-field entities of an application form according to an embodiment of the present application;
FIG. 3(b) is a schematic diagram of the recognition result for the usage-method entities of an application form according to an embodiment of the present application;
FIG. 4 is a schematic diagram of the heterogeneous matching process based on dual feature space mapping according to an embodiment of the present application;
FIG. 5 is a schematic comparison of the matching results between electric power project entities and electric power expert entities according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be fully and clearly described below with reference to the drawings in the embodiments of the present application. It is to be understood that the embodiments described are only a few embodiments of the present application and not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
In order to facilitate understanding of the technical solutions of the present application, some concepts related to the present application are first described below.
The RoBERTa model is an improved Chinese pre-trained model based on BERT. Compared with the original BERT, RoBERTa increases the batch size, introduces a dynamic masking mechanism, enlarges the training corpus, and removes the NSP (next sentence prediction) term from the loss function. Specifically, the batch size is increased from 256 to 8000, 10 different masking patterns are used so that a sample is not masked in the same fixed way in every epoch, and the training data grows from 13 GB to 160 GB.
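The dynamic masking idea can be illustrated with a minimal sketch. It assumes a simplified rule in which each token is independently replaced by a mask token with a fixed probability; the function name `dynamic_mask`, the `[MASK]` string, and the 0.15 probability are illustrative assumptions, and the full BERT-style replacement rules are omitted.

```python
import random

def dynamic_mask(tokens, mask_token="[MASK]", mask_prob=0.15, rng=None):
    """Return a freshly masked copy of `tokens`.

    Calling this again for every epoch yields a different masking pattern,
    in contrast to masking the corpus once during preprocessing.
    Simplified assumption: each token is masked independently.
    """
    if rng is None:
        rng = random.Random()
    return [mask_token if rng.random() < mask_prob else tok for tok in tokens]

# Hypothetical usage: the same sentence receives a new pattern per epoch.
tokens = list("变压器检修与无功补偿")
for epoch in range(3):
    print(epoch, dynamic_mask(tokens, rng=random.Random(epoch)))
```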
Specifically, the input of the RoBERTa model consists of a word vector, a sentence vector, and a position vector. The word vectors include the encoding vectors of the classification token and the separator token; the sentence vector is an encoding vector used to distinguish different sentences; and the position vector encodes the position of each word within the sentence. The model output is the word embedding matrix obtained after all words of the sentence have been encoded by the self-attention encoder.
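As a rough sketch of how the three input components are combined in BERT-style models, the word, sentence, and position vectors of each token are looked up in separate tables and added element-wise. The table sizes, the random initialization, and the name `embed_input` below are toy assumptions for illustration, not values from the application.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, max_len, hidden = 100, 16, 8  # toy sizes, hypothetical

token_table = rng.normal(size=(vocab_size, hidden))    # word vectors
segment_table = rng.normal(size=(2, hidden))           # sentence vectors
position_table = rng.normal(size=(max_len, hidden))    # position vectors

def embed_input(token_ids, segment_ids):
    """Sum the three embeddings for each position of one input sequence."""
    token_ids = np.asarray(token_ids)
    segment_ids = np.asarray(segment_ids)
    positions = np.arange(len(token_ids))
    return (token_table[token_ids]
            + segment_table[segment_ids]
            + position_table[positions])

# Three tokens; the last one belongs to the second sentence.
x = embed_input([1, 5, 7], [0, 0, 1])
print(x.shape)
```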
The recurrent neural network (RNN) is the most widely used neural network for sequence learning tasks. The bidirectional long short-term memory network (Bi-LSTM) is a variant of the RNN with bidirectional temporal modelling and a dedicated gating structure, which effectively alleviates the vanishing and exploding gradient problems.
Conditional Random Fields (CRF) are a commonly used sequence labeling algorithm.
Matching experts to project application forms can be regarded as heterogeneous data matching, in which data from different sources are first preprocessed and then matched against each other to produce a reasonable output. Preprocessing the project application data relies on named entity recognition, that is, identifying and classifying proper names in the text such as the names of people, places, and organizations. The expert data can be extracted from web pages using crawler technology and then preprocessed. Once both kinds of processed data are available, heterogeneous matching can produce results that are more accurate and efficient than manual recommendation.
Specifically, for recognizing the named entities of an electric power project application form, the following concepts are first defined:
(1) Usage-method entity: a method used in the application form, for example: zero-sequence harmonic component principle, thermal step current method, electromagnetic coupling principle.
(2) Related-field entity: a field that the application form relates to, for example: reactive power compensation, power transformation engineering, economical power transmission.
Referring to fig. 1, a schematic flow chart of a review expert recommendation method based on pictograph-semantic dual feature space mapping provided in an embodiment of the present application is shown. The embodiment of the application provides a review expert recommendation method based on pictograph-semantic dual-feature space mapping, which specifically comprises the following steps:
Step S1: Acquire abstract information of an electric power science and technology project application form.
Step S2: Perform named entity recognition on the abstract information of the electric power project application form to obtain electric power project entities, where the electric power project entities include usage-method entities and related-field entities.
Step S3: Crawl the personal homepage information of electric power experts and the abstract information of their published papers.
Step S4: Perform named entity recognition on the personal homepage information of the electric power experts and the abstract information of the published papers to obtain electric power expert entities, where the electric power expert entities include adept-technology entities and research-direction entities.
Further, named entity recognition is performed using a RoBERTa pre-trained model and a BiLSTM + CRF model. Specifically, the feature extractor of the RoBERTa model is a bidirectional Transformer. Each Transformer unit consists of a self-attention layer, a feed-forward network, and a residual-and-normalization layer (Add & Norm); this structure makes full use of context information and captures longer-range dependencies.
In the embodiment of the present application, the BiLSTM model outputs a score matrix over each word and each label, called the "emission matrix" A. Specifically, the hidden-layer vector of each word is passed through a linear layer (the final classification step of the BiLSTM, which maps the hidden state to scores), and the resulting values form the scores of the labels for that word.
Meanwhile, the embodiment of the present application selects a linear-chain CRF model to learn the dependencies among the labels in a sequence, that is, to predict the label sequence corresponding to the input sequence.
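How an emission matrix and the label dependencies learned by a linear-chain CRF combine at prediction time can be sketched with Viterbi decoding. This is a minimal illustration under stated assumptions: the emission matrix and the transition matrix are given as plain NumPy arrays, start and end transitions are omitted, and the name `viterbi_decode` is hypothetical rather than taken from the application.

```python
import numpy as np

def viterbi_decode(emissions, transitions):
    """Return the highest-scoring label index sequence.

    emissions:   (seq_len, num_tags) scores, one row per word.
    transitions: (num_tags, num_tags); transitions[i, j] is the score of
                 moving from tag i to tag j.
    """
    seq_len, num_tags = emissions.shape
    score = emissions[0].copy()
    backptr = np.zeros((seq_len, num_tags), dtype=int)
    for t in range(1, seq_len):
        # candidate[i, j]: best path ending in tag i at t-1, then tag j at t
        candidate = score[:, None] + transitions
        backptr[t] = candidate.argmax(axis=0)
        score = candidate.max(axis=0) + emissions[t]
    # Follow the back-pointers from the best final tag.
    best = [int(score.argmax())]
    for t in range(seq_len - 1, 0, -1):
        best.append(int(backptr[t, best[-1]]))
    return best[::-1]

# Toy example with tags O=0, B=1, I=2; the transition O -> I is penalized.
transitions = np.zeros((3, 3))
transitions[0, 2] = -10.0
emissions = np.array([[2.0, 0.1, 0.0],
                      [0.0, 0.5, 1.0]])
print(viterbi_decode(emissions, transitions))
```

In the toy example the second word scores highest for I on emissions alone, but the penalized O -> I transition steers the decoder away from an I tag that follows O, which is the kind of label dependency the CRF layer is meant to capture.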
Furthermore, the specific method for performing named entity recognition using the RoBERTa pre-trained model and the BiLSTM + CRF model is as follows:
Step S411: Acquire text information.
Step S412: Segment the text information to obtain a word set.
Step S413: Perform vector mapping on the word set using the RoBERTa pre-trained model to obtain a word vector set.
Step S414: Train the BiLSTM + CRF model on the word vector set to obtain the named entities of the text information.
Specifically, in the embodiment of the present application, the original input is initialized by the RoBERTa model, which outputs word vectors; the word vectors serve as the input of the BiLSTM + CRF model, which then produces the named entities.
Furthermore, the specific method for obtaining the research-direction entities and the adept-technology entities is as follows:
Step S421: Acquire keywords of the review expert information.
Step S422: Crawl the experts' personal homepage information and the abstract information of their published papers according to the keywords.
Step S423: Perform named entity recognition on the expert personal homepage information using the RoBERTa pre-trained model and the BiLSTM + CRF model to obtain the research-direction entities.
Step S424: Perform named entity recognition on the abstract information of the published papers using the RoBERTa pre-trained model and the BiLSTM + CRF model to obtain the adept-technology entities.
Specifically, if the number of electric power expert texts is small, the research-direction entities of the crawled expert texts can be screened manually. More specifically, because the research-method entities of the expert texts are comparable to the usage-method entities of the project texts, the embodiment of the present application recognizes the research-method entities of the expert texts with the same model that is used to recognize the usage-method entities of the project texts.
Step S5: Perform pictographic mapping on the usage-method entities to obtain pictographic usage-method entities, and perform pictographic mapping on the related-field entities to obtain pictographic related-field entities.
Step S6: Perform pictographic mapping on the adept-technology entities to obtain pictographic adept-technology entities, and perform pictographic mapping on the research-direction entities to obtain pictographic research-direction entities.
Step S7: Perform semantic mapping on the pictographic usage-method entities to obtain usage-method feature vectors, and perform semantic mapping on the pictographic related-field entities to obtain related-field feature vectors.
Step S8: Perform semantic mapping on the pictographic adept-technology entities to obtain adept-technology feature vectors, and perform semantic mapping on the pictographic research-direction entities to obtain research-direction feature vectors.
Step S9: Calculate a comprehensive matching score from the usage-method feature vectors, the related-field feature vectors, the adept-technology feature vectors, and the research-direction feature vectors.
Further, FIG. 2 shows a schematic flowchart of the method for calculating the comprehensive matching score according to the embodiment of the present application. In the embodiment of the present application, the specific method for obtaining the comprehensive matching score is as follows:
Step S91: Perform a Euclidean distance calculation on the related-field feature vector and the research-direction feature vector in the pictographic feature space to obtain a first pictographic matching score.
Further, the Euclidean distance calculation is performed using the following formula:
[Formula image: BDA0003204690640000051]
where D is the entity similarity score at the field (direction) level, F is the set of related-field entities, and R is the set of research-direction entities. The symbols shown in images BDA0003204690640000052 to BDA0003204690640000055 denote, in order, the pictographic-space embedding of an entity in set F, the pictographic-space embedding of an entity in set R, the semantic-space embedding of an entity in set F, and the semantic-space embedding of an entity in set R.
Step S92: Perform a cosine similarity calculation on the related-field feature vector and the research-direction feature vector in the semantic feature space to obtain a first semantic matching score.
Further, the cosine similarity calculation is performed using the following formula:
[Formula image: BDA0003204690640000061]
where T is the entity similarity score at the method (technology) level, O is the set of usage-method entities, and L is the set of adept-technology entities. The symbols shown in images BDA0003204690640000062 to BDA0003204690640000065 denote, in order, the pictographic-space embedding of an entity in set O, the pictographic-space embedding of an entity in set L, the semantic-space embedding of an entity in set O, and the semantic-space embedding of an entity in set L.
Step S93: Perform a Euclidean distance calculation on the usage-method feature vector and the research-method feature vector in the pictographic feature space to obtain a second pictographic matching score.
Step S94: Perform a cosine similarity calculation on the usage-method feature vector and the research-method feature vector in the semantic feature space to obtain a second semantic matching score.
Step S95: Sum the first pictographic matching score and the first semantic matching score to obtain a research-direction matching score.
Step S96: Sum the second pictographic matching score and the second semantic matching score to obtain a research-method matching score.
Step S97: Perform a weighted summation of the research-direction matching score and the research-method matching score to obtain the comprehensive matching score.
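Steps S91 to S96 can be sketched as follows. Because the formula images are not reproduced in this text, the sketch rests on two loudly labelled assumptions that are not taken from the application: a Euclidean distance d is converted into a matching score as 1 / (1 + d), and the score for two entity sets is the average over all project-expert entity pairs. All function names are hypothetical.

```python
import numpy as np

def euclidean_distance(a, b):
    """Euclidean distance between two pictographic-space embeddings."""
    return float(np.linalg.norm(np.asarray(a, float) - np.asarray(b, float)))

def cosine_similarity(a, b):
    """Cosine similarity between two semantic-space embeddings (non-zero)."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def pair_score(pict_a, pict_b, sem_a, sem_b):
    # ASSUMPTION: distance is turned into a similarity via 1 / (1 + d).
    pictographic = 1.0 / (1.0 + euclidean_distance(pict_a, pict_b))
    semantic = cosine_similarity(sem_a, sem_b)
    return pictographic + semantic  # sum of the two scores, as in S95/S96

def level_score(project_entities, expert_entities):
    """Each entity is a (pictographic_vector, semantic_vector) pair.

    ASSUMPTION: the set-level score averages over all entity pairs.
    """
    scores = [pair_score(p_pict, e_pict, p_sem, e_sem)
              for p_pict, p_sem in project_entities
              for e_pict, e_sem in expert_entities]
    return sum(scores) / len(scores)
```

Calling `level_score` once on the related-field and research-direction entities and once on the usage-method and adept-technology entities would give the two level scores that step S97 combines.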
Further, the comprehensive matching score is calculated as follows:
score = k × D + (1 - k) × T
where score is the comprehensive matching score and k is a hyperparameter, namely the weight that represents the importance of matching at the field (direction) level.
Further, in the embodiment of the present application, a greedy algorithm is used to determine the value of k; after repeated verification, k = 0.3 was found to be the most suitable value for this embodiment.
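The weighted combination and one possible way of choosing k can be sketched as below. The text states only that a greedy algorithm is used and does not describe it, so this sketch substitutes a plain grid search over candidate k values that maximizes top-1 accuracy on labelled validation cases; the data layout and the names `composite_score`, `top1_accuracy`, and `search_k` are assumptions made for illustration.

```python
def composite_score(d, t, k=0.3):
    """score = k * D + (1 - k) * T."""
    return k * d + (1 - k) * t

def top1_accuracy(k, cases):
    """cases: list of (candidates, correct_expert) pairs, where candidates
    maps an expert name to its (D, T) scores for one project (assumed layout).
    """
    hits = 0
    for candidates, correct in cases:
        best = max(candidates,
                   key=lambda e: composite_score(*candidates[e], k=k))
        hits += best == correct
    return hits / len(cases)

def search_k(cases, step=0.1):
    """Grid search over k in [0, 1]; NOT necessarily the greedy procedure
    of the application. Returns the first k with the best accuracy."""
    grid = [round(i * step, 10) for i in range(int(round(1 / step)) + 1)]
    return max(grid, key=lambda k: top1_accuracy(k, cases))
```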
Step S10: Determine the review expert according to the ranking of all comprehensive matching scores. Specifically, the final comprehensive matching scores are sorted in descending order, and the expert with the highest comprehensive matching score is selected as the review expert for the electric power science and technology project application form.
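Step S10 amounts to a descending sort followed by taking the top entry. A minimal sketch, assuming the comprehensive matching scores are held in a dictionary keyed by expert name (the function name `rank_experts` and the `top_n` parameter are hypothetical):

```python
def rank_experts(scores, top_n=1):
    """scores: dict mapping expert name -> comprehensive matching score.

    Returns the top_n (expert, score) pairs, highest score first.
    """
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return ranked[:top_n]

# Hypothetical usage with made-up scores for three experts.
print(rank_experts({"expert_a": 0.2, "expert_b": 0.9, "expert_c": 0.5}))
```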
The review expert recommendation method based on pictograph-semantic dual-feature space mapping provided by the embodiment of the present application is explained in detail below through a specific embodiment.
In a specific embodiment of the present application, for the electric power project text data, 2000 documents are selected from an electric power science and technology project declaration database as the corpus. The research topics mainly include high voltage and insulation technology, electrical machines and apparatus, and power systems and automation. The specific embodiment performs word segmentation and stop-word removal on the abstracts of the project application forms and labels the named entities. Because the method provided by the specific embodiment is insensitive to long sequences, the abstracts are split into sentences at periods. In the preprocessed data set, the ratio of sentences that contain the required named entity labels to sentences that do not is kept at 8:1, and the total number of sentences is about 10000.
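Splitting an abstract at periods can be sketched as follows, assuming the abstracts are Chinese text terminated by the full stop "。"; the function name `split_sentences` is hypothetical, and the word segmentation and stop-word removal steps are not shown.

```python
def split_sentences(abstract, period="。"):
    """Split an abstract into sentences at the given period character.

    Empty fragments (for example after a trailing period) are dropped,
    and the period itself is not kept.
    """
    return [s.strip() for s in abstract.split(period) if s.strip()]

# Hypothetical usage on a two-sentence abstract.
print(split_sentences("采用脉冲电流法。研究变压器检修。"))
```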
For data set division, in the specific embodiment of the present application, the 10000 electric power project texts are divided into a training set, a validation set and a test set at a ratio of 8:1:1.
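An 8:1:1 division of this kind can be sketched as below. The shuffling and the fixed random seed are assumptions added for reproducibility; the text only states the ratio.

```python
import random
from typing import List, Sequence, Tuple


def split_dataset(items: Sequence, ratios: Tuple[int, int, int] = (8, 1, 1),
                  seed: int = 0) -> Tuple[List, List, List]:
    """Shuffle the items and split them into training, validation and test sets."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # seeded shuffle (an assumption of this sketch)
    total = sum(ratios)
    n_train = len(shuffled) * ratios[0] // total
    n_valid = len(shuffled) * ratios[1] // total
    train = shuffled[:n_train]
    valid = shuffled[n_train:n_train + n_valid]
    test = shuffled[n_train + n_valid:]  # the remainder goes to the test set
    return train, valid, test


# Sentence indices stand in for the roughly 10000 preprocessed sentences.
train_set, valid_set, test_set = split_dataset(range(10000))
```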
For data labeling, the embodiment of the present application adopts the classic BIO three-tag labeling method: for each entity, the first word is labeled "B-entity name", each subsequent word is labeled "I-entity name", and words that do not belong to any entity required here are labeled "O".
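The BIO scheme can be illustrated with a short sketch that converts token-level entity spans into tags. The label names `METHOD` and `FIELD` and the English sample sentence are hypothetical; the application does not list its tag names.

```python
from typing import List, Tuple


def bio_tags(tokens: List[str], entities: List[Tuple[int, int, str]]) -> List[str]:
    """Build BIO tags from entity spans given as (start, end_exclusive, label)."""
    tags = ["O"] * len(tokens)          # everything outside an entity is "O"
    for start, end, label in entities:
        tags[start] = f"B-{label}"      # first word of the entity
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"      # subsequent words of the entity
    return tags


tokens = ["the", "pulse", "current", "method", "is", "applied", "to",
          "transformer", "overhaul"]
spans = [(1, 4, "METHOD"), (7, 9, "FIELD")]
tags = bio_tags(tokens, spans)
```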
In the word embedding module based on hierarchical representation, in the specific embodiment of the application, a pre-trained RoBERTa model maps words into 1024-dimensional vectors, which are then fed into the training of the BiLSTM + CRF named entity recognition model.
Specifically, fig. 3(a) and fig. 3(b) are schematic diagrams of the entity recognition results for the use method entities and the related domain entities of the applications, as provided for the embodiments of the present application. As can be seen from fig. 3(a), reasonable use method entities such as the pulse current method, the equivalent circuit mathematical model and ensemble learning are identified; as can be seen from fig. 3(b), related domain entities such as transformer overhaul, power transformation projects and clean power sharing are identified. In summary, after the RoBERTa pre-training model is added to the BiLSTM + CRF model, the embodiment of the application can effectively extract the relevant entities from the electric power project text.
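Once the recognition model has produced a BIO tag for each word, the entities of the kind shown in fig. 3 can be read off the tag sequence. The following is a minimal sketch of that post-processing step only, not of the model itself; the label names and the sample sentence are hypothetical.

```python
from typing import List, Tuple


def extract_entities(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Collect (entity text, label) pairs from a BIO-tagged token sequence."""
    entities: List[Tuple[str, str]] = []
    current_tokens: List[str] = []
    current_label = None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A new entity starts; close any entity that is still open.
            if current_tokens:
                entities.append((" ".join(current_tokens), current_label))
            current_tokens, current_label = [token], tag[2:]
        elif tag.startswith("I-") and current_label == tag[2:]:
            current_tokens.append(token)  # continue the open entity
        else:
            # "O" or an inconsistent "I-" tag: close the open entity, skip the token.
            if current_tokens:
                entities.append((" ".join(current_tokens), current_label))
            current_tokens, current_label = [], None
    if current_tokens:
        entities.append((" ".join(current_tokens), current_label))
    return entities


sample_tokens = ["the", "pulse", "current", "method", "for", "transformer", "overhaul"]
sample_tags = ["O", "B-METHOD", "I-METHOD", "I-METHOD", "O", "B-FIELD", "I-FIELD"]
found = extract_entities(sample_tokens, sample_tags)
```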
For the electric power expert text data, when searching for expert-related entities, the specific embodiment of the application selects 8 laboratories under 3 major laboratories of a university's school of electrical engineering, and from each laboratory selects one professor (doctoral supervisor), one associate professor (doctoral supervisor) and one associate professor (master's supervisor), giving 24 experts in total for information extraction. The whole process is divided into crawling of skilled technical entities and crawling of research direction entities, and comprises the following steps:
(1) Searching CNKI with the expert's name and school as keywords, crawling the abstracts of the published articles, performing named entity recognition, and extracting the methods used in the published articles as the expert's skilled technical entities.
(2) Crawling the research directions listed on the expert's homepage with a web crawler, performing word segmentation, and then screening for research direction entities (this step involves little work and is done manually); the screening results are taken as the expert's research direction entities. Part of the screening results is shown in table 1.
Table 1 Entity screening results of expert data (part)
Figure BDA0003204690640000071
As can be seen from table 1, the research direction entities from the expert homepages and the method entities from the published papers are, to a certain extent, comparable with the two types of entities extracted from the electric power science and technology project applications, which provides a basis for the subsequent entity matching process.
Through the above processing, the preprocessing results for the two kinds of heterogeneous data, namely the electric power science and technology project applications and the expert data, are obtained as four types of entities that are related to one another. The embodiment of the application then uses pictograph-semantic dual-feature space matching to match the four types of entities; the specific matching process is shown in fig. 4.
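As background for the pictographic side of this matching, the cw2vec model used at the pictographic level of this embodiment is generally described as learning word embeddings from stroke n-grams. The sketch below illustrates only the n-gram extraction. It assumes the common five-class stroke coding (1 horizontal, 2 vertical, 3 left-falling, 4 right-falling or dot, 5 turning) and a tiny hand-written stroke table; a real system would need a complete stroke dictionary, and the n-gram length range used here is an illustrative choice.

```python
from typing import Dict, List

# Hypothetical, hand-written stroke table for two characters only.
# 大 = horizontal, left-falling, right-falling; 人 = left-falling, right-falling.
STROKES: Dict[str, str] = {"大": "134", "人": "34"}


def stroke_ngrams(word: str, min_n: int = 3, max_n: int = 5) -> List[str]:
    """Return all stroke n-grams of the word for n in [min_n, max_n]."""
    # Concatenate the stroke codes of every character in the word.
    sequence = "".join(STROKES[ch] for ch in word)
    return [
        sequence[i:i + n]
        for n in range(min_n, max_n + 1)
        for i in range(len(sequence) - n + 1)
    ]


ngrams = stroke_ngrams("大人")  # stroke sequence "13434"
```

Words that share stroke n-grams end up with related pictographic embeddings, which is what allows entities with similar written forms to be matched even when their semantic embeddings differ.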
As can be seen from fig. 4, to match these four types of entities, the embodiment of the present application uses stroke-based pictographic space mapping and sequence-information-based semantic space mapping to map the entities into feature vectors. In the matching process, the related domain entities of the electric power science and technology project application and the research direction entities of the experts are compared for similarity at the same level, and the use method entities of the application and the skilled technical entities of the experts are compared for similarity at the same level. The specific process is as follows:
(1) Mapping the four types of entities into 512-dimensional feature vectors through a cw2vec model at the pictographic level and a RoBERTa model at the semantic level, respectively.
(2) Performing all-pairs Euclidean distance and cosine similarity calculations between the related domain entities and the research direction entities, and between the use method entities and the skilled technical entities, in the pictographic feature space and the semantic feature space respectively, and taking the sum of the highest values of the two as the entity matching score.
(3) Performing a weighted combination of the matching scores of the electric power science and technology project application and the expert at the domain (direction) level and the method (technology) level to obtain the final comprehensive matching score, wherein the weight of the domain (direction) level matching score is set to 0.3 and the weight of the method (technology) level matching score is set to 0.7.
(4) Taking the expert with the highest comprehensive matching score as the review expert for the electric power science and technology project application.
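Step (2) above can be sketched as follows for a single level (domain or method). This is one possible reading, not the formula of the application, which is given only as a figure. It assumes that each entity carries a pictographic vector and a semantic vector, that the Euclidean distance is turned into a similarity as 1/(1 + distance) so that larger values mean a closer match, and that the best pictographic pair and the best semantic pair are found independently before being summed. The toy vectors are two-dimensional for readability rather than 512-dimensional.

```python
import math
from typing import List, Sequence, Tuple

Entity = Tuple[Sequence[float], Sequence[float]]  # (pictographic vector, semantic vector)


def euclidean(u: Sequence[float], v: Sequence[float]) -> float:
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0  # zero vectors get similarity 0


def level_score(project: List[Entity], expert: List[Entity]) -> float:
    """Sum of the best pictographic and best semantic match over all entity pairs."""
    # Assumption: distance is converted to a similarity via 1 / (1 + distance).
    best_pictographic = max(
        1.0 / (1.0 + euclidean(p[0], e[0])) for p in project for e in expert
    )
    best_semantic = max(cosine(p[1], e[1]) for p in project for e in expert)
    return best_pictographic + best_semantic


# Toy example: one project entity and two expert entities.
project_entities = [([1.0, 0.0], [1.0, 0.0])]
expert_entities = [([1.0, 0.0], [0.0, 1.0]), ([0.0, 0.0], [1.0, 0.0])]
d_score = level_score(project_entities, expert_entities)
```

The same function would be applied once to the related domain and research direction entities and once to the use method and skilled technical entities, giving the two level scores that step (3) combines with weights 0.3 and 0.7.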
To verify the effectiveness of the pictograph-semantic dual-feature space mapping matching algorithm, three groups of comparison experiments are performed in the specific embodiment of the application: semantic space mapping + cosine similarity matching, pictographic space mapping + cosine similarity matching, and pictograph-semantic dual-feature space mapping + cosine similarity matching. The matching results between the power project entities and the power expert entities are shown in fig. 5.
As can be seen from fig. 5, the embodiment of the present application performs multi-scale representation learning on the preprocessed heterogeneous data (the electric power science and technology project applications and the related electric power experts), and the accuracy of matching the 2000 electric power project documents with the 24 electric power expert texts reaches a maximum of 0.85. The result shows that the pictographic space and the semantic space capture word information at the pictographic level and the semantic level respectively, that the two feature spaces are strongly complementary, and that entity matching with both is more adequate than matching with entity information mapped through a single feature space.
In summary, compared with the prior art, the embodiment of the present application has the following features:
(1) Named entity recognition and entity matching are applied to heterogeneous data, realizing an end-to-end matching method in which the whole process requires no manual participation.
(2) The pre-trained RoBERTa model is introduced into the training of the BiLSTM + CRF named entity recognition model, so that the training efficiency and accuracy are greatly improved.
(3) When the entities are matched, the idea of pictograph-semantic dual-feature space matching is introduced, achieving a more accurate matching effect.
(4) The method generalizes well and can be used for expert recommendation for applications in other industries as long as the corresponding documents are provided.
The application provides a review expert recommendation method based on pictograph-semantic dual-feature space mapping, which specifically comprises the following steps:
Acquiring abstract information of the electric power science and technology project application.
Carrying out named entity recognition on the abstract information of the electric power project application to obtain an electric power project entity, wherein the electric power project entity comprises a use method entity and a related domain entity.
Crawling the personal homepage information of the power expert and the abstract information of published papers.
Carrying out named entity recognition on the personal homepage information of the power expert and the abstract information of the published papers to obtain a power expert entity, wherein the power expert entity comprises a skilled technical entity and a research direction entity.
Carrying out pictographic mapping on the use method entity to obtain a pictographic use method entity, and carrying out pictographic mapping on the related domain entity to obtain a pictographic related domain entity.
Carrying out pictographic mapping on the skilled technical entity to obtain a pictographic skilled technical entity, and carrying out pictographic mapping on the research direction entity to obtain a pictographic research direction entity.
Carrying out semantic mapping on the pictographic use method entity to obtain a use method feature vector, and carrying out semantic mapping on the pictographic related domain entity to obtain a related domain feature vector.
Carrying out semantic mapping on the pictographic skilled technical entity to obtain a skilled technical feature vector, and carrying out semantic mapping on the pictographic research direction entity to obtain a research direction feature vector.
Calculating a comprehensive matching score according to the use method feature vector, the related domain feature vector, the skilled technical feature vector and the research direction feature vector.
Determining the review expert according to the ranking of all the comprehensive matching scores.
According to the above technical scheme, the review expert recommendation method based on pictograph-semantic dual-feature space mapping first uses a RoBERTa pre-training model to represent the texts hierarchically, and then uses a BiLSTM + CRF model to recognize the named entities of the electric power project texts and the electric power expert texts. The named entities are then mapped into feature vectors through the pictograph-semantic dual-feature space, Euclidean distance and cosine similarity calculations are performed on the obtained feature vectors to obtain the related matching scores, and the related matching scores are weighted and summed to obtain the comprehensive matching scores. Finally, the expert with the highest comprehensive matching score is taken as the review expert for the electric power project text. The application thus provides a matching strategy between project texts and domain expert entities based on pictograph-semantic dual-feature space mapping, which intelligently achieves effective and accurate matching of projects and domain experts, thereby reducing the labor cost of review work, enhancing the reliability of review results and improving the overall review efficiency.
The present application has been described in detail with reference to specific embodiments and illustrative examples, but the description is not intended to limit the application. Those skilled in the art will appreciate that various equivalent substitutions, modifications or improvements may be made to the presently disclosed embodiments and implementations thereof without departing from the spirit and scope of the present disclosure, and these fall within the scope of the present disclosure. The protection scope of this application is subject to the appended claims.

Claims (10)

1. A review expert recommendation method based on pictograph-semantic dual-feature space mapping is characterized by comprising the following steps:
acquiring abstract information of an electric power science and technology project application form;
carrying out named entity identification on the abstract information of the electric power project application form to obtain an electric power project entity, wherein the electric power project entity comprises a use method entity and a related field entity;
crawling the personal homepage information of the power expert and the abstract information of published papers;
carrying out named entity recognition on the personal homepage information of the power expert and the abstract information of the published paper to obtain a power expert entity, wherein the power expert entity comprises a skilled technical entity and a research direction entity;
pictographic mapping is carried out on the use method entity to obtain a pictographic use method entity, and pictographic mapping is carried out on the related field entity to obtain a pictographic related field entity;
pictographic mapping is carried out on the skilled technical entity to obtain a pictographic skilled technical entity, and pictographic mapping is carried out on the research direction entity to obtain a pictographic research direction entity;
semantic mapping is carried out on the pictograph use method entity to obtain a use method feature vector, and semantic mapping is carried out on the pictograph related field entity to obtain a related field feature vector;
semantic mapping is carried out on the pictographic skilled technical entity to obtain a skilled technical feature vector, and semantic mapping is carried out on the pictographic research direction entity to obtain a research direction feature vector;
calculating to obtain a comprehensive matching score according to the use method feature vector, the related field feature vector, the skilled technical feature vector and the research direction feature vector;
and determining the review experts according to the level of all the comprehensive matching scores.
2. The expert review recommendation method based on pictographic-semantic bi-feature space mapping of claim 1, wherein named entity recognition is performed using RoBERTa pre-training model and BiLSTM + CRF model.
3. The expert review recommendation method based on pictograph-semantic dual feature space mapping as claimed in claim 2, wherein the specific method for named entity recognition using RoBERTa pre-training model and BiLSTM + CRF model is:
acquiring text information;
segmenting words of the text information to obtain a word set;
vector mapping is carried out on the word set by using a RoBERTa pre-training model to obtain a word vector set;
and training the word vector set by using a BiLSTM + CRF model to obtain the named entity of the text information.
4. The expert review recommendation method based on pictograph-semantic dual feature space mapping as claimed in claim 3, wherein the specific method for obtaining the research direction entity and the skilled technical entity is:
acquiring keywords of evaluation expert information;
crawling expert personal homepage information and abstract information of published papers according to the keywords;
carrying out named entity recognition on the expert personal homepage information according to a RoBERTa pre-training model and a BiLSTM + CRF model to obtain the research direction entity;
and carrying out named entity recognition on the abstract information of the published paper according to a RoBERTa pre-training model and a BiLSTM + CRF model to obtain the skilled technical entity.
5. The expert review recommendation method based on the pictograph-semantic dual feature space mapping as claimed in claim 1, wherein the specific method for obtaining the comprehensive matching score by calculation is as follows:
performing Euclidean distance calculation on the related field characteristic vector and the research direction characteristic vector in a pictographic characteristic space to obtain a first pictographic matching score;
cosine similarity calculation is carried out on the related field characteristic vector and the research direction characteristic vector in a semantic characteristic space, and a first semantic matching score is obtained;
performing Euclidean distance calculation on the using method characteristic vector and the researching method characteristic vector in a pictographic characteristic space to obtain a second pictographic matching score;
cosine similarity calculation is carried out on the using method characteristic vector and the researching method characteristic vector in a semantic characteristic space, and a second semantic matching score is obtained;
summing the first pictographic matching score and the first semantic matching score to obtain a research direction matching score;
summing the second pictographic matching score and the second semantic matching score to obtain a research method matching score;
and carrying out weighted summation on the research direction matching score and the research method matching score to obtain a comprehensive matching score.
6. The expert review recommendation method based on the pictograph-semantic dual feature space mapping according to claim 5, characterized in that the Euclidean distance calculation is performed by adopting the following method:
Figure FDA0003204690630000021
wherein D is the entity similarity score of the domain (direction) level, F is the set corresponding to the related domain entities, R is the set corresponding to the research direction entities,
Figure FDA0003204690630000022
is the pictographic space embedding corresponding to an entity in the F set,
Figure FDA0003204690630000023
is the pictographic space embedding corresponding to an entity in the R set,
Figure FDA0003204690630000024
is the semantic space embedding corresponding to an entity in the F set, and
Figure FDA0003204690630000025
is the semantic space embedding corresponding to an entity in the R set.
7. The expert review recommendation method based on pictograph-semantic dual feature space mapping according to claim 6, characterized in that the cosine similarity calculation is performed by adopting the following method:
Figure FDA0003204690630000026
wherein T is the entity similarity score of the method (technology) level, O is the set corresponding to the use method entities, L is the set corresponding to the skilled technical entities,
Figure FDA0003204690630000027
is the pictographic space embedding corresponding to an entity in the O set,
Figure FDA0003204690630000028
is the pictographic space embedding corresponding to an entity in the L set,
Figure FDA0003204690630000029
is the semantic space embedding corresponding to an entity in the O set, and
Figure FDA00032046906300000210
is the semantic space embedding corresponding to an entity in the L set.
8. The expert review recommendation method based on pictograph-semantic dual feature space mapping according to claim 7, characterized in that the following method is adopted to perform the comprehensive matching score calculation:
score=k×D+(1-k)×T
where score is the composite match score and k is the weight.
9. The expert review recommendation method based on pictograph-semantic dual feature space mapping as claimed in claim 8 wherein a greedy algorithm is used to calculate the k value.
10. The expert review recommendation method based on pictographic-semantic bi-feature space mapping as claimed in claim 9 wherein the k value is set to 0.3.
CN202110913345.6A 2021-08-10 2021-08-10 Evaluation expert recommendation method based on pictographic-semantic dual-feature space mapping Active CN113569575B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110913345.6A CN113569575B (en) 2021-08-10 2021-08-10 Evaluation expert recommendation method based on pictographic-semantic dual-feature space mapping

Publications (2)

Publication Number Publication Date
CN113569575A true CN113569575A (en) 2021-10-29
CN113569575B CN113569575B (en) 2024-02-09

Family

ID=78171076

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110913345.6A Active CN113569575B (en) 2021-08-10 2021-08-10 Evaluation expert recommendation method based on pictographic-semantic dual-feature space mapping

Country Status (1)

Country Link
CN (1) CN113569575B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117093670A (en) * 2023-07-18 2023-11-21 北京智信佳科技有限公司 Method for realizing intelligent recommending expert in paper

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP1916609A1 (en) * 2006-10-26 2008-04-30 Hierodiction Software GmbH Text analysis, transliteration and translation method and apparatus for hieroglyphic, hieratic, and demotic texts from Ancient Egyptian
CN103106195A (en) * 2013-01-21 2013-05-15 刘树根 Ideographical member identification and extraction method and machine-translation and manual-correction interactive translation method based on ideographical members
US20150309994A1 (en) * 2013-01-21 2015-10-29 Shugen Liu Ideographical member identification and extraction method and machine-translation and manual-correction interactive translation method based on ideographical members
CN103631859A (en) * 2013-10-24 2014-03-12 杭州电子科技大学 Intelligent review expert recommending method for science and technology projects
CN107343010A (en) * 2017-08-26 2017-11-10 海南大学 Towards automatic safe Situation Awareness, analysis and the warning system of typing resource
CN107977361A (en) * 2017-12-06 2018-05-01 哈尔滨工业大学深圳研究生院 The Chinese clinical treatment entity recognition method represented based on deep semantic information
CN111563380A (en) * 2019-01-25 2020-08-21 浙江大学 Named entity identification method and device
CN111126069A (en) * 2019-12-30 2020-05-08 华南理工大学 Social media short text named entity identification method based on visual object guidance
CN111782797A (en) * 2020-07-13 2020-10-16 贵州省科技信息中心 Automatic matching method for scientific and technological project review experts and storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
王晓华;方强;张钰;: "科研项目专家评审策略优化推荐仿真分析", 计算机仿真, no. 09 *

Also Published As

Publication number Publication date
CN113569575B (en) 2024-02-09

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant