CN113569575A - Evaluation expert recommendation method based on pictograph-semantic dual-feature space mapping - Google Patents
- Publication number
- CN113569575A (application CN202110913345.6A)
- Authority
- CN
- China
- Prior art keywords
- entity
- semantic
- pictographic
- mapping
- expert
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
- G06F40/295—Named entity recognition
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/335—Filtering based on additional data, e.g. user or group profiles
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/951—Indexing; Web crawling techniques
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/10—Text processing
- G06F40/194—Calculation of difference between files
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/30—Semantic analysis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/044—Recurrent networks, e.g. Hopfield networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y04—INFORMATION OR COMMUNICATION TECHNOLOGIES HAVING AN IMPACT ON OTHER TECHNOLOGY AREAS
- Y04S—SYSTEMS INTEGRATING TECHNOLOGIES RELATED TO POWER NETWORK OPERATION, COMMUNICATION OR INFORMATION TECHNOLOGIES FOR IMPROVING THE ELECTRICAL POWER GENERATION, TRANSMISSION, DISTRIBUTION, MANAGEMENT OR USAGE, i.e. SMART GRIDS
- Y04S10/00—Systems supporting electrical power generation, transmission or distribution
- Y04S10/50—Systems or methods supporting the power network operation or management, involving a certain degree of interaction with the load-side end user applications
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Computational Linguistics (AREA)
- Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Data Mining & Analysis (AREA)
- General Health & Medical Sciences (AREA)
- Databases & Information Systems (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Biomedical Technology (AREA)
- Computing Systems (AREA)
- Molecular Biology (AREA)
- Mathematical Physics (AREA)
- Software Systems (AREA)
- Evolutionary Computation (AREA)
- Biophysics (AREA)
- Life Sciences & Earth Sciences (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
- Management, Administration, Business Operations System, And Electronic Commerce (AREA)
Abstract
The application relates to the technical field of expert recommendation and provides a review expert recommendation method based on pictographic-semantic dual-feature space mapping. An entity matching strategy based on pictographic-semantic dual-feature space mapping is proposed that matches projects to experts effectively and accurately, thereby reducing the labor cost of review work, improving the reliability of review results, and raising the overall review efficiency.
Description
Technical Field
The application relates to the technical field of expert recommendation, in particular to a review expert recommendation method based on pictograph-semantic dual-feature space mapping.
Background
As theoretical innovation in the national power grid has been strongly promoted in areas such as ultra-high-voltage AC/DC power grids, smart grids, and the third industrial revolution, the number of applications for innovative electric power science and technology projects has increased greatly and continues to grow.
Under these circumstances, reviewing electric power science and technology project applications is difficult and burdensome, and everything from the formal review to the content quality review must be completed with high quality and high efficiency. The most important step in the review process is the expert's review of the content quality of the application. An accurate evaluation result can be obtained only when the technologies and fields mastered by the review expert match the content of the application, so the reliability of the evaluation result is directly tied to the degree of matching.
However, at present most matching of review experts to project applications is done by manual random assignment or by recommendation from people with deep professional knowledge. Because manual operation is inevitably subjective, the current way of matching review experts to project applications makes the labor cost of review work too high, the reliability of review results poor, and the overall review efficiency low.
Disclosure of Invention
To overcome the defects of the prior art, the application aims to provide a review expert recommendation method based on pictograph-semantic dual-feature space mapping, so as to solve at least one of the following technical problems: excessive labor cost of review work, weak reliability of review results, and low overall review efficiency.
To achieve the above object, the present application provides a review expert recommendation method based on pictograph-semantic dual-feature space mapping, which specifically includes:
Acquiring abstract information of an electric power science and technology project application.
Performing named entity recognition on the abstract information of the electric power science and technology project application to obtain electric power project entities, which comprise usage method entities and related field entities.
Crawling the personal homepage information of electric power experts and the abstract information of their published papers.
Performing named entity recognition on the personal homepage information of the electric power experts and the abstract information of the published papers to obtain electric power expert entities, which comprise adept technology entities and research direction entities.
Performing pictographic mapping on the usage method entity to obtain a pictographic usage method entity, and performing pictographic mapping on the related field entity to obtain a pictographic related field entity.
Performing pictographic mapping on the adept technology entity to obtain a pictographic adept technology entity, and performing pictographic mapping on the research direction entity to obtain a pictographic research direction entity.
Performing semantic mapping on the pictographic usage method entity to obtain a usage method feature vector, and performing semantic mapping on the pictographic related field entity to obtain a related field feature vector.
Performing semantic mapping on the pictographic adept technology entity to obtain an adept technology feature vector, and performing semantic mapping on the pictographic research direction entity to obtain a research direction feature vector.
Calculating a comprehensive matching score from the usage method feature vector, the related field feature vector, the adept technology feature vector, and the research direction feature vector.
Determining the review expert according to the ranking of all comprehensive matching scores.
Further, named entity recognition is performed using a RoBERTa pre-training model and a BiLSTM + CRF model.
Further, the specific method for performing named entity recognition using the RoBERTa pre-training model and the BiLSTM + CRF model comprises the following steps:
Acquiring text information.
Segmenting the text information to obtain a word set.
Performing vector mapping on the word set using the RoBERTa pre-training model to obtain a word vector set.
Training the BiLSTM + CRF model on the word vector set and using it to obtain the named entities of the text information.
Further, the specific method for obtaining the research direction entity and the adept technology entity comprises the following steps:
Acquiring keywords of the review expert information.
Crawling the personal homepage information of the experts and the abstract information of their published papers according to the keywords.
Performing named entity recognition on the expert personal homepage information using the RoBERTa pre-training model and the BiLSTM + CRF model to obtain the research direction entity.
Performing named entity recognition on the abstract information of the published papers using the RoBERTa pre-training model and the BiLSTM + CRF model to obtain the adept technology entity.
Further, the specific method for calculating the comprehensive matching score is as follows:
Performing Euclidean distance calculation on the related field feature vector and the research direction feature vector in the pictographic feature space to obtain a first pictographic matching score.
Performing cosine similarity calculation on the related field feature vector and the research direction feature vector in the semantic feature space to obtain a first semantic matching score.
Performing Euclidean distance calculation on the usage method feature vector and the adept technology (research method) feature vector in the pictographic feature space to obtain a second pictographic matching score.
Performing cosine similarity calculation on the usage method feature vector and the adept technology (research method) feature vector in the semantic feature space to obtain a second semantic matching score.
Summing the first pictographic matching score and the first semantic matching score to obtain a research direction matching score.
Summing the second pictographic matching score and the second semantic matching score to obtain a research method matching score.
Performing weighted summation on the research direction matching score and the research method matching score to obtain the comprehensive matching score.
Further, the Euclidean distance calculation is performed by the following method:
(formula image not reproduced) where D is the entity similarity score at the field (direction) level, F is the set of related field entities, R is the set of research direction entities, and the four remaining symbols denote, in order: the pictographic-space embedding of an entity in F, the pictographic-space embedding of an entity in R, the semantic-space embedding of an entity in F, and the semantic-space embedding of an entity in R.
Further, the cosine similarity calculation is performed by the following method:
(formula image not reproduced) where T is the entity similarity score at the method (technology) level, O is the set of usage method entities, L is the set of adept technology entities, and the four remaining symbols denote, in order: the pictographic-space embedding of an entity in O, the pictographic-space embedding of an entity in L, the semantic-space embedding of an entity in O, and the semantic-space embedding of an entity in L.
Further, the following method is adopted for calculating the comprehensive matching score:
score=k×D+(1-k)×T
where score is the comprehensive matching score and k is the weight.
Further, a greedy algorithm is adopted to calculate the k value.
Further, the k value is set to 0.3.
The method first uses a RoBERTa pre-training model to produce a hierarchical representation of the text, and then uses a BiLSTM + CRF model to recognize the named entities in the electric power project text and the electric power expert text. The named entities are mapped into feature vectors through the pictographic-semantic dual feature space, Euclidean distance and cosine similarity are calculated on the resulting feature vectors to obtain the individual matching scores, and these scores are combined by weighted summation into a comprehensive matching score. Finally, the expert with the highest comprehensive matching score is taken as the review expert for the electric power project text. The application thus provides an entity matching strategy between project texts and domain experts based on pictographic-semantic dual-feature space mapping, which matches projects to domain experts effectively and accurately, thereby reducing the labor cost of review work, improving the reliability of review results, and raising the overall review efficiency.
Drawings
In order to more clearly explain the technical solution of the present application, the drawings needed to be used in the embodiments will be briefly described below, and it is obvious to those skilled in the art that other drawings can be obtained according to the drawings without creative efforts.
FIG. 1 is a schematic flowchart of the review expert recommendation method based on pictograph-semantic dual feature space mapping provided in an embodiment of the present application;
FIG. 2 is a schematic flowchart of the method for calculating the comprehensive matching score provided in an embodiment of the present application;
FIG. 3(a) is a schematic diagram of the recognition result for the related field entities of an application, provided in an embodiment of the present application;
FIG. 3(b) is a schematic diagram of the recognition result for the usage method entities of an application, provided in an embodiment of the present application;
FIG. 4 is a schematic diagram of the heterogeneous matching process based on dual feature space mapping provided in an embodiment of the present application;
FIG. 5 is a schematic diagram comparing the matching effects between electric power project entities and electric power expert entities, provided in an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be fully and clearly described below with reference to the drawings in the embodiments of the present application. It is to be understood that the embodiments described are only a few embodiments of the present application and not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
In order to facilitate understanding of the technical solutions of the present application, some concepts related to the present application are first described below.
The RoBERTa model is an improved Chinese pre-training model based on BERT. Compared with the original BERT, RoBERTa increases the batch size, introduces a dynamic masking mechanism, enlarges the training corpus, and removes the NSP (next sentence prediction) term from the loss function. Specifically, the batch size is increased from 256 to 8000, 10 different maskings are used so that samples in different epochs are not always covered by the same fixed mask, and the training data grows from 13 GB to 160 GB.
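As an illustration only, the dynamic masking idea described above can be sketched as follows: the mask is re-sampled every time a sequence is used, so different epochs see different masked positions. The 15% masking rate, the `[MASK]` placeholder string, the sample text, and the function name `dynamic_mask` are assumptions made for this sketch and are not taken from the application.

```python
import random

MASK_TOKEN = "[MASK]"  # assumed placeholder token, for illustration only


def dynamic_mask(tokens, mask_prob=0.15, rng=None):
    """Return a freshly masked copy of tokens and the masked positions."""
    if not tokens:
        return [], []
    rng = rng or random.Random()
    # Mask at least one token; the rate is an assumed value.
    n_mask = max(1, round(len(tokens) * mask_prob))
    positions = sorted(rng.sample(range(len(tokens)), n_mask))
    masked = list(tokens)
    for pos in positions:
        masked[pos] = MASK_TOKEN
    return masked, positions


# Character-level tokens of an invented sample phrase.
tokens = list("变压器绕组的无功补偿方法")
rng = random.Random(0)
for epoch in range(3):
    # A new mask is drawn on every pass instead of being fixed in advance.
    masked, positions = dynamic_mask(tokens, rng=rng)
    print(epoch, positions, "".join(masked))
```

With static masking the positions would be chosen once during preprocessing; here they are drawn again on each call.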
Specifically, the input of the RoBERTa model is composed of a word vector, a sentence vector, and a position vector. The word vectors include the encoding vectors of the classification token and the separator token; the sentence vector is an encoding vector used to distinguish different sentences; the position vector encodes the position of each word in the sentence. The model output is the word embedding matrix obtained after all words of the sentence are encoded by the self-attention encoder.
The recurrent neural network (RNN) is the most widely used neural network for sequence learning tasks. The bidirectional long short-term memory network (Bi-LSTM) is a variant of the RNN that processes the sequence in both directions and has a special gating structure, which effectively addresses the vanishing and exploding gradient problems.
The conditional random field (CRF) is a commonly used sequence labeling algorithm.
Matching experts to project applications can be regarded as heterogeneous data matching, in which data from different sources are first preprocessed and then matched against each other to produce a reasonable output. Preprocessing the project application data involves named entity recognition, that is, identifying and classifying proper names in the text such as names of people, places, and organizations. Preprocessing of the expert data is completed by extracting the data from web pages using crawler technology. Once both kinds of data have been processed, heterogeneous matching can yield results that are more accurate and efficient than manual recommendation.
Specifically, for recognizing the named entities of an electric power project application, the following concepts are first defined:
(1) Usage method entity: a method used in the application, such as the zero-sequence harmonic component principle, the thermal step current method, or the electromagnetic coupling principle.
(2) Related field entity: a field to which the application relates, such as reactive power compensation, power transformation engineering, or economic power transmission.
Referring to FIG. 1, a schematic flowchart of the review expert recommendation method based on pictograph-semantic dual feature space mapping provided in an embodiment of the present application is shown. The embodiment of the application provides a review expert recommendation method based on pictograph-semantic dual-feature space mapping, which specifically comprises the following steps:
Step S1: Acquiring abstract information of an electric power science and technology project application.
Step S2: Performing named entity recognition on the abstract information of the electric power science and technology project application to obtain electric power project entities, which comprise usage method entities and related field entities.
Step S3: Crawling the personal homepage information of electric power experts and the abstract information of their published papers.
Step S4: Performing named entity recognition on the personal homepage information of the electric power experts and the abstract information of the published papers to obtain electric power expert entities, which comprise adept technology entities and research direction entities.
Further, named entity recognition is performed using a RoBERTa pre-training model and a BiLSTM + CRF model. Specifically, the feature extractor of the RoBERTa model is a bidirectional Transformer. Each Transformer unit is composed of a self-attention layer, a feed-forward network, and a normalization layer (Add & Normalization); this structure makes full use of context information and captures longer-distance dependencies.
In the embodiment of the present application, the BiLSTM model outputs a score matrix of each word against each label, called the "emission matrix" A. Specifically, the hidden-layer vector of each word is passed through a linear layer (the final classification step of the BiLSTM, which maps the hidden state to scores), and the resulting values are used as the scores of the labels for that word.
Meanwhile, the embodiment of the present application uses a linear-chain CRF model to learn the dependencies among the labels in the sequence, that is, to predict the label sequence corresponding to the input sequence.
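As an illustration only, the way a linear-chain CRF combines the emission matrix with label-to-label transition scores at prediction time can be sketched with Viterbi decoding. The label names, the numeric scores, and the function name `viterbi_decode` are hypothetical; the application does not disclose its transition scores or decoding code.

```python
def viterbi_decode(emissions, transitions, labels):
    """Return the highest-scoring label sequence.

    emissions[t][j]: score of label j at position t (the emission matrix).
    transitions[i][j]: score of moving from label i to label j.
    """
    n_labels = len(labels)
    score = list(emissions[0])
    backpointers = []
    for t in range(1, len(emissions)):
        new_score, pointers = [], []
        for j in range(n_labels):
            # Best previous label for reaching label j at position t.
            best_i = max(range(n_labels),
                         key=lambda i: score[i] + transitions[i][j])
            new_score.append(score[best_i] + transitions[best_i][j]
                             + emissions[t][j])
            pointers.append(best_i)
        score = new_score
        backpointers.append(pointers)
    # Trace the best path backwards from the best final label.
    best = max(range(n_labels), key=lambda j: score[j])
    path = [best]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return [labels[i] for i in path]


labels = ["O", "B-METHOD", "I-METHOD"]  # hypothetical label set
emissions = [          # one row per word, one column per label
    [2.0, 0.5, 0.1],
    [0.2, 1.5, 1.4],
    [0.3, 0.4, 1.0],
]
transitions = [        # rows: from-label, columns: to-label
    [0.5, 0.5, -5.0],  # O -> I-METHOD is heavily penalized
    [0.0, -1.0, 1.0],
    [0.0, -1.0, 1.0],
]
print(viterbi_decode(emissions, transitions, labels))
# ['O', 'B-METHOD', 'I-METHOD']
```

The transition scores are what allow the CRF to rule out label sequences that the per-word emission scores alone would permit, such as an "I" label directly after "O".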
Furthermore, the specific method for performing named entity recognition using the RoBERTa pre-training model and the BiLSTM + CRF model comprises the following steps:
Step S411: Acquiring text information.
Step S412: Segmenting the text information to obtain a word set.
Step S413: Performing vector mapping on the word set using the RoBERTa pre-training model to obtain a word vector set.
Step S414: Training the BiLSTM + CRF model on the word vector set and using it to obtain the named entities of the text information.
Specifically, in the embodiment of the application, the original input is passed through the RoBERTa model, which outputs word vectors; the word vectors are used as the input of the BiLSTM + CRF model, whose output gives the named entities.
Furthermore, the specific method for obtaining the research direction entity and the adept technology entity is as follows:
Step S421: Acquiring keywords of the review expert information.
Step S422: Crawling the personal homepage information of the experts and the abstract information of their published papers according to the keywords.
Step S423: Performing named entity recognition on the expert personal homepage information using the RoBERTa pre-training model and the BiLSTM + CRF model to obtain the research direction entity.
Step S424: Performing named entity recognition on the abstract information of the published papers using the RoBERTa pre-training model and the BiLSTM + CRF model to obtain the adept technology entity.
Specifically, if the number of electric power expert texts is small, the research direction entities in the crawled electric power expert texts can be screened manually. Further, because the research method entities (adept technology entities) of the electric power expert texts are comparable to the usage method entities of the electric power projects, the embodiment of the present application recognizes the research method entities of the expert texts with the same model used to recognize the usage method entities of the project texts.
Step S5: Performing pictographic mapping on the usage method entity to obtain a pictographic usage method entity, and performing pictographic mapping on the related field entity to obtain a pictographic related field entity.
Step S6: Performing pictographic mapping on the adept technology entity to obtain a pictographic adept technology entity, and performing pictographic mapping on the research direction entity to obtain a pictographic research direction entity.
Step S7: Performing semantic mapping on the pictographic usage method entity to obtain a usage method feature vector, and performing semantic mapping on the pictographic related field entity to obtain a related field feature vector.
Step S8: Performing semantic mapping on the pictographic adept technology entity to obtain an adept technology feature vector, and performing semantic mapping on the pictographic research direction entity to obtain a research direction feature vector.
Step S9: Calculating a comprehensive matching score from the usage method feature vector, the related field feature vector, the adept technology feature vector, and the research direction feature vector.
Further, referring to FIG. 2, a schematic flowchart of the method for calculating the comprehensive matching score provided in the embodiment of the present application is shown. In the embodiment of the present application, the comprehensive matching score is obtained as follows:
Step S91: Performing Euclidean distance calculation on the related field feature vector and the research direction feature vector in the pictographic feature space to obtain a first pictographic matching score.
Further, the Euclidean distance calculation is performed by the following method:
(formula image not reproduced) where D is the entity similarity score at the field (direction) level, F is the set of related field entities, R is the set of research direction entities, and the four remaining symbols denote, in order: the pictographic-space embedding of an entity in F, the pictographic-space embedding of an entity in R, the semantic-space embedding of an entity in F, and the semantic-space embedding of an entity in R.
Step S92: Performing cosine similarity calculation on the related field feature vector and the research direction feature vector in the semantic feature space to obtain a first semantic matching score.
Further, the cosine similarity calculation is performed by the following method:
(formula image not reproduced) where T is the entity similarity score at the method (technology) level, O is the set of usage method entities, L is the set of adept technology entities, and the four remaining symbols denote, in order: the pictographic-space embedding of an entity in O, the pictographic-space embedding of an entity in L, the semantic-space embedding of an entity in O, and the semantic-space embedding of an entity in L.
Step S93: Performing Euclidean distance calculation on the usage method feature vector and the adept technology (research method) feature vector in the pictographic feature space to obtain a second pictographic matching score.
Step S94: Performing cosine similarity calculation on the usage method feature vector and the adept technology (research method) feature vector in the semantic feature space to obtain a second semantic matching score.
Step S95: Summing the first pictographic matching score and the first semantic matching score to obtain a research direction matching score.
Step S96: Summing the second pictographic matching score and the second semantic matching score to obtain a research method matching score.
Step S97: Performing weighted summation on the research direction matching score and the research method matching score to obtain the comprehensive matching score.
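As an illustration only, the two primitives used in steps S91 to S96 can be sketched in plain Python. The text does not state how a Euclidean distance is converted into a matching score, nor how scores are aggregated over entity sets, so the `1 / (1 + distance)` conversion and the function name `level_score` are assumptions made purely for this sketch.

```python
import math


def euclidean_distance(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def cosine_similarity(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def level_score(glyph_a, glyph_b, sem_a, sem_b):
    """Sum of a pictographic score and a semantic score for one entity pair.

    Assumption: the distance is turned into a similarity in (0, 1]
    so that a larger value always means a better match.
    """
    pictographic = 1.0 / (1.0 + euclidean_distance(glyph_a, glyph_b))
    semantic = cosine_similarity(sem_a, sem_b)
    return pictographic + semantic


# Toy 2-dimensional embeddings, invented for the example.
print(euclidean_distance([0.0, 0.0], [3.0, 4.0]))
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))
print(level_score([1.0, 2.0], [1.0, 2.0], [1.0, 0.0], [2.0, 0.0]))
```

Applying `level_score` to field and direction embeddings would correspond to the research direction matching score, and applying it to method and technology embeddings to the research method matching score.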
Further, the following method is adopted for calculating the comprehensive matching score:
score=k×D+(1-k)×T
In the formula, score is the comprehensive matching score, and k is a hyperparameter, namely the weight representing the importance of matching at the field (direction) level.
Further, in the embodiment of the present application, a greedy algorithm is used to determine the k value; after repeated verification, k = 0.3 was found to be most suitable and is used in the embodiment.
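As an illustration only, the weighted sum score = k×D + (1-k)×T can be written directly, with the default k = 0.3 taken from the text. The application does not describe its greedy search for k, so `pick_k` below is merely a hypothetical stand-in that picks the best of a few candidate weights under a caller-supplied evaluation function; the toy objective is not experimental data.

```python
def composite_score(d, t, k=0.3):
    """Weighted sum of the field-level score d and method-level score t."""
    if not 0.0 <= k <= 1.0:
        raise ValueError("k must lie in [0, 1]")
    return k * d + (1.0 - k) * t


def pick_k(candidates, evaluate):
    """Return the candidate weight with the highest evaluation value.

    `evaluate` is a hypothetical callback, e.g. matching accuracy on a
    validation set when a given k is used.
    """
    return max(candidates, key=evaluate)


print(composite_score(0.8, 0.6))

# Toy objective used only to show the call shape.
print(pick_k([0.1, 0.3, 0.5, 0.7, 0.9], lambda k: -abs(k - 0.5)))
```

Because k multiplies D and (1-k) multiplies T, a value of 0.3 weights the method (technology) level more heavily than the field (direction) level.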
Step S10: Determining the review expert according to the ranking of all the comprehensive matching scores. Specifically, the calculated comprehensive matching scores are sorted in descending order, and the expert with the highest comprehensive matching score is selected as the review expert for the electric power science and technology project application form.
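The selection in step S10 can be sketched as a descending sort followed by taking the first entry. The function names and the expert identifiers below are hypothetical, and the scores are made-up illustration values.

```python
from typing import Dict, List, Tuple


def rank_experts(scores: Dict[str, float]) -> List[Tuple[str, float]]:
    """Return (expert, score) pairs sorted by comprehensive score, highest first."""
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


def recommend_expert(scores: Dict[str, float]) -> str:
    """Return the expert with the highest comprehensive matching score."""
    if not scores:
        raise ValueError("no candidate experts were scored")
    return rank_experts(scores)[0][0]


# Hypothetical comprehensive matching scores for three candidate experts.
candidates = {"expert_a": 0.9, "expert_b": 1.4, "expert_c": 1.1}
best = recommend_expert(candidates)
```

Keeping the full ranking, rather than only the maximum, would also allow more than one reviewer to be picked if that were needed.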
The expert review recommendation method based on the pictograph-semantic dual feature space mapping provided by the embodiment of the present application will be explained in detail through specific embodiments.
In a specific embodiment of the present application, for the electric power project text data, 2000 documents are selected from an electric power science and technology project application database as the corpus. The research topics mainly include high voltage and insulation technology, motors and electrical apparatus, and power systems and automation. The abstracts of the project application forms are segmented into words, stop words are removed, and the named entities are labeled. Because the method handles long sequences poorly, each abstract is split into sentences at full stops, while the ratio of sentences containing the required named entity labels to sentences not containing them in the preprocessed data set is kept at 8:1; the total number of sentences is about 10000.
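The sentence-splitting step can be sketched as follows. This assumes the Chinese full stop "。" is the delimiter, which the application does not state explicitly; the function name is hypothetical, and word segmentation and stop-word removal are left out.

```python
from typing import List


def split_sentences(abstract: str, delimiter: str = "。") -> List[str]:
    """Split an abstract into sentences at the given full-stop character.

    Empty fragments (for example after a trailing delimiter) are dropped,
    and the delimiter itself is not kept in the returned sentences.
    """
    return [part.strip() for part in abstract.split(delimiter) if part.strip()]


# Hypothetical two-sentence abstract.
sentences = split_sentences("采用脉冲电流法。构建等效电路模型。")
```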
For data set division, in the specific embodiment of the present application, the 10000 electric power project texts are divided into a training set, a validation set, and a test set in a ratio of 8:1:1.
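A minimal sketch of an 8:1:1 split is shown below. Shuffling with a fixed seed before splitting is an assumption made here for illustration; the application only gives the ratio. The function name is hypothetical.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def split_dataset(
    items: Sequence[T],
    ratios: Tuple[int, int, int] = (8, 1, 1),
    seed: int = 0,
) -> Tuple[List[T], List[T], List[T]]:
    """Shuffle the items and split them into train/validation/test parts."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # assumption: random split, fixed seed
    total = sum(ratios)
    n_train = len(shuffled) * ratios[0] // total
    n_val = len(shuffled) * ratios[1] // total
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # remainder goes to the test set
    return train, val, test


train, val, test = split_dataset(range(10000))
```

For 10000 items this gives 8000 training, 1000 validation, and 1000 test items.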
For data labeling, the embodiment of the present application adopts the classic BIO labeling scheme: for each entity, the first word is labeled "B-entity name" and each subsequent word is labeled "I-entity name"; words that do not belong to any entity required here are labeled "O".
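The BIO scheme described above can be illustrated with a short sketch that converts entity spans into per-token tags. The span format (start index, exclusive end index, entity name), the label "METHOD", and the example tokens are all hypothetical.

```python
from typing import List, Tuple


def bio_tags(num_tokens: int, entities: List[Tuple[int, int, str]]) -> List[str]:
    """Build BIO tags for a token sequence.

    entities holds (start, end_exclusive, name) spans over token indices.
    The first token of a span gets "B-name", the rest get "I-name",
    and every token outside all spans keeps the tag "O".
    """
    tags = ["O"] * num_tokens
    for start, end, name in entities:
        tags[start] = f"B-{name}"
        for index in range(start + 1, end):
            tags[index] = f"I-{name}"
    return tags


# Hypothetical segmented sentence with one method entity covering tokens 1..3.
tokens = ["采用", "脉冲", "电流", "法", "检测"]
tags = bio_tags(len(tokens), [(1, 4, "METHOD")])
```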
In the word embedding module based on hierarchical representation, in the specific embodiment of the application, a pre-trained RoBERTa model maps words into 1024-dimensional vectors, which are then fed into the training of the named entity recognition model BiLSTM + CRF.
Specifically, fig. 3(a) and fig. 3(b) are schematic diagrams of the entity recognition results for the use method entities and the related field entities of the application forms, respectively. As can be seen from fig. 3(a), reasonable use method entities such as the pulse current method, the equivalent circuit mathematical model, and ensemble learning are identified; as can be seen from fig. 3(b), related field entities such as transformer overhaul, power transformation project, and clean power sharing are identified. In summary, after the RoBERTa pre-training model is added to the BiLSTM + CRF model, the embodiment of the application can effectively extract the relevant entities from the electric power project text.
For the text data of the electric power experts, when searching for the expert-related entities, the specific embodiment of the application selects 8 laboratories under 3 major laboratories of a university's school of electrical engineering, and from each laboratory selects one professor (doctoral supervisor), one associate professor (doctoral supervisor), and one associate professor (master's supervisor), giving 24 experts for information extraction. The whole process is divided into skilled technical entity crawling and research direction entity crawling, as follows:
(1) Searching CNKI with the expert's name and school as keywords, crawling the abstracts of the expert's published articles, performing named entity recognition, and extracting the methods used in the published articles as the expert's skilled technical entities.
(2) Crawling the research directions from the expert's homepage using crawler technology, and searching for research direction entities after word segmentation (this part involves little work and is screened manually); the search results are taken as the expert's research direction entities. Partial screening results are shown in table 1.
Table 1 Entity screening results of expert data (partial)
As can be seen from table 1, the research direction entities from the expert homepages and the use method entities from the published papers are, to a certain extent, comparable with the two types of entities extracted from the electric power science and technology project application form, which provides a basis for the subsequent entity matching process.
Through the above processing, the preprocessing results of the two kinds of heterogeneous data, namely the electric power science and technology project application forms and the experts, are obtained as four types of entities that are related to one another. The embodiment of the application then uses pictograph-semantic dual-feature space matching to match the four types of entities; the specific matching process is shown in fig. 4.
As can be seen from fig. 4, to match these four types of entities, the embodiment of the present application uses stroke-based pictographic space mapping and sequence-information-based semantic space mapping to map the entities into feature vectors. In the matching process, the related field entities of the application form and the research direction entities of the experts are compared for similarity at one level, and the use method entities of the application form and the skilled technical entities of the experts are compared for similarity at another level. The specific process is as follows:
(1) Mapping the four types of entities into 512-dimensional feature vectors through the cw2vec model at the pictographic level and the RoBERTa model at the semantic level, respectively.
(2) Performing all-pairs Euclidean distance and cosine similarity calculations in the pictographic feature space and the semantic feature space, respectively, between the related field entities and the research direction entities and between the use method entities and the skilled technical entities, and taking the sum of the highest values of the two as the entity matching score.
(3) Weighting and combining the matching scores of the electric power science and technology project application form and the expert at the field (direction) level and the method (technology) level to obtain the comprehensive matching score. The matching score weight at the field (direction) level is set to 0.3, and the matching score weight at the method (technology) level is set to 0.7.
(4) Taking the expert with the highest comprehensive matching score as the review expert for the electric power science and technology project application form.
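The per-level matching in step (2) can be sketched as below. Two points are assumptions made here and are not specified in the text: each entity is represented as a pair (pictographic vector, semantic vector), and a Euclidean distance d is turned into a similarity with 1/(1+d) so that a higher value means a closer match. The function names and toy vectors are hypothetical; real embeddings would come from cw2vec and RoBERTa.

```python
import math
from itertools import product
from typing import List, Sequence, Tuple

Entity = Tuple[Sequence[float], Sequence[float]]  # (pictographic, semantic)


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def pictographic_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Assumed conversion of Euclidean distance d into a similarity 1/(1+d)."""
    return 1.0 / (1.0 + math.dist(u, v))


def level_score(project: List[Entity], expert: List[Entity]) -> float:
    """Sum of the best pictographic and best semantic match over all pairs."""
    pairs = list(product(project, expert))
    best_picto = max(pictographic_similarity(p[0], e[0]) for p, e in pairs)
    best_sem = max(cosine(p[1], e[1]) for p, e in pairs)
    return best_picto + best_sem


# Toy 2-dimensional vectors standing in for the 512-dimensional embeddings.
project_entities = [([1.0, 0.0], [0.0, 2.0])]
expert_entities = [([1.0, 0.0], [0.0, 1.0]), ([4.0, 4.0], [1.0, 0.0])]
score = level_score(project_entities, expert_entities)
```

Running `level_score` once for the field (direction) level and once for the method (technology) level would give the two scores that step (3) combines with weights 0.3 and 0.7.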
In order to verify the effectiveness of the pictograph-semantic dual-feature space mapping matching algorithm, three groups of comparison experiments are performed in the specific embodiment of the application: semantic space mapping + cosine similarity matching, pictographic space mapping + cosine similarity matching, and pictograph-semantic dual-feature space mapping + cosine similarity matching. The matching results for the power project entities and the power expert entities are shown in fig. 5.
As can be seen from fig. 5, the embodiment of the present application implements heterogeneous data preprocessing with multi-scale representation learning for the electric power science and technology project application forms and the associated electric power experts, and the accuracy of matching 2000 electric power project documents against 24 electric power expert texts reaches a maximum of 0.85. The result shows that the pictographic space and the semantic space capture the pictographic-level and semantic-level information of words respectively, that the two feature spaces are strongly complementary, and that entity matching using both is more complete than matching on entity information mapped through a single feature space.
In summary, compared with the prior art, the embodiment of the present application has the following features:
(1) Using the ideas of named entity recognition and entity matching on heterogeneous data, an end-to-end matching method is achieved, and the whole process requires no manual participation.
(2) The pre-trained RoBERTa model is introduced into the training of the named entity recognition model BiLSTM + CRF, which greatly improves training efficiency and accuracy.
(3) When the entities are matched, the idea of pictograph-semantic dual-feature space matching is introduced, achieving a more accurate matching effect.
(4) The method generalizes well and can be used for expert recommendation for application forms in other industries, provided that the corresponding documents are available.
The application provides a review expert recommendation method based on pictograph-semantic dual-feature space mapping, which specifically comprises the following steps:
Acquiring abstract information of an electric power science and technology project application form.
Performing named entity recognition on the abstract information of the electric power science and technology project application form to obtain an electric power project entity, wherein the electric power project entity comprises a use method entity and a related field entity.
Crawling the personal homepage information of the power expert and the abstract information of published papers.
Performing named entity recognition on the personal homepage information of the power expert and the abstract information of the published papers to obtain a power expert entity, wherein the power expert entity comprises a skilled technical entity and a research direction entity.
Performing pictographic mapping on the use method entity to obtain a pictographic use method entity, and performing pictographic mapping on the related field entity to obtain a pictographic related field entity.
Performing pictographic mapping on the skilled technical entity to obtain a pictographic skilled technical entity, and performing pictographic mapping on the research direction entity to obtain a pictographic research direction entity.
Performing semantic mapping on the pictographic use method entity to obtain a use method feature vector, and performing semantic mapping on the pictographic related field entity to obtain a related field feature vector.
Performing semantic mapping on the pictographic skilled technical entity to obtain a skilled technical feature vector, and performing semantic mapping on the pictographic research direction entity to obtain a research direction feature vector.
Calculating a comprehensive matching score according to the use method feature vector, the related field feature vector, the skilled technical feature vector, and the research direction feature vector.
Determining the review expert according to the ranking of all the comprehensive matching scores.
According to the above technical scheme, the review expert recommendation method based on pictograph-semantic dual-feature space mapping first uses the RoBERTa pre-training model to produce hierarchical representations of the texts, and then uses the BiLSTM + CRF model to perform named entity recognition on the electric power project texts and the electric power expert texts. The named entities are then mapped into feature vectors through the pictograph-semantic dual feature space, Euclidean distance and cosine similarity calculations are performed on the feature vectors to obtain the matching scores, the matching scores are combined by weighted summation into a comprehensive matching score, and finally the expert with the highest comprehensive matching score is taken as the review expert for the electric power project text. The application thus provides a strategy for matching project texts with domain expert entities based on pictograph-semantic dual-feature space mapping, which matches projects to domain experts effectively and accurately, thereby reducing the labor cost of review work, enhancing the reliability of review results, and improving overall review efficiency.
The present application has been described in detail with reference to specific embodiments and illustrative examples, but the description is not intended to limit the application. Those skilled in the art will appreciate that various equivalent substitutions, modifications or improvements may be made to the presently disclosed embodiments and implementations thereof without departing from the spirit and scope of the present disclosure, and these fall within the scope of the present disclosure. The protection scope of this application is subject to the appended claims.
Claims (10)
1. A review expert recommendation method based on pictograph-semantic dual-feature space mapping is characterized by comprising the following steps:
acquiring abstract information of an electric power science and technology project application form;
carrying out named entity identification on the abstract information of the electric power project application form to obtain an electric power project entity, wherein the electric power project entity comprises a use method entity and a related field entity;
crawling the personal homepage information of the power expert and the abstract information of published papers;
carrying out named entity recognition on the personal homepage information of the power expert and the abstract information of the published paper to obtain a power expert entity, wherein the power expert entity comprises a skilled technical entity and a research direction entity;
pictographic mapping is carried out on the use method entity to obtain a pictographic use method entity, and pictographic mapping is carried out on the related field entity to obtain a pictographic related field entity;
pictographic mapping is carried out on the skilled technical entity to obtain a pictographic skilled technical entity, and pictographic mapping is carried out on the research direction entity to obtain a pictographic research direction entity;
semantic mapping is carried out on the pictograph use method entity to obtain a use method feature vector, and semantic mapping is carried out on the pictograph related field entity to obtain a related field feature vector;
semantic mapping is carried out on the pictographic skilled technical entity to obtain a skilled technical feature vector, and semantic mapping is carried out on the pictographic research direction entity to obtain a research direction feature vector;
calculating to obtain a comprehensive matching score according to the use method feature vector, the related field feature vector, the skilled technical feature vector and the research direction feature vector;
and determining the review experts according to the level of all the comprehensive matching scores.
2. The expert review recommendation method based on pictographic-semantic bi-feature space mapping of claim 1, wherein named entity recognition is performed using RoBERTa pre-training model and BiLSTM + CRF model.
3. The expert review recommendation method based on pictograph-semantic dual feature space mapping as claimed in claim 2, wherein the specific method for named entity recognition using RoBERTa pre-training model and BiLSTM + CRF model is:
acquiring text information;
segmenting words of the text information to obtain a word set;
vector mapping is carried out on the word set by using a RoBERTa pre-training model to obtain a word vector set;
and training the word vector set by using a BiLSTM + CRF model to obtain the named entity of the text information.
4. The expert review recommendation method based on pictograph-semantic dual feature space mapping as claimed in claim 3, wherein the specific method for obtaining the research direction entity and the skilled technical entity is:
acquiring keywords of evaluation expert information;
crawling expert personal homepage information and abstract information of published papers according to the keywords;
carrying out named entity recognition on the expert personal homepage information according to a RoBERTa pre-training model and a BiLSTM + CRF model to obtain the research direction entity;
and carrying out named entity recognition on the abstract information of the published paper according to a RoBERTa pre-training model and a BiLSTM + CRF model to obtain the skilled technical entity.
5. The expert review recommendation method based on the pictograph-semantic dual feature space mapping as claimed in claim 1, wherein the specific method for obtaining the comprehensive matching score by calculation is as follows:
performing Euclidean distance calculation on the related field characteristic vector and the research direction characteristic vector in a pictographic characteristic space to obtain a first pictographic matching score;
cosine similarity calculation is carried out on the related field characteristic vector and the research direction characteristic vector in a semantic characteristic space, and a first semantic matching score is obtained;
performing Euclidean distance calculation on the using method characteristic vector and the researching method characteristic vector in a pictographic characteristic space to obtain a second pictographic matching score;
cosine similarity calculation is carried out on the using method characteristic vector and the researching method characteristic vector in a semantic characteristic space, and a second semantic matching score is obtained;
summing the first pictographic matching score and the first semantic matching score to obtain a research direction matching score;
summing the second pictographic matching score and the second semantic matching score to obtain a research method matching score;
and carrying out weighted summation on the research direction matching score and the research method matching score to obtain a comprehensive matching score.
6. The expert review recommendation method based on the pictograph-semantic dual feature space mapping according to claim 5, characterized in that the Euclidean distance calculation is performed by adopting the following method:
wherein D is the entity similarity score at the domain (direction) level, F is the set of related field entities, R is the set of research direction entities, and the four embedding terms denote, in order, the pictographic-space embedding of an entity in the F set, the pictographic-space embedding of an entity in the R set, the semantic-space embedding of an entity in the F set, and the semantic-space embedding of an entity in the R set.
7. The expert review recommendation method based on pictograph-semantic dual feature space mapping according to claim 6, characterized in that the cosine similarity calculation is performed by adopting the following method:
wherein T is the entity similarity score at the method (technology) level, O is the set of use method entities, L is the set of skilled technical entities, and the four embedding terms denote, in order, the pictographic-space embedding of an entity in the O set, the pictographic-space embedding of an entity in the L set, the semantic-space embedding of an entity in the O set, and the semantic-space embedding of an entity in the L set.
8. The expert review recommendation method based on pictograph-semantic dual feature space mapping according to claim 7, characterized in that the following method is adopted to perform the comprehensive matching score calculation:
score=k×D+(1-k)×T
where score is the composite match score and k is the weight.
9. The expert review recommendation method based on pictograph-semantic dual feature space mapping as claimed in claim 8 wherein a greedy algorithm is used to calculate the k value.
10. The expert review recommendation method based on pictographic-semantic bi-feature space mapping as claimed in claim 9 wherein the k value is set to 0.3.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110913345.6A CN113569575B (en) | 2021-08-10 | 2021-08-10 | Evaluation expert recommendation method based on pictographic-semantic dual-feature space mapping |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113569575A true CN113569575A (en) | 2021-10-29 |
CN113569575B CN113569575B (en) | 2024-02-09 |
Family
ID=78171076
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110913345.6A Active CN113569575B (en) | 2021-08-10 | 2021-08-10 | Evaluation expert recommendation method based on pictographic-semantic dual-feature space mapping |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113569575B (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117093670A (en) * | 2023-07-18 | 2023-11-21 | 北京智信佳科技有限公司 | Method for realizing intelligent recommending expert in paper |
Also Published As
Publication number | Publication date |
---|---|
CN113569575B (en) | 2024-02-09 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||