CN112418695A - Multi-dimensional portrait construction method and recommendation method for scientific researchers in tobacco field

Info

Publication number
CN112418695A
Authority
CN
China
Legal status
Pending
Application number
CN202011362431.4A
Other languages
Chinese (zh)
Inventor
王永胜
郑新章
冯伟华
刘亚丽
王锐
贾楠
郑路
宗国浩
王迪
洪群业
邱纪青
Current Assignee
Zhengzhou Tobacco Research Institute of CNTC
Original Assignee
Zhengzhou Tobacco Research Institute of CNTC

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00 Administration; Management
    • G06Q10/06 Resources, workflows, human or project management; Enterprise or organisation planning; Enterprise or organisation modelling
    • G06Q10/063 Operations research, analysis or management
    • G06Q10/0639 Performance analysis of employees; Performance analysis of enterprise or organisation operations
    • G06Q10/06393 Score-carding, benchmarking or key performance indicator [KPI] analysis
    • G06Q10/10 Office automation; Time management
    • G06Q10/105 Human resources
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G06F16/3344 Query execution using natural language analysis
    • G06F16/35 Clustering; Classification
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/216 Parsing using statistical methods
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking

Abstract

The invention discloses a multi-dimensional portrait construction method and a recommendation method for scientific researchers in the tobacco field, comprising the following steps: 1) acquiring tobacco scientific research data, establishing the association between scientific researchers and the research results they participated in, and marking each researcher's rank in those results; 2) fusing duplicate data among the researchers and the research results; 3) establishing a basic attribute dimension of the researchers based on the tobacco scientific research data processed in step 2); 4) establishing an academic influence dimension of the researchers based on the researchers and their related result data; 5) establishing a scientific research relationship dimension of the researchers based on the tobacco scientific research data processed in step 2); 6) classifying the research results of each researcher and establishing a scientific research achievement dimension; 7) establishing a scientific research interest dimension of the researchers from each of their research results; 8) visualizing the researchers according to the dimension information to obtain a multi-dimensional portrait of the researchers.

Description

Multi-dimensional portrait construction method and recommendation method for scientific researchers in tobacco field
Technical Field
The invention relates to a tobacco field-oriented multi-dimensional portrait and visual presentation method and a recommendation method for scientific researchers, and belongs to the technical field of data processing.
Background
A user portrait is a user model constructed by extracting and presenting information such as a user's attribute features, interests, and social relationships. Scientific researchers are the main actors in scientific activity, and information about them is an important knowledge resource. A portrait of researchers in the tobacco field focuses on their research-dimension information and emphasizes content such as academic attribute features, academic research interests, academic achievements, research cooperation relationships, and academic influence within the tobacco field. Building such a portrait involves structured processing of academic information, deduplication of academic information (for example, of authors and of achievements), construction of a stop-word lexicon specific to the tobacco field, mining of academic research interests, recommendation of similar academic groups, and analysis of academic cooperation relationships. Scientific research data in the tobacco field are scattered across different management systems. Fusing these data to build a unified data management platform, and depicting and presenting in depth the research achievements of tobacco researchers in specific research fields, is an important application scenario for deep mining and analysis of tobacco data. With a multi-dimensional portrait model of researchers in the tobacco field, other researchers can quickly find and follow the work of influential researchers in the field.
Through visualization of the portrait information, other personnel can understand the research of the researchers they are interested in more intuitively, which improves the user experience of the tobacco information service platform. To this end, the research information of researchers in the tobacco field is described with multi-dimensional portrait technology, and the user's information acquisition experience is improved with visualization methods and techniques.
At present, a series of major science and technology special projects and science and technology projects have been carried out in tobacco research through scientific and technological innovation, producing a series of important research papers, patents, monographs, and the like; as tobacco data continue to accumulate, a large number of important research achievements have emerged. This data output provides an important basis for deeply mining knowledge about researchers and constructing multi-dimensional portraits of tobacco researchers from tobacco research data.
Disclosure of Invention
The invention aims to provide a portrait construction method, a visual presentation method, and a device for scientific researchers in the tobacco field, which solve the problems of accurately describing the academic information of researchers in the tobacco field and visually presenting that information, thereby improving the user experience of a tobacco-field knowledge service platform.
The academic achievements of researchers in the tobacco field take the form of papers, monographs, patents, standards, scientific and technological achievements, and the like. All of these are important expressions of a researcher's academic achievement, so establishing an academic influence index system that covers every achievement form is one of the important innovations of the invention. Meanwhile, to depict in depth the research interests or research directions of tobacco researchers, representative academic keywords must be extracted from the titles, keywords, and abstracts in the achievement data. After word segmentation of the achievement text, a stop-word lexicon for the tobacco research field is constructed according to the characteristics of the vocabulary used in tobacco research, and the stop words are removed from the segmented text, so that the keywords that best reflect research characteristics in the tobacco field are presented; this is also an important innovation of the method. In addition, a researcher's cooperation data represent the data generated by cooperating with others, with relationship strength calculated from the number of co-authored papers, and are obtained by analyzing the research achievement information.
In order to achieve the above object, the scheme of the invention comprises:
The invention discloses a portrait construction method for tobacco scientific researchers, which comprises the following steps:
1) Acquire the raw tobacco research data. The raw data come from the various heterogeneous research systems in the tobacco industry and consist mainly of the basic attribute data of researchers and the research achievement data. The basic attribute data include each researcher's name, gender, organization, research direction, contact telephone number, email address, and the like. In the achievement data, the achievement types include papers, patents, standards, monographs, and scientific and technological achievements, and each achievement record includes the title, keywords, abstract, publication date, authors, and the like. Between the researcher data and the achievement data, establish the association information and the participation ranking information of the researchers who took part in each achievement. The data actually obtained depend on what the multi-source heterogeneous systems contain.
2) Fuse the duplicate data of researchers and research achievements based on the data in step 1). The deduplication rules use the following as judgment conditions for eliminating duplicate researcher data: whether the researchers have the same name, organization, contact telephone number, email address, and so on; whether researchers with the same name have a cooperative relationship with the same researchers; and whether researchers with the same name have the same research achievements. For research achievements, the title, authors, date, and the like serve as the main judgment conditions for deduplication.
3) Establish the basic attribute dimension of the researchers based on the data processed in step 2), where the attribute information comprises name, gender, organization, research direction, and research achievements.
4) Establish the academic influence dimension of the researchers based on the researchers and their related achievement data from step 3). The academic output influence of a researcher is obtained through a purpose-built researcher influence index system model. Output influence is divided into 5 dimensions: papers, monographs, patents, standards, and scientific and technological achievements. The output influence value of each achievement dimension is calculated according to the constructed influence index system, and the resulting values for each achievement type are presented visually in a radar map. By constructing and visualizing the academic influence dimension, the academic influence that a researcher has built through research in the tobacco field can be seen intuitively.
5) Establish the scientific research relationship dimension of the researchers. Based on the data processed in step 2), obtain the participants of each research achievement and build a cooperation network of researchers, in which relationships are established from the researchers' cooperation on achievements. Two researchers who participate in the same academic achievement are counted as having cooperated once, and two researchers who jointly participate in M achievements are counted as having cooperated M times. Some influential researchers have very large cooperation networks, and partners with relatively few cooperations interfere with the visual presentation of the cooperation relationship dimension. To reduce the appearance of unimportant or incidental cooperation relationships, the invention screens the cooperation data: for each researcher, partners whose cooperation frequency is below a set threshold are deleted, and the remaining partners with significant cooperation are retained as that researcher's scientific research relationship dimension.
6) Establish the scientific research achievement dimension of the researchers. This dimension mainly presents a researcher's academic achievements: based on the data acquired in step 3), the method classifies the achievements of an individual researcher and presents them visually in forms such as flow diagrams.
7) Establish the scientific research interest dimension of the researchers. Based on the researcher and achievement data obtained in steps 1), 2), and 3): obtain the title, keywords, and abstract of each research achievement; segment the text with a word segmentation toolkit; construct a tobacco-field stop-word lexicon according to the characteristics of stop words in tobacco research, and delete the stop words; obtain research interest keyword labels for each researcher through an LDA model; and present the research interest dimension through word clouds and an annual academic keyword evolution diagram.
8) Recommend researchers with similar research interests. Based on the keywords and weight information of the research interest dimension obtained in step 7), a time-based forgetting factor is integrated into the keyword weights, and a cosine similarity formula is used to obtain the top Q researchers whose research interests are most similar to those of a given researcher.
9) Visualize the dimension information of the research portrait in forms such as radar maps, flow diagrams, and relationship maps.
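The recommendation in step 8) can be illustrated with a minimal Python sketch. The text here does not give the form of the forgetting factor, so the exponential decay `exp(-decay * age)` and the value of `decay` are assumptions, as are the function names and the sample data; only the use of keyword weights, a time-based forgetting factor, cosine similarity, and a top-Q cut-off comes from the description.

```python
import math

def decayed_weights(keyword_records, current_year, decay=0.2):
    """Build a keyword -> weight profile from (keyword, weight, year) records.

    Assumption: the forgetting factor is exp(-decay * age_in_years);
    the source text does not specify its form.
    """
    profile = {}
    for keyword, weight, year in keyword_records:
        age = max(current_year - year, 0)
        profile[keyword] = profile.get(keyword, 0.0) + weight * math.exp(-decay * age)
    return profile

def cosine_similarity(a, b):
    """Cosine similarity between two sparse keyword-weight dictionaries."""
    dot = sum(w * b[k] for k, w in a.items() if k in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def recommend_similar(target, profiles, q):
    """Return the top-q (researcher, similarity) pairs for the target researcher."""
    scores = [(other, cosine_similarity(profiles[target], vec))
              for other, vec in profiles.items() if other != target]
    scores.sort(key=lambda item: item[1], reverse=True)
    return scores[:q]

# Hypothetical keyword records: (keyword, weight, year of the achievement).
records = {
    "A": [("aroma", 0.6, 2020), ("nicotine", 0.4, 2019)],
    "B": [("aroma", 0.5, 2020), ("filter", 0.5, 2012)],
    "C": [("curing", 1.0, 2018)],
}
profiles = {name: decayed_weights(recs, current_year=2020)
            for name, recs in records.items()}
top = recommend_similar("A", profiles, q=2)
```

In this sketch older keywords contribute less to a profile, so two researchers who shared an interest only in the distant past score lower than two who share a recent one.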
Constructing a portrait of tobacco researchers involves personnel attribute information, cooperation relationship information, achievement information, influence information, research interest information, and the like. If a portrait model built from these dimensions is presented only as text, it is difficult to give other researchers a researcher's academic information quickly and intuitively, so visualizing the portrait information is also a key problem in portrait construction. In an era of explosive information growth, graphical information is often more intuitive and easier to understand than text.
To address these problems, the invention processes the data underlying the multi-dimensional portrait information of tobacco researchers, constructs a stop-word lexicon specific to the tobacco field, and builds the multi-dimensional portrait. A suitable visualization scheme is selected for each portrait dimension, giving users a better way to acquire information quickly and improving the user experience.
For the influence dimension, the invention constructs a research influence model and uses a radar map to present the academic influence of a tobacco researcher across the various types of research achievements.
For the scientific research relationship dimension, the cooperation of a researcher is presented subject to a minimum cooperation frequency, so that only valuable, important cooperation relationships are shown.
Further, the researcher cooperation network in step 5) is filtered, where the filtering includes setting to 0 the cooperation frequency of any pair of researchers whose cooperation frequency is below a set value.
Further, the cooperation frequency set value N is given by:
[Formula image BDA0002804376880000041, not reproduced in this text]
where N_max is the highest cooperation frequency of the researcher in the cooperation data.
The cooperation frequency threshold N is defined according to Price's law. Determining the retained cooperation relationships with this formula is intended to reduce the interference of unimportant cooperation data in the visual presentation, show users the valuable cooperation relationships, and optimize the visual presentation of the cooperation network. Inspection of the visual results on actual tobacco research data indicates that this filtering standard is sound and effective.
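The counting and filtering of cooperation relationships can be sketched in Python as follows. The threshold formula itself is an image that is not available in this text; the sketch assumes the form in which Price's law is commonly applied in bibliometrics, N = 0.749 × sqrt(N_max), and that assumption (along with all function names and the sample data) should be checked against the original formula image.

```python
import math
from collections import Counter
from itertools import combinations

# Assumption: the constant usually quoted with Price's law in bibliometrics.
# The patent's own formula image is not reproduced in the text.
PRICE_COEFFICIENT = 0.749

def count_cooperations(achievement_authors):
    """Count, for every pair of researchers, the achievements they share.

    Two researchers on the same achievement count as 1 cooperation;
    M shared achievements count as M cooperations.
    """
    counts = Counter()
    for authors in achievement_authors.values():
        for a, b in combinations(sorted(set(authors)), 2):
            counts[(a, b)] += 1
    return counts

def partner_counts(counts, researcher):
    """Collect the cooperation frequency of each partner of one researcher."""
    partners = {}
    for (a, b), n in counts.items():
        if a == researcher:
            partners[b] = n
        elif b == researcher:
            partners[a] = n
    return partners

def filter_partners(partners):
    """Drop partners whose cooperation frequency is below the threshold N."""
    if not partners:
        return {}
    n_max = max(partners.values())
    threshold = PRICE_COEFFICIENT * math.sqrt(n_max)
    return {p: n for p, n in partners.items() if n >= threshold}

# Hypothetical achievement -> participant lists.
achievements = {
    "paper1": ["R1", "R2", "R3"],
    "paper2": ["R1", "R2"],
    "paper3": ["R1", "R2"],
    "paper4": ["R1", "R2", "R4"],
    "patent1": ["R1", "R3"],
}
counts = count_cooperations(achievements)
kept = filter_partners(partner_counts(counts, "R1"))
```

With these sample data R1 has cooperated 4 times with R2, twice with R3, and once with R4, so under the assumed formula the threshold is about 1.5 and only R2 and R3 are kept.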
For the dimension of the scientific research results, the invention mainly presents the change of the research results of the scientific researchers along with the time and the distribution condition of the research results in a flow diagram and a relation map mode.
For the scientific research interest dimension, research interest keywords are extracted through the LDA model and associated with the corresponding year. The evolution of a researcher's interests is presented through a word cloud and a visualization of annual academic keywords over time.
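The stop-word filtering and per-year grouping of keywords can be sketched as below. This is a simplified stand-in, not the LDA model named in the text: it ranks keywords by plain term frequency so that the example stays self-contained. It also assumes the text has already been segmented into tokens, and the stop-word set, function name, and sample data are hypothetical.

```python
from collections import Counter, defaultdict

# Hypothetical tobacco-field stop words: terms too common in tobacco
# research texts to distinguish one researcher's interests from another's.
TOBACCO_STOP_WORDS = {"tobacco", "study", "method", "analysis"}

def yearly_keywords(achievements, stop_words, top_n=3):
    """Return {year: [top keywords]} after removing stop words.

    Assumption: keywords are ranked by frequency here as a stand-in
    for the LDA-based extraction described in the text.
    """
    by_year = defaultdict(Counter)
    for item in achievements:
        tokens = [t for t in item["tokens"] if t not in stop_words]
        by_year[item["year"]].update(tokens)
    return {year: [word for word, _ in counter.most_common(top_n)]
            for year, counter in sorted(by_year.items())}

# Hypothetical, already-segmented achievement texts for one researcher.
sample = [
    {"year": 2018, "tokens": ["tobacco", "aroma", "aroma", "pyrolysis", "study"]},
    {"year": 2019, "tokens": ["tobacco", "nicotine", "nicotine", "nicotine",
                              "aroma", "analysis"]},
]
evolution = yearly_keywords(sample, TOBACCO_STOP_WORDS)
```

The resulting year-to-keywords mapping is the kind of data an annual keyword evolution diagram or a per-year word cloud would be drawn from.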
The invention relates to a device for constructing a multi-dimensional portrait and a visualization of scientific researchers in the tobacco field, which comprises a processor and a memory, wherein the processor executes instructions stored in the memory so as to realize the method for constructing the portrait and the visualization of the scientific researchers.
Compared with the prior art, the invention has the following positive effects:
Scientific research in the tobacco field tends to involve multi-person cooperation. Taking this into account when processing duplicate researcher and achievement data, the invention not only formulates judgment rules based on the basic attributes of personnel but also incorporates personnel cooperation relationships into the data merging rules, so that deduplication is completed quickly and effectively. A cooperation frequency threshold N is determined based on Price's law, so that only significant cooperation is presented when a researcher has too many partners. A corresponding index system is formulated for the 5 types of research output of tobacco researchers and an influence model is constructed, so that their academic influence can be depicted more comprehensively. In extracting research interest keywords, a stop-word lexicon specific to the tobacco field is constructed according to the characteristics of tobacco research data, the stop words are filtered out, and the important keywords are extracted. Research interest similarity is calculated from research interest keywords whose weights incorporate the forgetting factor, which keeps the recommendation of researchers timely. Through visualization schemes such as radar maps and flow diagrams, the corresponding dimension information of researchers is presented, improving the visual and usage experience of research users.
Tests on an actual platform show that the scheme of the invention is practical and effective, and that it has innovative value for improving the presentation of researchers' research information in each dimension.
Drawings
FIG. 1 is a flow chart of a representation and visualization method for researchers in the tobacco field of the present invention;
FIG. 2 is a model for impact calculation by a researcher;
FIG. 3 is a visual chart of the influence of various research results of the researchers;
FIG. 4 is a current scientific collaboration diagram of a researcher;
FIG. 5 is an annual distribution of academic achievements;
FIG. 6 is a map of achievements by a researcher;
FIG. 7 is a chart of the annual academic keywords of a researcher.
Detailed Description
The present invention will be described in further detail with reference to the accompanying drawings.
The method comprises the following steps:
The invention discloses a construction method for the portrait and visualization of scientific researchers, described here mainly with data from the field of tobacco research. The method flow is shown in fig. 1 and comprises the following steps:
S1. Acquire the raw portrait data of tobacco researchers. The raw tobacco research data come from the various heterogeneous research systems in the tobacco industry and consist mainly of researcher data and research achievement data. The researcher data include each researcher's name, gender, organization, research direction, contact telephone number, email address, and the like. In the achievement data, the achievement types include papers, patents, standards, monographs, and scientific and technological achievements, and each achievement record includes the title, keywords, abstract, and the like. Between the researcher data and the achievement data, the association information and the participation ranking information of the researchers who took part in each achievement are established. The data actually obtained depend on what the multi-source heterogeneous systems contain.
S2. Extract and fuse the raw research data based on metadata. For tobacco research data, the invention establishes metadata templates for researchers, for research achievements, and for the relationship between researchers and achievements, shown in tables 1, 2 and 3.
TABLE 1 Core metadata template for the basic attributes of the researcher entity
Serial number | Attribute name | Type | Description
1 | CODE_R | long | Person ID
2 | NAME | text | Name
3 | GENDER | text | Gender
4 | INSTITUTION | text | Name of organization
5 | DOMAIN | text | Research direction
6 | INFLUENCE | long | Academic influence
7 | TEL | text | Contact telephone
8 | EMAIL | text | Email address
TABLE 2 Core metadata template for the research achievement entity
Serial number | Attribute name | Type | Description
1 | CODE_A | long | Achievement ID
2 | TYPE | text | Achievement type
3 | TITLE | text | Title
4 | DATE | int | Publication date
5 | KEYWORD | text | Keywords
6 | ABSTRACT | text | Abstract
7 | FULLTEXT | text | Full-text link
TABLE 3 Core metadata template for the relationship between researchers and achievements
Serial number | Attribute name | Type | Description
1 | CODE_R | long | Person ID
2 | CODE_A | long | Achievement ID
3 | PARTICIPANT | text | Participation rank
Different data sources are analyzed, and the data are stored according to the templates to form the raw data. An extraction template is formulated for each data source according to the core metadata, and the raw data are then acquired through these extraction templates.
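As an illustration, the three metadata templates can be mirrored as record types. This is a sketch under assumptions: the attribute names follow tables 1 to 3, but the Python class names, the lower-case field names, the choice of which fields are optional, and the sample values are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Researcher:
    """Table 1: basic attributes of the researcher entity."""
    code_r: int                        # CODE_R, person ID
    name: str                          # NAME
    gender: Optional[str] = None       # GENDER
    institution: Optional[str] = None  # INSTITUTION
    domain: Optional[str] = None       # DOMAIN, research direction
    influence: Optional[int] = None    # INFLUENCE, academic influence
    tel: Optional[str] = None          # TEL
    email: Optional[str] = None        # EMAIL

@dataclass
class Achievement:
    """Table 2: research achievement entity."""
    code_a: int                        # CODE_A, achievement ID
    achievement_type: str              # TYPE (paper, patent, standard, ...)
    title: str                         # TITLE
    date: Optional[int] = None         # DATE, stored as an integer
    keyword: Optional[str] = None      # KEYWORD
    abstract: Optional[str] = None     # ABSTRACT
    fulltext: Optional[str] = None     # FULLTEXT, full-text link

@dataclass
class Participation:
    """Table 3: link between a researcher and an achievement."""
    code_r: int                        # CODE_R
    code_a: int                        # CODE_A
    participant: str                   # PARTICIPANT, stored as text

# Hypothetical records extracted from one data source.
researcher = Researcher(code_r=1, name="Researcher A", institution="Institute X")
achievement = Achievement(code_a=10, achievement_type="paper",
                          title="Example title", date=2020)
link = Participation(code_r=researcher.code_r, code_a=achievement.code_a,
                     participant="1")
```

Fields that a given source system does not supply are left as `None` here, so that they can be supplemented from other sources during fusion.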
Because the data sources are varied, the record formats and content fields of the raw data extracted from them differ, and the extracted researchers and achievements may be duplicated. The data from different sources are therefore integrated according to data fusion rules: fields are supplemented and completed, and duplicate data are removed. To ensure fusion quality, fusion processing rules for the researcher and achievement entities are formulated as shown in table 4. Inspection on actual data indicates that the method is fast and effective and helps ensure the deduplication quality of researcher and achievement entity knowledge as far as possible.
Further, researcher data are deduplicated with 3 rules. Rule 1: if two researchers with the same name have the same organization, telephone number, or email address, they are determined to be the same person. Rule 2: if two researchers with the same name have more than 1 associated research achievement in common, they are determined to be the same person. Rule 3: if two researchers with the same name have more than 1 collaborator in common, they are determined to be the same person.
Further, research achievements are deduplicated by treating achievements with the same authors, achievement name, and date as duplicates.
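The three researcher rules and the achievement rule can be sketched in Python as follows. The record layout, function names, and sample data are assumptions for illustration. The text says "more than 1" shared achievement or collaborator; the sketch reads that literally as at least 2 and exposes it as the parameter `min_shared`, since the intended reading may instead be "at least 1".

```python
def same_researcher(a, b, min_shared=2):
    """Decide whether two researcher records describe the same person.

    Assumption: min_shared=2 is the literal reading of "more than 1";
    set it to 1 if "one or more" is the intended meaning.
    """
    if a["name"] != b["name"]:
        return False
    # Rule 1: same organization, telephone number, or email address.
    if any(a.get(f) and a.get(f) == b.get(f)
           for f in ("institution", "tel", "email")):
        return True
    # Rule 2: enough associated achievements in common.
    if len(a["achievements"] & b["achievements"]) >= min_shared:
        return True
    # Rule 3: enough collaborators in common.
    return len(a["collaborators"] & b["collaborators"]) >= min_shared

def dedupe_achievements(achievements):
    """Keep the first record for each (title, authors, date) combination."""
    seen = set()
    unique = []
    for item in achievements:
        key = (item["title"].strip().lower(), tuple(item["authors"]), item["date"])
        if key not in seen:
            seen.add(key)
            unique.append(item)
    return unique

# Hypothetical records from different source systems.
r1 = {"name": "Li Ming", "institution": "Institute A",
      "achievements": {"a1", "a2"}, "collaborators": {"p1"}}
r2 = {"name": "Li Ming", "institution": "Institute B",
      "achievements": {"a1", "a2", "a3"}, "collaborators": set()}
r3 = {"name": "Li Ming", "institution": "Institute C",
      "achievements": {"a9"}, "collaborators": {"p7"}}

papers = [
    {"title": "Example Title", "authors": ["Li Ming", "Wu Fang"], "date": 2019},
    {"title": "example title ", "authors": ["Li Ming", "Wu Fang"], "date": 2019},
    {"title": "Another Title", "authors": ["Li Ming"], "date": 2020},
]
unique_papers = dedupe_achievements(papers)
```

Here `r1` and `r2` are merged under rule 2 because they share two achievements, while `r3` matches neither. The title normalization (trimming and lower-casing) is a design choice of the sketch, not something the text specifies.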
Fusion of the research knowledge data for the tobacco cooperation knowledge graph adopts a model-driven method: an ETL data fusion language is generated through configuration and is then parsed and executed by a parsing engine. Finally, the ETL data fusion language is mapped to Spark SQL for execution, completing the fusion of the raw data.
Table 4 tobacco researchers and scientific achievements data fusion rule
Figure BDA0002804376880000071
S3. Construct the basic attribute dimension of tobacco researchers. Based on the data and associations in tables 1, 2, and 3 from S2, the attribute information (name, gender, organization, and research direction), research achievements, and the like of a given tobacco researcher are obtained, and the basic attribute dimension is presented in text form.
And S4, constructing an influence model of scientific research personnel. Influence model construction of scientific researchers in the tobacco field is usually constructed through citation data, but is generally limited by the influence of achievement types, citation data loss and data collection comprehensiveness, and the fact that an integrated model is difficult to construct covers academic influence of all types of research achievements of tobacco scientific researchers. By combining the characteristics of data in the tobacco field and based on scientific researchers and scientific research result data in S3, the invention constructs an influence model of the scientific researchers, and the model adopts a method of respectively considering the output of different scientific research results and formulating different index systems to measure the academic achievement and influence of one scientific research worker in the tobacco field. The scientific research personnel influence model is shown in fig. 2, corresponding weights are set for achievements of tobacco scientific research personnel, and the influence of corresponding types of achievements is calculated according to the weights and the index values. The influence of personnel is generally established by the quoting times of achievements, but not all achievements have quoting data. The influence calculation is based on the scientific research output of corresponding scientific research personnel in the tobacco scientific research field, specifically comprises a secondary index system which is respectively a scientific research output type and a scientific research output level corresponding to the scientific research output type, and also has a third-level index for partial types of scientific research outputs, such as corresponding prize winning levels (first-class prizes, second-class prizes and the like) in scientific and technological achievements. 
The invention is not limited to the scientific research output types and the division of more indexes. In this embodiment, the final-stage indexes of each type of output are scored, specifically, the final-stage indexes of each type of output (i.e., the output scores of each type of output) are obtained according to the basic scores of the output types and the additional scores of the corresponding second-stage and third-stage indexes, where the final-stage indexes are second-stage indexes (corresponding to patents, treatises, standards, and monographs in fig. 2) and the final-stage indexes are third-stage indexes (corresponding to scientific and technological achievements in fig. 2). Each scientific researcher multiplies the yield score of a scientific yield by a score weight based on the score of the yield, the score weight being related to the amount of contribution of the scientific researcher in the yield, which may be determined by the signature order of the yields or from the original record of the yields, such as a work log. The final influence score of the scientific research personnel is obtained by accumulating the scores of all the outputs of the scientific research personnel. Based on the model, the calculation formula of the score of a certain type of research result of the scientific research personnel is as follows:
Figure BDA0002804376880000081
where P is the influence score for a given type of research result of the researcher, n is the number of research results of that type, S_i is the score of the last-level index corresponding to the researcher's i-th research output (that is, the output score of that output), and W_i is the score weight (determined by the size of the contribution) corresponding to the researcher's i-th research output. For example, if a researcher's scientific and technological achievement wins a provincial-level third prize, the researcher's score for that output is the output score for a provincial-level third prize multiplied by the score weight determined by the researcher's contribution to the achievement. The weight parameters are shown in Table 5 (the weights of the research results in the invention are set according to the experience and suggestions of relevant experts in the tobacco field).
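As a rough illustration of the scoring described above, the following Python sketch computes the score of one output type as the sum of S_i × W_i over the researcher's outputs of that type, which is how the text describes the calculation (each output score multiplied by its contribution weight, then accumulated). The numeric scores and weights are invented for the example and are not the values of Table 5, and the function names are hypothetical.

```python
def type_influence(outputs):
    """Score for one output type: sum of S_i * W_i.

    `outputs` is a list of (S_i, W_i) pairs, where S_i is the output score of
    the i-th result and W_i is the researcher's contribution weight for it.
    """
    return sum(s * w for s, w in outputs)


def influence_profile(outputs_by_type):
    """Per-type scores plus their total, e.g. as input for a radar chart."""
    per_type = {t: type_influence(o) for t, o in outputs_by_type.items()}
    return per_type, sum(per_type.values())


# Hypothetical data: two papers and one patent for a single researcher.
example = {
    "paper": [(10, 0.5), (6, 1.0)],
    "patent": [(8, 0.25)],
}
per_type, total = influence_profile(example)
```

Keeping the per-type scores separate matches the model's intent of measuring each output type with its own index system before accumulating a final score.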
TABLE 5 influence index system for scientific research personnel (part of the weight of scientific achievements)
Figure BDA0002804376880000091
S5, screening researcher cooperation data. Based on the researcher data, that is, the relationships between researchers and research results in Table 3 from step S2 (relationship data established after deduplicating the personnel and result data), the number of research results that each pair of researchers has produced together is counted, yielding a research cooperation network (that is, cooperation frequency data). In this embodiment, two researchers are considered to have cooperated once when they jointly produce one research result.
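The pairwise counting in this step can be sketched as follows. The input format (a mapping from a result ID to its list of author names) is an assumption made for illustration; in the source this information comes from the researcher-to-result relationship table.

```python
from collections import Counter
from itertools import combinations


def cooperation_counts(result_authors):
    """Count, for every pair of researchers, how many results they share.

    `result_authors` maps a result ID to the list of its authors.
    Pairs are stored as sorted tuples so (A, B) and (B, A) are the same key.
    """
    counts = Counter()
    for authors in result_authors.values():
        for a, b in combinations(sorted(set(authors)), 2):
            counts[(a, b)] += 1
    return counts


# Hypothetical example: three results and three researchers.
network = cooperation_counts({
    "p1": ["A", "B", "C"],
    "p2": ["B", "A"],
    "p3": ["C"],
})
```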
Furthermore, the researcher cooperation network can be filtered: cooperative relationships whose cooperation count is below a set threshold are deleted, which reduces the interference of incidental cooperation in the graph.
Specifically, according to Price's law, the cooperation frequency threshold is set to N; that is, the cooperation frequency of any two researchers whose cooperation frequency is lower than N is set to 0. N is calculated as follows:
Figure BDA0002804376880000092
where N_max is the highest cooperation frequency among the researchers in the data. After N is calculated, two researchers whose cooperation count is lower than N are considered not to have cooperated.
For example, if the calculated value of N is 2.3, two researchers A and B whose cooperation count is lower than 2.3 are considered not to have cooperated.
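A minimal sketch of the threshold filter follows. The threshold formula itself appears only as an image in the source, so the form used here, N = 0.749 × sqrt(N_max), is an assumption: it is the form in which Price's law is commonly applied in bibliometrics, and the coefficient is exposed as a parameter so that it can be replaced by the patent's actual formula. Function names are hypothetical.

```python
import math


def cooperation_threshold(n_max, coefficient=0.749):
    """Assumed Price's-law threshold: coefficient * sqrt(N_max)."""
    return coefficient * math.sqrt(n_max)


def filter_cooperation(counts):
    """Drop pairs whose cooperation count is below the threshold N."""
    if not counts:
        return {}
    n = cooperation_threshold(max(counts.values()))
    return {pair: c for pair, c in counts.items() if c >= n}


# Hypothetical counts: the highest cooperation frequency is 9.
kept = filter_cooperation({("A", "B"): 9, ("A", "C"): 2, ("B", "C"): 3})
```

With a highest frequency of 9 the assumed threshold is about 2.25, so the pair with only 2 joint results is treated as incidental and removed.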
S6, extracting researchers' research interest preference labels.
Word segmentation is performed on the title, keyword and abstract text of a researcher's research results, and stop words are removed from the segmented words; keywords corresponding to the topics of the researcher's results, together with their frequencies, are extracted with an LDA model; and these keywords are used as the researcher's interest preference labels. The specific steps are as follows:
(1) processing the research result data: the title, keywords and abstract of each result are selected and organized into a CSV file;
(2) segmenting the researcher's academic results into words; the invention uses the jieba word segmentation toolkit for this;
(3) removing stop words: a stop-word lexicon for the tobacco research field is built manually, based on the characteristics of stop words in that field and an existing general stop-word list, and is used to remove stop words;
(4) constructing the word frequency feature matrix shown in Formula 1:
Figure BDA0002804376880000101
where m is the number of keywords remaining after word segmentation and stop-word removal over all research results of a given researcher, n is the number of research results of that researcher, and w_nm in the matrix is the frequency of keyword k_m in result d_n.
Based on the word frequency matrix, the invention processes the data with the gensim toolkit using the bag-of-words method: each keyword is identified by a numeric ID, and a result takes the following form:
[(18,1),(106,2),(134,1),(5,1),(76,1)]
This format indicates that the result contains 5 meaningful keywords, that the keyword with ID 18 appears once, and so on.
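To make the (ID, count) format concrete, the sketch below builds it with the Python standard library only. It is a stand-in for illustration and not the gensim code used by the invention (gensim's `Dictionary.doc2bow` yields the same kind of list); the helper names and the toy tokens are assumptions, and the input is taken to be already segmented and stripped of stop words.

```python
from collections import Counter


def build_dictionary(documents):
    """Assign each distinct keyword a numeric ID in order of first appearance."""
    token_to_id = {}
    for tokens in documents:
        for tok in tokens:
            token_to_id.setdefault(tok, len(token_to_id))
    return token_to_id


def to_bow(tokens, token_to_id):
    """Convert one result's keywords to a sorted list of (ID, count) pairs."""
    counts = Counter(token_to_id[t] for t in tokens if t in token_to_id)
    return sorted(counts.items())


# Hypothetical tokenized results of one researcher.
docs = [["flavor", "cigarette", "flavor"], ["cigarette", "smoke"]]
vocab = build_dictionary(docs)
corpus = [to_bow(d, vocab) for d in docs]
```

Each row of `corpus` is the sparse form of one row of the word frequency matrix: only keywords with a non-zero count are listed.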
The keywords and their IDs, together with the ID and occurrence frequency pairs for each result, are input into the LDA model in gensim. After the LDA model is trained, the topic of each of the researcher's results is obtained (the top-ranked topic by the probability that the result belongs to it), and the keywords under each topic (the top 10 by the probability that the keyword belongs to the topic) are taken as research interest keywords. After keywords have been extracted from all of the researcher's results, they form a keyword set; the weight of a keyword is the number of times it appears in the set. This finally yields the researcher's research interest preference labels, namely the keywords and their frequencies.
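The final aggregation step can be illustrated as below. The sketch assumes the LDA stage has already produced, for each result, the top keywords of its most probable topic; the LDA training itself is not shown, and the function name and sample keywords are hypothetical.

```python
from collections import Counter


def interest_labels(topic_keywords_per_result):
    """Merge per-result topic keywords into interest labels with frequencies.

    `topic_keywords_per_result` holds one keyword list per research result.
    The weight of a keyword is the number of times it occurs in the merged set.
    """
    counts = Counter()
    for keywords in topic_keywords_per_result:
        counts.update(keywords)
    return counts


# Hypothetical topic keywords for two results of the same researcher.
labels = interest_labels([["aroma", "nicotine"], ["nicotine", "filter"]])
```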
S7, recommending researchers with similar research interests. A researcher's research interests are not constant; they change with the objective environment or subjective interest, and such changes include the emergence of new interests and the strengthening or weakening of existing ones. A forgetting factor is therefore introduced when computing recommendations: the decayed weight of each research interest keyword is calculated so that the current research interests of the user are kept up to date, which makes the recommended group of researchers with similar interests timely. The specific steps are as follows:
Based on step S6, the keywords expressing each researcher's research interests and their frequencies are obtained, and the frequency of each keyword is multiplied by the forgetting factor α to give the weight of the keyword label. The time-based forgetting factor is given by:
Figure BDA0002804376880000111
where α is the forgetting factor of the keyword; y_cur is the current time (year); y_pub is the year in which the keyword first appeared in the researcher's results; and hl is the half-life of the keyword decay, which can be obtained by training on a large number of keyword labels.
With ω_i, obtained in S6, denoting the frequency with which the researcher's i-th keyword appears in the researcher's research results, the weight of the research interest keyword used to calculate similarity is:
t = ω_i × α
The keyword weight used to calculate research interest similarity therefore reflects both the user's degree of preference for the keyword and the time factor.
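The decayed weight t = ω_i × α can be sketched as follows. The forgetting-factor formula is only available as an image in the source, so the exponential half-life form used here, α = exp(−ln 2 × (y_cur − y_pub) / hl), is an assumption chosen because it halves the weight every hl years; the function names and sample values are hypothetical.

```python
import math


def forgetting_factor(y_cur, y_pub, hl):
    """Assumed half-life decay: the factor halves every `hl` years."""
    return math.exp(-math.log(2) * (y_cur - y_pub) / hl)


def decayed_weights(freqs, first_year, y_cur, hl):
    """Compute t = omega_i * alpha for every keyword of one researcher.

    `freqs` maps keyword -> frequency (omega_i from S6);
    `first_year` maps keyword -> year it first appeared in the results.
    """
    return {
        k: f * forgetting_factor(y_cur, first_year[k], hl)
        for k, f in freqs.items()
    }


# Hypothetical keywords with a half-life of 5 years, evaluated in 2020.
weights = decayed_weights(
    {"nicotine": 4, "filter": 2},
    {"nicotine": 2015, "filter": 2020},
    y_cur=2020,
    hl=5,
)
```

Under this assumption a keyword first seen one half-life ago keeps half of its frequency as weight, while a keyword first seen in the current year keeps all of it.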
To find researchers with similar research interests, the invention applies the cosine similarity algorithm to researchers' research interest keyword sets and obtains the top Q researchers whose research interests are most similar to those of a given researcher. The calculation is as follows:
To calculate the similarity between researchers A and B, let K_A and K_B be the keyword lists of researcher A and researcher B, let K be the union of the keywords in K_A and K_B, and let t_A(K_i) and t_B(K_i) be the research interest keyword weights of the i-th keyword K_i for A and B respectively. The research interest similarity is calculated as:
Figure BDA0002804376880000112
The research interest similarity between the researcher and every other researcher in the analysis data source is calculated pairwise. The resulting similarity values are sorted in descending order, and the top Q researchers (for example, the top 10) are recommended as the group of researchers with similar research interests.
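The similarity ranking can be sketched with standard cosine similarity over the union of the two keyword sets, where a keyword missing from one researcher's list has weight 0. The dictionaries, names and `q` parameter below are illustrative assumptions; the weights are taken to be the decayed weights t from the previous step.

```python
import math


def cosine_similarity(t_a, t_b):
    """Cosine similarity of two keyword-weight dictionaries."""
    keys = set(t_a) | set(t_b)  # K = union of K_A and K_B
    dot = sum(t_a.get(k, 0.0) * t_b.get(k, 0.0) for k in keys)
    norm_a = math.sqrt(sum(v * v for v in t_a.values()))
    norm_b = math.sqrt(sum(v * v for v in t_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # a researcher without keywords is similar to nobody
    return dot / (norm_a * norm_b)


def recommend(target, others, q=10):
    """Return the q researchers most similar to `target`, best first."""
    scored = [(name, cosine_similarity(target, w)) for name, w in others.items()]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:q]


# Hypothetical keyword weights for a target researcher and three candidates.
top = recommend(
    {"aroma": 1.0, "nicotine": 1.0},
    {"X": {"aroma": 1.0, "nicotine": 1.0}, "Y": {"filter": 1.0}, "Z": {"aroma": 1.0}},
    q=2,
)
```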
S8, constructing and visualizing the multi-dimensional academic portrait of researchers.
A user portrait is constructed and presented visually based on the data screening and processing methods of S3, S4, S5 and S6.
For the academic influence dimension, the influence score of each type of research result is calculated with the personal influence model, and the research influence is presented as a radar chart, as shown in Fig. 3;
for the research relationship dimension, the cooperative relationships in the current researcher's results are filtered by cooperation count, and the data that reflect the cooperative relationships are presented in the form shown in Fig. 4, in which the researcher is at the center and the length of the line connecting the researcher to each collaborator is defined by the frequency of their cooperation;
for the research result dimension, the current researcher's results are classified by year, and the number and change of the different types of results per year are presented as a flow diagram, as shown in Fig. 5, showing how the researcher's output evolves over time. The research result dimension is also presented as a relationship graph, as shown in Fig. 6: the central node is the current researcher, connected by lines to the research results, and the larger the researcher's influence factor in a result, the shorter the connecting line;
for the research interest dimension, the research interest keywords obtained from the researcher's results and their frequency weights are combined with the time at which they appear to present the researcher's research topics visually. Fig. 7 presents the researcher's research topics and their frequencies for each year along the time axis.
For the recommendation of researchers with similar research interests, similarity is calculated from the researchers' research interest keywords and their forgetting-factor weights to obtain a group of researchers with similar interests, and the recommended group is presented visually, including the researchers' names and institution information.
The above embodiments are intended only to illustrate the technical solution of the invention, not to limit it. A person skilled in the art may modify the technical solution or substitute equivalents without departing from the spirit and scope of the invention, and the scope of protection shall be determined by the claims.

Claims (10)

1. A multi-dimensional portrait construction method for scientific researchers in the tobacco field comprises the following steps:
1) acquiring tobacco scientific research data; the tobacco scientific research data comprises scientific research personnel data and scientific research result data, and the scientific research personnel basic attribute data comprises names, sexes, organizations, research directions, contact telephones and electronic mailboxes of the scientific research personnel; the types of the scientific research result data comprise papers, patents, standards, monographs and scientific and technological results, and the contents of the scientific research result data comprise titles, keywords, abstracts, authors and publication dates; then, establishing related information of scientific research results participated by scientific research personnel and marking the ranking of the scientific research personnel in the participated scientific research results according to the tobacco scientific research data;
2) fusing repeated data in scientific research personnel and scientific research achievements;
3) establishing basic attribute dimensions of scientific researchers including names, sexes, institutions, research directions and research results based on the tobacco scientific research data processed in the step 2);
4) establishing academic influence dimensionality of the scientific researchers based on the scientific researchers and the related achievement data of the scientific researchers in the step 3);
5) establishing scientific research relation dimensionality of scientific researchers based on the tobacco scientific research data processed in the step 2);
6) classifying the scientific research achievements of each scientific research personnel based on the data acquired in the step 3), and establishing the dimensionality of the scientific research achievements of the scientific research personnel;
7) segmenting the title, keyword and abstract information of each research result of a researcher u into words, and then deleting stop words from the segmentation result based on a stop-word lexicon for the tobacco field; then constructing a word frequency feature matrix according to the frequency with which each word appears in each research result of researcher u; then numbering each word using the bag-of-words method and obtaining, based on the word frequency feature matrix, the occurrence frequency of each word in each research result of researcher u; then inputting the words and their numbers, together with the association between the numbers and their occurrence frequencies in the corresponding research results, into an LDA model to obtain keywords corresponding to the topics of the research results of researcher u as the user interest preference labels of researcher u, thereby establishing the research interest dimension of researcher u;
8) and visualizing the corresponding scientific researchers according to the dimensional information of the scientific researchers to obtain the multi-dimensional portrait of the scientific researchers.
2. The method of claim 1, wherein the method for fusing the repeated data in the scientific research personnel and the scientific research results comprises the following steps: a) if the two scientific researchers have the same name, mechanism, contact phone and email, deleting one of the scientific researcher data; b) if two scientific researchers with the same name have a cooperative relationship with the same scientific researcher, deleting data of one scientific researcher of the two scientific researchers with the same name; c) if two scientific researchers with the same name have the same research results, deleting data of one of the two scientific researchers with the same name; and if the two scientific achievements have the same achievement title, author and date, deleting one of the two scientific achievements.
3. The method of claim 1 or 2, wherein the scientific researchers academic influence dimension is established by: the designed scientific research personnel influence index system acquires the academic output influence of the scientific research personnel; the output influence of the scientific research personnel comprises 5 dimensions of thesis, special works, patents, standards and scientific and technological achievements, according to the influence index system of the scientific research personnel, the output influence value of each type of scientific research achievement of the scientific research personnel is respectively calculated, and the academic influence of the corresponding scientific research personnel is obtained according to the output influence value of each type of scientific research achievement.
4. The method of claim 3, wherein the yield impact value of each type of scientific result is calculated by the scientific personnel as
Figure FDA0002804376870000021
Wherein P is the output influence value of a set type of research result of the researcher, n is the number of research results of the set type for the researcher, S_i is the index score corresponding to the i-th research result of the set type for the researcher, and W_i is the scoring weight corresponding to the i-th research result.
5. The method of claim 1, wherein the method for establishing scientific relationship dimensions of the researcher comprises: establishing a cooperative relationship network among scientific researchers based on tobacco scientific research data; wherein if two scientific researchers participate in the research work of the M academic achievements at the same time, the cooperation is considered as M times of cooperation; and deleting the scientific research partners with the cooperative frequency lower than a set threshold value for each scientific research personnel to obtain the scientific research relation dimensionality of the corresponding scientific research personnel.
6. The method of claim 5, wherein the threshold is set
Figure FDA0002804376870000022
Wherein N_max is the highest number of times of cooperation of researchers in the cooperative relationship network.
7. The method of claim 1, wherein the multi-dimensional representation of the researcher is obtained by visualizing dimensional information of the researcher through a radar map, a flow map or a relational map.
8. A scientific research personnel recommendation method comprises the following steps:
1) acquiring tobacco scientific research data; the tobacco scientific research data comprises scientific research personnel data and scientific research result data, and the scientific research personnel basic attribute data comprises names, sexes, organizations, research directions, contact telephones and electronic mailboxes of the scientific research personnel; the types of the scientific research result data comprise papers, patents, standards, monographs and scientific and technological results, and the contents of the scientific research result data comprise titles, keywords, abstracts, authors and publication dates; then, establishing related information of scientific research results participated by scientific research personnel and marking the ranking of the scientific research personnel in the participated scientific research results according to the tobacco scientific research data;
2) fusing repeated data in scientific research personnel and scientific research achievements;
3) establishing basic attribute dimensions of scientific researchers including names, sexes, institutions, research directions and research results based on the tobacco scientific research data processed in the step 2);
4) establishing academic influence dimensionality of the scientific researchers based on the scientific researchers and the related achievement data of the scientific researchers in the step 3);
5) establishing scientific research relation dimensionality of scientific researchers based on the tobacco scientific research data processed in the step 2);
6) classifying the scientific research achievements of each scientific research personnel based on the data acquired in the step 3), and establishing the dimensionality of the scientific research achievements of the scientific research personnel;
7) segmenting the title, keyword and abstract information of each research result of a researcher u into words, and then deleting stop words from the segmentation result based on a stop-word lexicon for the tobacco field; then constructing a word frequency feature matrix according to the frequency with which each word appears in each research result of researcher u; then numbering each word using the bag-of-words method and obtaining, based on the word frequency feature matrix, the occurrence frequency of each word in each research result of researcher u; then inputting the words and their numbers, together with the association between the numbers and their occurrence frequencies in the corresponding research results, into an LDA model to obtain keywords corresponding to the topics of the research results of researcher u as the user interest preference labels of researcher u, thereby establishing the research interest dimension of researcher u;
8) based on the keywords and weight information of the scientific research interest dimension of the scientific researchers in the tobacco field in the step 7), updating the research interest of the corresponding scientific researchers in real time by integrating a forgetting factor based on time into the weight of the keywords;
9) for a target scientific research personnel, calculating the similarity between the research interests of the scientific research personnel and the research interests of the target scientific research personnel, selecting the first Q most similar scientific research personnel to recommend to the target scientific research personnel, and visualizing the multi-dimensional portrait of the first Q scientific research personnel.
9. The method of claim 8, wherein a formula is utilized
Figure FDA0002804376870000031
to calculate the similarity between researcher A and target researcher B; wherein K_A is the keyword list of researcher A, K_B is the keyword list of target researcher B, K is the union of K_A and K_B, t_A(K_i) is the weight of keyword K_i for researcher A, and t_B(K_i) is the weight of keyword K_i for target researcher B.
10. The method of claim 8, wherein, with ω_i denoting the frequency with which the i-th keyword of a researcher appears in that researcher's research results, the weight of the keyword is t = ω_i × α, where α is a time-based forgetting factor.
CN202011362431.4A 2020-11-27 2020-11-27 Multi-dimensional portrait construction method and recommendation method for scientific researchers in tobacco field Pending CN112418695A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011362431.4A CN112418695A (en) 2020-11-27 2020-11-27 Multi-dimensional portrait construction method and recommendation method for scientific researchers in tobacco field

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011362431.4A CN112418695A (en) 2020-11-27 2020-11-27 Multi-dimensional portrait construction method and recommendation method for scientific researchers in tobacco field

Publications (1)

Publication Number Publication Date
CN112418695A true CN112418695A (en) 2021-02-26

Family

ID=74843320

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011362431.4A Pending CN112418695A (en) 2020-11-27 2020-11-27 Multi-dimensional portrait construction method and recommendation method for scientific researchers in tobacco field

Country Status (1)

Country Link
CN (1) CN112418695A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20210226