
CN111428152A - Method and device for constructing similar communities of scientific research personnel

Info

Publication number: CN111428152A
Application number: CN202010339353.XA
Authority: CN
Inventors: 郑新章; 王锐; 王永胜; 刘亚丽; 冯伟华; 贾楠; 王迪; 宗国浩; 王峙
Original assignee: Zhengzhou Tobacco Research Institute of CNTC
Current assignee: Zhengzhou Tobacco Research Institute of CNTC
Priority date: 2020-04-26
Filing date: 2020-04-26
Publication date: 2020-07-17
Anticipated expiration: 2040-04-26
Also published as: CN111428152B

Abstract

The invention relates to a method and a device for constructing similar communities of scientific research personnel, and belongs to the technical field of data processing. The method comprises the following steps: acquiring co-citation data, wherein the co-citation data comprises scientific research documents, the citation relations among the scientific research documents, and the authors of the scientific research documents; establishing an author influence model and calculating each author's influence; screening the co-citation data, wherein the screening comprises deleting from the co-citation data the authors whose influence is lower than an influence set value; generating an author co-citation network from the screened co-citation data, wherein the author co-citation network comprises the number of co-citations between each author and every other author, the number of co-citations being the number of times that papers of the two authors are cited together by another paper; and generating similar communities of scientific research personnel by applying a community discovery algorithm to the author co-citation network and the influence of the scientific research personnel. The invention reduces the amount of data to be processed and improves the construction efficiency and the information accuracy of the map.

Description

Method and device for constructing similar communities of scientific research personnel

Technical Field

The invention relates to a method and a device for constructing similar communities of scientific research personnel, and belongs to the technical field of data processing.

Background

In a scientific and technological knowledge map, a similar community is a group of scientific researchers with similar research interests. It is obtained by constructing a research-interest similarity network of the researchers, identifying the community structure and the relations between communities, and displaying them as a visual network map. Nodes in the map represent researchers and edges represent similarity relations between them, so the map shows the academic groups formed by clustering on research-interest similarity. Similar community networks in a knowledge map share characteristics and community structure with real-life social networks: people belong to different communities, the whole network is composed of several communities, connections between nodes within a community are relatively dense, and connections between communities are relatively sparse. The size of each node and the thickness of each edge carry meaning: they reveal the influence of the node and how close the research directions of the connected researchers are. Community discovery finds groups with modular structure in a complex network and, combined with domain knowledge data, supports exploration of the community structure of the people in a field.

Community discovery algorithms were first applied to social networks, to find social groups with shared interests and hobbies. In science and technology, years of development in every field have produced a large body of valuable research results, and these results reveal association relations among researchers from many angles: project cooperation relations reflect collaboration between researchers, while paper citation relations and co-citation relations reflect commonality in a research direction. The co-citation relation comes from the co-citation analysis theory of author similarity: when documents by two authors are cited at the same time by a document of a third author, the two authors are said to have a co-citation relation. The higher the co-citation frequency of two authors, the closer their academic relationship. Documents by two authors that are often cited together indicate that the authors are related in the subject, concepts, theory or methodology of their research. Community discovery based on co-citation data among a large number of authors can therefore reveal communities of researchers with the same research interests and directions, and supports decisions such as recommending potential partners and surveying frontier information in the corresponding research field.

However, a research field involves many researchers, large numbers of research documents have been published over many years, and the citation relations among them are intricate. Moreover, with today's intensive technological development, research in any field or direction cannot be carried out in isolation: work on a subject in one field depends on technical support from other fields, which has produced many interdisciplinary areas and increasingly fine-grained technical branches. As a result, auxiliary or supporting technologies from other fields appear in every research direction of every field. In a given technical field, for example tobacco technology, papers cite not only papers by researchers active in the field but also a large number of papers from cross-disciplinary or other fields, so the co-citation relations and the related data become complex and voluminous. Applying a traditional social-network community discovery algorithm to such data requires a large amount of computation, is inefficient, occupies substantial hardware resources, and makes real-time or frequent updates impractical, so the information lags and the accuracy of the information map is hard to maintain.
Meanwhile, each researcher (author) in the co-citation data is the smallest unit of a similar community and is also a node on the final map. A large number of nodes makes the map redundant and hard to read, so that direct and useful information is difficult to extract. In addition, the similar community map is meant to reveal groups of researchers with similar research directions and interests. A large amount of citation data for papers from cross-disciplinary or other fields biases the communities discovered from the co-citation data, that is, it lowers the similarity of research directions and interests among the researchers in a community. Because the research interest or direction of a community is inferred from the research directions of the researchers that are its nodes, a large number of researchers from other fields distorts that inference, causes deviations or even errors in the academic directions and topics the community reflects, ultimately conveys wrong technical information, and greatly degrades the user experience.

Disclosure of Invention

The invention aims to provide a method and a device for constructing similar communities of scientific research personnel, which are used for solving the problems that the data processing capacity is large and the generated similar communities have deviation to influence the user experience when the conventional community discovery method is used for constructing similar communities in the technical field.

In order to achieve the above object, the scheme of the invention comprises:

the invention relates to a method for constructing a similar community of scientific research personnel, which comprises the following steps:

1) acquiring citation relation data, wherein the citation relation data comprises scientific research documents, citation relations among the scientific research documents and authors corresponding to the scientific research documents;

2) establishing an author influence model, and calculating the author influence;

3) screening the citation relation data, the screening comprising: deleting from the citation relation data the authors whose influence is lower than an influence set value;

4) generating an author co-citation network from the screened citation relation data, wherein the author co-citation network comprises the number of co-citations between each author and every other author, the number of co-citations being the number of times that scientific research documents of the two authors are cited together by another scientific research document;

5) generating a similar community map of the scientific research personnel by applying a community discovery algorithm to the author co-citation network and the influence of the scientific research personnel.
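The co-citation count in step 4) can be illustrated with a minimal Python sketch. The function name, the dictionary layout of the inputs, and the choice to count each author pair at most once per citing document are assumptions made for illustration; the text does not fix these details.

```python
from collections import Counter
from itertools import combinations


def author_cocitation_counts(citations, paper_authors, kept_authors):
    """Count how often documents by two authors are cited together.

    citations: dict mapping a citing document id to the set of ids it cites.
    paper_authors: dict mapping a document id to the list of its authors.
    kept_authors: authors that survived the screening of step 3).
    """
    counts = Counter()
    for cited_papers in citations.values():
        # Screened authors whose documents appear in this reference list.
        authors = set()
        for paper in cited_papers:
            authors.update(a for a in paper_authors.get(paper, []) if a in kept_authors)
        # Assumption: each unordered author pair counts once per citing document.
        for a, b in combinations(sorted(authors), 2):
            counts[(a, b)] += 1
    return counts


# Hypothetical data: two citing documents, three cited documents.
citations = {"p10": {"p1", "p2"}, "p11": {"p1", "p2", "p3"}}
paper_authors = {"p1": ["Li"], "p2": ["Wang"], "p3": ["Zhao"]}
counts = author_cocitation_counts(citations, paper_authors, {"Li", "Wang"})
print(counts[("Li", "Wang")])
```

Because "Zhao" is not among the kept authors in this example, no edge involving that author is produced, which is the intended effect of screening before the network is built.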

In a research field, the researchers, the papers they write and the citation relations among those papers are numerous and complex. Papers in the field cite papers on supporting or auxiliary technologies, and the authors of such papers have reference value in the community map of the field (the research field for which similar communities are being discovered). However, many papers from other fields (research fields other than the one under analysis) are cited by papers in the field only a few times, and only within one or a few research directions. The authors of such papers have little reference value in the similar community map of the field, yet their citation data still affects similar community discovery in the field and the final community map.

Handling this data is difficult. Simply deleting or filtering out all researchers from outside the field, together with their papers, is not advisable, for two reasons. First, cross-disciplinary research is now common, many researchers work across fields, and it is difficult to attach a definite field label to each researcher. Second, researchers from other fields are valuable in the similarity map of the field in some cases: their presence helps identify the key to technical problems unique to the field, or the outside technologies and researchers on which those problems depend. Such researchers are significant as technical intelligence and greatly improve the practical value of the similar community map.

To address this data-processing problem, the citation data of the research literature is filtered by the influence of the authors in the corresponding research field before the community discovery algorithm is run. Authors (researchers) whose papers are cited only a few times by papers in the field have low influence in the field, whereas researchers from other technical fields whose work is key to a particular technical problem in the field have high influence in the field. The method of the invention therefore, first, removes researchers from other fields whose papers are rarely cited in the field and have low reference value there; second, filters out researchers who are in the field but have little reference value as technical intelligence; and third, avoids filtering out researchers who, although they work in cross-disciplinary or other fields, provide good support for key technical problems in the field. The accuracy of the data source is thereby ensured.

The method effectively reduces the low-value data nodes that would otherwise cause interference in the community map, improves the readability of the map, and improves the user experience. It also reduces deviations in the research interests, directions and topics reflected by the generated similar communities. Compared with adjusting parameters and thresholds inside the community discovery algorithm, the method reduces computational complexity, improves computational efficiency, and provides technical support for updating the community map more frequently.

Further, the influence set value N is:

wherein $N_{max}$ is the highest author influence value in the citation relation data.

Further, in the step 3), the screening further includes deleting the author whose publication amount of the scientific research literature is lower than the publication amount set value from the citation relation data.

Further, the publication amount set value M is:

wherein $M_{max}$ is the highest publication amount of any author in the citation relation data.

Further, in step 3), the screening further includes deleting from the citation relation data the authors whose published scientific research documents have been cited fewer times than a citation amount set value.

Further, the citation amount set value R is:

wherein $R_{max}$ is the highest value, over all authors in the citation relation data, of the total number of times the documents written by one author have been cited.

In this method, the similarity analysis of researchers builds the author similarity network from the co-citation relations between the papers the researchers have published, so the quality of the co-citation data directly determines the quality of the discovered community structure. The publication amount and the citation amount of a researcher are therefore used as selection factors: the former reflects the researcher's output and research capability, and the latter reflects the quality of the research results. This gives three author selection criteria for constructing the author similarity network: the author's influence as a researcher in the field, a minimum publication amount, and a minimum citation amount. Interference data that cannot truly and accurately reflect the research information of the generated communities is thereby filtered out.

In addition, if the number of papers a researcher has published, or the number of times they have been cited, does not reach a certain level, the co-citation data of those papers cannot correctly reflect how the researcher's research interests relate to those of other researchers, so the researcher has very little reference value in the similar community map.

The method uses Price's law to determine the minimum influence value, minimum publication amount and minimum citation amount of core researchers, so the filtering standard is scientific and effective.
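The expressions for N, M and R are not reproduced in this text. For orientation only, the law named here, usually known in bibliometrics as Price's law, is commonly stated with the constant 0.749 applied to the square root of the maximum value. Assuming that common form carries over to all three set values, the thresholds would read:

```latex
% Assumed form, based on the common statement of Price's law;
% the constant 0.749 does not appear in this text.
N = 0.749\sqrt{N_{max}}, \qquad
M = 0.749\sqrt{M_{max}}, \qquad
R = 0.749\sqrt{R_{max}}
```

Under this assumption, an author set whose highest influence value is 100 would have an influence set value of about 7.49.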

Further, in the author influence model, the author influence is at least determined by the number of scientific outputs of the author, and the scientific outputs include the following categories: articles and patents.

Researchers in the field who have little output and low-quality results naturally have little influence, and their results have little reference value as technical intelligence. For researchers in other or cross-disciplinary fields, if their papers are cited many times by papers in the field but they have few corresponding outputs in the field, the papers only help solve conventional problems and do not reflect the special technical problems and solutions of the field. As technical intelligence, the technology such data represents is highly replaceable in the field, and its reference value in the map is small (that is, researchers from other fields solving conventional problems in the field).

By contrast, when researchers from other or cross-disciplinary fields contribute technology that solves problems unique to the field and leads to innovative results, they go on to produce a certain amount of output in the field, such as patent applications or new papers (that is, researchers from other fields solving unique problems in the field).

Therefore, using research outputs in the field, such as papers and patents, as the decisive factor in the influence calculation model gives better filtering and better accuracy.

The difference between "researchers from other fields solving conventional problems in the field" and "researchers from other fields solving unique problems in the field" is illustrated as follows. Suppose researchers in the chemical field synthesize an organic compound and publish papers on it, and the compound, used as a cigarette additive, greatly improves the taste of cigarettes and reduces the generation of harmful substances. The compound becomes a distinctive innovation in the tobacco field, and the papers on it are cited to some extent in the tobacco field. In this case the chemistry researchers will go on to produce considerable output in the tobacco technology field, such as new papers (for example on the application of the compound in cigarette processing), patents, and even national or industry standards on the permitted amount of the additive. Now suppose researchers in the control field develop an efficient closed-loop feedback method for furnace temperature control that controls the temperature more accurately and reduces energy consumption, and publish papers on it. A tobacco paper on tobacco leaf moisture regain cites one of these papers and proposes a furnace temperature control method for moisture regain that has low energy consumption and good results. The citations are confined to the moisture regain direction of the tobacco field and are few in number, and heating and temperature control for moisture regain is highly replaceable, since other temperature control methods could be used. The control researchers therefore have low reference value in the similar community map of the tobacco field. Leaving them in the map as nodes causes information interference, and a large amount of such data increases the computation of community discovery, increases the complexity of the map and reduces its readability. Because temperature control technology is highly replaceable in the tobacco field, the control researchers have few outputs in the tobacco technology field, and their outputs are produced in the control field instead. Based on the method and the influence model, such interference data is effectively filtered out, the amount of computation is reduced, and the readability of the map is improved.

Further, the author influence model is as follows:

$$P = \sum_{i=1}^{n} S_i W_i$$

wherein $P$ is the influence, $n$ is the number of scientific research outputs of the author, $S_i$ is the set score of the category to which the $i$-th scientific research output of the author belongs, and $W_i$ is the set weight of the $i$-th scientific research output of the author within that category.

The influence calculation model is based on the quantity and quality of a researcher's scientific research outputs in the research field for which the map is generated. Different base scores and weights are set for different types of output (such as award-winning achievements, papers or standards) and for different grades within the same type of output (such as achievements that win national-level versus provincial-level science and technology awards). The model calculates the influence of researchers in the field objectively and accurately, and the resulting values are easy to compare.
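The influence model can be sketched in a few lines of Python as a sum of category scores multiplied by weights. The category names, the score values in `CATEGORY_SCORE` and the `Output` type are hypothetical placeholders chosen for illustration; the text does not give concrete scores.

```python
from dataclasses import dataclass


@dataclass
class Output:
    category: str   # category of the research output, e.g. "paper" or "patent"
    weight: float   # W_i, the set weight of this output within its category


# Hypothetical set scores S_i per category (not taken from the source).
CATEGORY_SCORE = {"paper": 10.0, "patent": 15.0, "standard": 20.0}


def influence(outputs):
    """Return P, the sum over all outputs of S_i * W_i."""
    return sum(CATEGORY_SCORE[o.category] * o.weight for o in outputs)


p = influence([Output("paper", 1.0), Output("patent", 0.5)])
print(p)
```

Because the result is a single number per author, the values can be compared directly against an influence set value during screening.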

Further, in step 4), when the co-citation network is generated, the number of co-citations between two researchers whose number of co-citations is lower than a set value is recorded as 0.

A co-citation relation with fewer than a certain number of co-citations can hardly reflect a real research similarity between the two authors. Deleting such relations reduces the computation of community discovery and improves the accuracy of the community map.
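As a small illustration of this thresholding, the sketch below sets every co-citation count under a minimum to 0. The function name, the dictionary representation of the network and the example threshold are assumptions for illustration.

```python
def drop_weak_cocitations(counts, min_count):
    """Record co-citation counts lower than min_count as 0."""
    return {pair: (n if n >= min_count else 0) for pair, n in counts.items()}


# Hypothetical co-citation counts between author pairs.
counts = {("Li", "Wang"): 5, ("Li", "Zhao"): 1}
filtered = drop_weak_cocitations(counts, min_count=2)
print(filtered)
```

An implementation could equally remove the zeroed pairs from the network altogether, which would shrink the input to the community discovery step.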

Further, in step 4), when the researcher similarity network is generated, for any two scientific research documents published earlier than a set time, each co-citation of the authors of the two documents that arises from those documents is recorded as a, where 0 < a < 1. When the number of co-citations is calculated, different mapping methods can be used to map a single co-citation in a specific period to a value between 0 and 1, so as to reflect the influence of different periods.

When the similarity matrix is constructed, the similarity between authors is calculated mainly from the co-citation frequency. To keep the analysis results timely and make the information reflected by the generated communities more valuable, a time-sequence weight parameter is applied to the co-citation relation when the author similarity value is calculated. For example, each co-citation in which the co-cited papers were published earlier than a certain year is multiplied by a weight coefficient greater than 0 and less than 1.
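The time weighting can be sketched as follows. The function names, the default value `a=0.5`, the cutoff year and the rule that both co-cited documents must predate the cutoff are assumptions for illustration; the text only requires that an old co-citation count as some value strictly between 0 and 1.

```python
def cocitation_weight(year_a, year_b, cutoff_year, a=0.5):
    """Weight of one co-citation of two documents published in year_a and year_b."""
    if not 0 < a < 1:
        raise ValueError("a must satisfy 0 < a < 1")
    # Assumption: the reduced weight applies when both documents predate the cutoff.
    if year_a < cutoff_year and year_b < cutoff_year:
        return a
    return 1.0


def weighted_cocitation_count(events, cutoff_year, a=0.5):
    """Sum the weights of all co-citation events between two authors."""
    return sum(cocitation_weight(ya, yb, cutoff_year, a) for ya, yb in events)


# Hypothetical publication years of the two co-cited documents in each event.
events = [(2001, 2003), (2015, 2018), (1999, 2016)]
total = weighted_cocitation_count(events, cutoff_year=2010)
print(total)
```

A graded mapping, for example one that decreases with the age of the documents, would fit the same interface by replacing the body of `cocitation_weight`.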

Further, a topic label is added to each community in the similar community map of the researchers. The topic label consists of a set number of keywords with the highest word frequency among the keywords of the papers published by the authors that are nodes in the community.

Using as labels the most frequent keywords among all keywords of all papers published by all researchers in a community reflects the research direction and research focus of the community, improves the readability of the map, and helps readers identify the research topic of each community.
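Selecting the topic labels amounts to a frequency count over the keywords of a community's papers. The sketch below assumes a hypothetical mapping from each author to the keywords of that author's papers; the alphabetical tie-break is an added assumption to make the result deterministic.

```python
from collections import Counter


def community_topic_labels(community_authors, author_keywords, k=3):
    """Return the k most frequent keywords over all papers of the community's authors."""
    freq = Counter()
    for author in community_authors:
        freq.update(author_keywords.get(author, []))
    # Sort by descending frequency, then alphabetically to break ties.
    ranked = sorted(freq.items(), key=lambda kv: (-kv[1], kv[0]))
    return [keyword for keyword, _ in ranked[:k]]


# Hypothetical keywords collected from each author's papers.
author_keywords = {
    "Li": ["flavor", "nicotine", "flavor"],
    "Wang": ["flavor", "nicotine", "curing"],
}
labels = community_topic_labels(["Li", "Wang"], author_keywords, k=2)
print(labels)
```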

The device for constructing the similar communities of the scientific research personnel comprises a processor and a memory, wherein the processor executes instructions stored in the memory so as to realize the method for constructing the similar communities of the scientific research personnel.

Drawings

FIG. 1 is a flow chart of a method for constructing a similar community of scientific researchers according to the present invention;

FIG. 2 is a schematic diagram of a researcher influence model;

FIG. 3 is an example of a similar community map generated by the method of the present invention in the field of tobacco research.

Detailed Description

The present invention will be described in further detail with reference to the accompanying drawings.

The method comprises the following steps:

The method for constructing similar communities of scientific researchers is explained below, taking the field of tobacco research as an example. Because the similar communities are constructed from the citation data of scientific and technological documents, such as papers, published by the researchers, the authors are also referred to herein as the corresponding researchers. The flow of the method is shown in FIG. 1 and comprises the following steps:

S1, acquiring citation relation data in the tobacco field, wherein the citation relation data comprises document entities, relations and attribute data. Acquiring the co-citation data mainly involves extracting, fusing and screening the entities, relations and attribute data of tobacco scientific literature. For the tobacco field, the data sources of tobacco scientific literature include data in formats such as XML and HTML. First, core metadata is established for each entity, relation and attribute. One association is established between tobacco researchers and research documents, namely the authorship relation between a researcher and a document; one association is established between tobacco research documents, namely the citation relation between documents. The core metadata templates for the tobacco researcher entity, the research literature entity, the paper publication relation and the paper citation relation are shown in Tables 1, 2, 3 and 4 below:

TABLE 1 tobacco scientist entity core metadata template

TABLE 2 core metadata template for tobacco scientific research literature entity

Serial number	Attribute name	Type	Description
1	id	long	Unique identifier of the paper
2	name	text	Title of the paper
3	author	text	Authors of the paper
4	issue_year	int	Year of publication
5	paper_type	text	Paper type
6	reference_num	int	Number of cited papers

TABLE 3 Paper publication relation core metadata template

Serial number	Attribute name	Type	Description
1	person_code	long	Unique identifier of the researcher
2	paper_code	long	Unique identifier of the research paper

TABLE 4 Paper citation relation core metadata template

Serial number	Attribute name	Type	Description
1	paper_code	long	Unique identifier of the paper
2	paper_code	long	Unique identifier of the paper

The different data sources are parsed and the data is stored according to these templates to form the citation relation data, in preparation for data cleaning and fusion.
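The templates in Tables 2 to 4 can be mirrored as simple record types. The sketch below uses Python dataclasses. Table 4 lists the attribute name `paper_code` twice, so the field names `citing_paper_code` and `cited_paper_code`, and the direction they imply, are hypothetical names introduced here to tell the two apart. The `long` type is mapped to Python `int`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Paper:
    """Research literature entity (Table 2)."""
    id: int
    name: str
    author: str
    issue_year: int
    paper_type: str
    reference_num: int


@dataclass(frozen=True)
class Authorship:
    """Paper publication relation (Table 3)."""
    person_code: int
    paper_code: int


@dataclass(frozen=True)
class Citation:
    """Paper citation relation (Table 4); field names are hypothetical."""
    citing_paper_code: int
    cited_paper_code: int


def cited_by(paper_code, citations):
    """Return the codes of the papers that cite the given paper."""
    return [c.citing_paper_code for c in citations if c.cited_paper_code == paper_code]


# Hypothetical records.
papers = [Paper(1, "Paper A", "Li", 2015, "journal", 12)]
authorships = [Authorship(person_code=101, paper_code=1)]
citations = [Citation(citing_paper_code=2, cited_paper_code=1)]
print(cited_by(1, citations))
```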

S2, performing data fusion on the raw data. Raw data extracted from different channels differs in record format and content fields, and the extracted researchers and research documents may be duplicated. The data from different sources is therefore integrated according to data fusion rules, its fields are supplemented and completed, and duplicate data is removed. The imported raw data is fused with a multi-source information fusion method based on a Spark cluster; the data processing rules are shown in Table 5. The collected data is organized by processing rules written in a defined ETL data fusion language, whose format is XML or JSON, and the fusion processing and the computing engine run on a big data platform with the distributed processing tool Spark. The process is as follows:

(1) Spark SQL is used to execute the ETL on data of different formats (for example the JSON or XML used by the ETL fusion language): the data is parsed, translated, and then submitted to the big data processing environment for execution.

(2) Spark SQL processes SQL statements in a way similar to a relational database. The SQL statement is first parsed (Parse) into a Tree, and the subsequent steps such as binding and optimization are all operations on the Tree. The operations are expressed as Rules, which apply different operations to different types of node through pattern matching. Tree and Rule work together throughout the processing of the SQL statement to complete parsing, binding, optimization and physical planning, and finally generate an executable physical plan.

(3) When processing data, the database provides execution plans, uses run statistics to select an optimal plan (Optimize), executes it (Execute), and obtains results through the sequence Operation -> Data Source -> Result.

(4) Stream computing is performed on the data with Spark Streaming, including real-time computation, fusion and statistics of the paper publication records of tobacco researchers, and the processed information is stored in a Neo4j database.
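The core idea of the fusion step, integrating records from several sources, completing missing fields and removing duplicates, can be shown without a cluster. The sketch below deliberately substitutes plain Python for Spark so that it runs anywhere. The function name and the rule "keep the first non-empty value per field" are hypothetical, since the fusion rules themselves are not reproduced in this text.

```python
def fuse_records(sources, key="id"):
    """Merge records that share the same key, filling empty fields from later sources."""
    merged = {}
    for records in sources:
        for rec in records:
            target = merged.setdefault(rec[key], {})
            for field, value in rec.items():
                # Hypothetical rule: keep the first non-empty value seen for each field.
                if value not in (None, "") and target.get(field) in (None, ""):
                    target[field] = value
    return list(merged.values())


# Hypothetical records for the same paper coming from two channels.
source_a = [{"id": 1, "name": "A study", "issue_year": None}]
source_b = [{"id": 1, "name": "", "issue_year": 2018}, {"id": 2, "name": "B"}]
fused = fuse_records([source_a, source_b])
print(fused)
```

In a Spark deployment the same logic would typically be expressed as a grouped aggregation over the record key; the choice here only keeps the example self-contained.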

TABLE 5 data fusion rules between researchers and research projects

S3, calculating the influence of the researchers. An influence calculation model for researchers is constructed; the influence is precomputed, written to the graph database as an attribute value, and updated periodically according to the data update frequency.

The influence calculation model is shown in FIG. 2. Influence is calculated from the research outputs of the researcher in the tobacco research field. The model comprises a two-level index system, namely the type of research output and the level of output within that type, and some types of output may have a third-level index, such as the prize grade (first prize, second prize, etc.) for awards. The invention does not limit the types of research output or the division into further indexes. In this embodiment, the final-level index of each type of output is scored: the score of the final-level index (that is, the output score of the output) is obtained from the base score of the output type plus the additional scores of the corresponding second-level and third-level indexes. For types with only second-level indexes (patents, papers, standards and works in FIG. 2), the final-level index is the second-level index and the output is scored under it; for types with third-level indexes (awards in FIG. 2), the output is scored under the third-level index. For each researcher, the output score of a research output is multiplied by a score weight that reflects the researcher's contribution to the output, which may be determined from the signature order of the output or from its original records, such as a work log. The researcher's final influence score is the sum of the researcher's scores over all outputs.
Based on the model, the influence score of a researcher is calculated as:

$$P = \sum_{i=1}^{n} S_i W_i$$

where P is the influence score of the researcher, n is the number of research outputs of the researcher, S_i is the score of the final-level index corresponding to the researcher's i-th research output (i.e., the output score of that output), and W_i is the score weight (determined by the size of the contribution) of the researcher for the i-th research output.

For example, if a researcher receives a provincial third prize, the researcher's score for that output is the output score assigned to a provincial third prize multiplied by the score weight determined by the researcher's contribution to that research result.
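As a non-limiting illustration, the accumulation of weighted output scores described above may be sketched as follows. The class name `Output`, the function name `influence_score`, and all numeric scores and weights are hypothetical; the embodiment does not prescribe concrete values.

```python
from dataclasses import dataclass


@dataclass
class Output:
    """One research output credited to a researcher."""
    score: float   # S_i: output score of the final-level index (hypothetical values)
    weight: float  # W_i: score weight reflecting the researcher's contribution


def influence_score(outputs):
    """Return P, the sum of S_i * W_i over all outputs of one researcher."""
    return sum(o.score * o.weight for o in outputs)


# Hypothetical example: a provincial third prize scored 30 with weight 0.5,
# plus one paper scored 10 with weight 1.0.
outputs = [Output(score=30.0, weight=0.5), Output(score=10.0, weight=1.0)]
p = influence_score(outputs)
```

Keeping the score and the weight as separate fields mirrors the model: the score depends only on the output type and level, while the weight depends only on the researcher's contribution.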

S4: screen the data. Researchers are screened to ensure the quantity and quality of the analysis data in the researcher similarity network. Researchers whose influence in the field is above a threshold are used as the objects for generating and displaying the similarity map. According to Price's law, researchers whose influence in the field is lower than N are filtered out, where N is calculated as follows:

where N_max is the highest influence value among researchers in the field. Once N has been calculated, researchers in the field whose influence is greater than or equal to N are retained as researchers in the data source for similar-community analysis.

Screening may further be carried out based on the number of publications of each researcher. According to Price's law, researchers whose publication count is lower than M are further filtered out, where M is calculated as follows:

where M_max is the highest publication count of any researcher in the citation relationship data. Once M has been calculated, authors in the field whose publication count is greater than or equal to M are retained as researchers in the data source for similar-community analysis.

Screening may further be carried out based on the number of times the documents published by each researcher have been cited. According to Price's law, researchers whose total citation count is lower than R are further filtered out, where R is calculated as follows:

where R_max is the highest total citation count of any single author in the citation relationship data. That is, for each author in the citation relationship data, the citation counts of all documents published by that author are summed, and the largest of these sums is R_max. Once R has been calculated, authors in the field whose documents have been cited a total of R times or more are retained as researchers in the data source for similar-community analysis.
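As a non-limiting illustration, the three screening passes may be sketched as follows. The formulas for N, M and R are not reproduced in this text, so the sketch assumes the commonly cited form of Price's law, threshold = 0.749 × sqrt(maximum value). The coefficient, the function names, the field names and the sample data are all assumptions, as is recomputing the maximum over the authors that remain after each pass.

```python
import math

PRICE_COEFFICIENT = 0.749  # assumption: constant from the common form of Price's law


def price_threshold(max_value):
    """Threshold derived from the highest observed value (assumed formula)."""
    return PRICE_COEFFICIENT * math.sqrt(max_value)


def screen(authors, key):
    """Keep authors whose value for `key` is >= the Price threshold."""
    if not authors:
        return {}
    cutoff = price_threshold(max(a[key] for a in authors.values()))
    return {name: a for name, a in authors.items() if a[key] >= cutoff}


def screen_all(authors):
    """Apply the influence (N), publication (M) and citation (R) passes in turn."""
    for key in ("influence", "papers", "citations"):
        authors = screen(authors, key)
    return authors


# Hypothetical sample data.
authors = {
    "A": {"influence": 100, "papers": 16, "citations": 400},
    "B": {"influence": 9, "papers": 9, "citations": 100},
    "C": {"influence": 4, "papers": 2, "citations": 3},
    "D": {"influence": 50, "papers": 2, "citations": 50},
}
kept = screen_all(authors)
```

In this sample, author C falls below the influence threshold and author D falls below the publication threshold, so only A and B remain.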

S5: generate the researcher co-citation relationship network. The researcher similarity network takes authors as network nodes and the similarity between authors as edge weights. Constructing the similar-community network comprises author similarity calculation, screening of strong similarity relationships, construction of the research-interest similarity network, and community identification. Similarity relationships are constructed from the data of the researchers who remain after screening.

The similarity calculation adopted by the invention uses the co-citation relationship between authors: when papers published by two researchers are often cited together by papers published by other researchers, the two researchers are considered similar. The number of times the papers of any two researchers are co-cited is taken as their similarity, and a researcher co-citation matrix is constructed from these values.

For example, if papers P1 and P2 each cite papers by both author A and author B, then authors A and B have a co-citation relationship with a similarity value of 2. Calculating the similarity value for every pair of core authors yields a similarity matrix between authors. Each entry in the matrix is the similarity (i.e., the co-citation count) between the researchers of the corresponding row and column.
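As a non-limiting illustration, the co-citation count may be computed as sketched below. The function name `cocitation_counts` and the two input mappings are hypothetical. The sketch assumes that an author pair is counted once per citing paper, however many of their papers that citing paper references, and that co-authors of a single cited paper are also counted as co-cited.

```python
from collections import Counter
from itertools import combinations


def cocitation_counts(references, authors_of):
    """Count, for each author pair, the citing papers that cite both authors.

    references: citing paper id -> set of cited paper ids
    authors_of: cited paper id -> set of author names
    """
    counts = Counter()
    for cited_papers in references.values():
        cited_authors = set()
        for paper in cited_papers:
            cited_authors |= authors_of.get(paper, set())
        # Sorting gives each pair a canonical (a, b) order with a < b.
        for a, b in combinations(sorted(cited_authors), 2):
            counts[(a, b)] += 1
    return counts


# The example from the description: P1 and P2 both cite papers by A and by B.
references = {"P1": {"X", "Y"}, "P2": {"X", "Y"}, "P3": {"X"}}
authors_of = {"X": {"A"}, "Y": {"B"}}
counts = cocitation_counts(references, authors_of)
```

The resulting counter is a sparse form of the similarity matrix: `counts[("A", "B")]` is the entry for authors A and B, and absent pairs have a count of 0.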

In the researcher co-citation relationship network, if similarity relationships are kept even between two authors who have been co-cited only a few times, for example only once, the generated similarity network becomes overly dense. As a further screening step, relationships with a low co-citation count may therefore be removed from the network. For example, only relationships whose co-citation count is greater than a set value N are retained, and the network is rebuilt from the retained relationships to form the researcher co-citation relationship network.

The researcher co-citation relationship network may further be filtered by time: a weight is applied to co-citation relationships in which the papers of the two researchers were co-cited earlier than a set time, so as to reduce the effect of early co-citations. Because the research direction and field of a researcher may change over time, early co-citation data are less meaningful and less valuable for the similar-community map. One or more levels of weights may therefore be set. For example, co-citations of two researchers within the last 2 years have a weight of 1, co-citations from 2 to 5 years ago have a weight of 0.8, and co-citations more than 5 years old have a weight of 0.5.

For example, if two researchers were co-cited 4 times 6 years ago, their similarity before time-based filtering is 4; after weighting, these 4 co-citations contribute only 2 (4 × 0.5) to the similarity between the two researchers. Likewise, a single co-citation 3 years ago contributes only 0.8 after weighting.
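As a non-limiting illustration, the time weighting and the removal of weak relationships may be sketched as follows. The weight levels are those of the example above; the function names, the treatment of co-citations exactly 2 or 5 years old, and the choice to prune after weighting are assumptions.

```python
def time_weight(age_years):
    """Weight of one co-citation by its age in years (levels from the example)."""
    if age_years <= 2:
        return 1.0
    if age_years <= 5:
        return 0.8
    return 0.5


def weighted_similarity(cocitation_ages):
    """Sum the time weights of all co-citations of one author pair."""
    return sum(time_weight(age) for age in cocitation_ages)


def prune(similarities, min_value):
    """Keep only relationships whose similarity is greater than min_value."""
    return {pair: s for pair, s in similarities.items() if s > min_value}


# Four co-citations 6 years ago contribute 2.0; one 3 years ago contributes 0.8.
similarities = {
    ("A", "B"): weighted_similarity([6, 6, 6, 6]),
    ("A", "C"): weighted_similarity([3]),
}
strong = prune(similarities, min_value=1)
```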

S6: community discovery. Community discovery is performed with the Louvain algorithm on the finally obtained researcher similarity network (co-citation data) and the researcher influence data. The Louvain algorithm is a modularity-based community discovery algorithm that can find hierarchical community structure, and can therefore identify, to the greatest extent, the communities to which researchers in the tobacco research field belong (groups of researchers with co-citation relationships and similar research directions and interests). Applied to the researcher co-citation relationship network, the Louvain algorithm comprises the following steps:

1) Stage 1: traverse the nodes in the network and move nodes between communities. For each node A in the network, tentatively add node A in turn to the community of each of its neighbor nodes, and calculate the modularity change ΔQ before and after the move. If the largest ΔQ is greater than 0, move node A into the community of the neighbor node that yields the largest ΔQ; otherwise, leave the community assignment of node A unchanged;

repeat the node-moving step until the community assignment of no node changes, at which point node moving between communities is complete and stage 1 ends;

2) Stage 2: rebuild the graph. Each community formed at the end of stage 1 is collapsed into a new node. The weight of the edge between two new nodes is the sum of the weights of the edges between the two original communities, and the weight of a new node's self-loop is the sum of the weights of the edges between the nodes inside the original community;

iterate stage 1 and stage 2 on the rebuilt graph until the modularity of the whole graph no longer changes, and store the community information of each node;

determine the final community assignment of each node from the community information of each node, completing community discovery.
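As a non-limiting illustration, the two stages above may be sketched in plain Python as follows. The function names and the sample graph are hypothetical. The sketch uses edge weights only, since the description does not specify how the influence data enter the algorithm, and it assumes an undirected graph with at least one edge.

```python
from collections import defaultdict


def local_moving(adj):
    """Stage 1. adj[u][v] is the weight of undirected edge u-v, stored in
    both directions; adj[u][u], if present, is a self-loop weight."""
    # Total edge weight m: ordinary edges are stored twice, self-loops once.
    m = sum(w for u in adj for v, w in adj[u].items() if u != v) / 2
    m += sum(adj[u].get(u, 0.0) for u in adj)
    # Weighted degree; a self-loop counts twice.
    degree = {u: sum(w for v, w in adj[u].items() if v != u)
              + 2 * adj[u].get(u, 0.0) for u in adj}
    comm = {u: u for u in adj}   # every node starts in its own community
    tot = dict(degree)           # sum of degrees per community
    improved = True
    while improved:
        improved = False
        for u in adj:
            cu = comm[u]
            links = defaultdict(float)   # weight from u to each neighboring community
            for v, w in adj[u].items():
                if v != u:
                    links[comm[v]] += w
            tot[cu] -= degree[u]         # take u out of its community
            # Each gain equals m times the modularity change of inserting u.
            best, best_gain = cu, links[cu] - tot[cu] * degree[u] / (2 * m)
            for c, k_in in links.items():
                gain = k_in - tot[c] * degree[u] / (2 * m)
                if gain > best_gain + 1e-12:
                    best, best_gain = c, gain
            tot[best] += degree[u]
            if best != cu:
                comm[u] = best
                improved = True
    return comm


def aggregate(adj, comm):
    """Stage 2: collapse each community into one node."""
    new = defaultdict(lambda: defaultdict(float))
    for u in adj:
        row = new[comm[u]]               # also keeps isolated communities
        for v, w in adj[u].items():
            if u == v:
                row[comm[u]] += w        # existing self-loop
            elif comm[u] == comm[v]:
                row[comm[u]] += w / 2    # internal edge, seen from both ends
            else:
                row[comm[v]] += w        # edge between two communities
    return {c: dict(nb) for c, nb in new.items()}


def louvain(adj):
    """Alternate stage 1 and stage 2 until no node moves."""
    membership = {u: u for u in adj}
    graph = adj
    while True:
        comm = local_moving(graph)
        if len(set(comm.values())) == len(graph):   # no node moved
            return membership
        membership = {u: comm[c] for u, c in membership.items()}
        graph = aggregate(graph, comm)


# Hypothetical graph: two triangles joined by the single edge c-d.
edges = [("a", "b"), ("a", "c"), ("b", "c"), ("d", "e"), ("d", "f"), ("e", "f"), ("c", "d")]
adj = defaultdict(dict)
for u, v in edges:
    adj[u][v] = adj[v][u] = 1.0
communities = louvain(dict(adj))
```

Because stage 1 always starts from one community per node, a pass in which the number of communities equals the number of nodes means that no node moved, which is the stopping condition used in `louvain`.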

S7: visualize the similar communities of researchers in the tobacco field. The similar-community visualization is implemented with the Vis.js visualization framework. The similar-community JSON data are loaded through the vis.Network method in Vis.js, and the JSON data are organized as follows:

where nodes represents the researchers (id is the unique identifier of the researcher, name is the researcher's name, influence is the researcher's influence, group is the group number of the community, and size is the size of the node in the community), and edges represents the relationship edges (from is the unique identifier of the start node of the edge, title is the number of co-citations between the two nodes, to is the unique identifier of the end node of the edge, and width is the width of the edge).
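As a non-limiting illustration, JSON data with the fields described above may be assembled as sketched below. The structure listing itself is not reproduced in this text, so only the field names come from the description; the function name `build_graph_json`, the input record layout, and the scaling used for size and width are assumptions.

```python
import json


def build_graph_json(researchers, cocitations):
    """Serialize researchers and co-citation edges with the described fields.

    researchers: list of dicts with keys id, name, influence, community
    cocitations: dict mapping (from_id, to_id) -> co-citation count
    """
    nodes = [
        {
            "id": r["id"],
            "name": r["name"],
            "influence": r["influence"],
            "group": r["community"],
            # Hypothetical scaling: node size grows with influence.
            "size": 10 + r["influence"] ** 0.5,
        }
        for r in researchers
    ]
    edges = [
        {
            "from": source,
            "to": target,
            "title": str(count),   # co-citation count between the two nodes
            "width": count,        # hypothetical: width equals the count
        }
        for (source, target), count in cocitations.items()
    ]
    return json.dumps({"nodes": nodes, "edges": edges}, ensure_ascii=False)


# Hypothetical sample data.
researchers = [
    {"id": 1, "name": "A", "influence": 25.0, "community": 0},
    {"id": 2, "name": "B", "influence": 16.0, "community": 0},
]
graph_json = build_graph_json(researchers, {(1, 2): 2})
```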

The similar-community visualization map of the tobacco research field finally generated by the method is shown in fig. 3. Each dot in the map represents a researcher, and the size of the dot represents the researcher's influence in the tobacco field. The width of the line connecting two dots reflects the number of times the papers of the two authors are co-cited by other papers, i.e., their similarity. Authors with high similarity cluster together to form a research community, and dots of the same color (gray level) belong to the same community. The co-citation relationship shows that the more often two authors are co-cited, the closer their academic relationship is. Authors who are often cited together are related in the concepts, theories or methods of their academic research topics.

The width of a relationship edge in the similar-community map of the tobacco research field indicates how similar the research interests of the two authors are. By analyzing, in a community-clustering manner, the co-citation frequency (a time-series parameter) of a group of related authors in the same community, prominent links between authors can be revealed, as can the research fields that the authors represent individually or jointly. This assists decision-making in recommending potential collaborators and in studying frontier information in tobacco research.

The research topic of each community may also be identified. For example, the keywords of the papers published by the researchers in a similar community are extracted and their word frequencies are counted, and the N keywords with the highest word frequency (for example, the top 3) are taken as the research topics of that community. Research topic labels are added to the corresponding communities in the map, which improves the readability of the map and helps readers identify the research topic of each community.
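As a non-limiting illustration, the topic labels may be derived by word-frequency counting as sketched below. The function name `community_topics`, the input mappings and the sample keywords are hypothetical.

```python
from collections import Counter


def community_topics(keywords_by_author, community_of, top_n=3):
    """Return the top_n most frequent paper keywords for each community.

    keywords_by_author: author -> list of keywords from that author's papers
    community_of: author -> community identifier
    """
    counts = {}
    for author, keywords in keywords_by_author.items():
        community = community_of.get(author)
        if community is None:
            continue  # author was screened out or not assigned to a community
        counts.setdefault(community, Counter()).update(keywords)
    return {
        community: [keyword for keyword, _ in counter.most_common(top_n)]
        for community, counter in counts.items()
    }


# Hypothetical sample data.
keywords_by_author = {
    "A": ["nicotine", "aroma", "nicotine"],
    "B": ["nicotine", "aroma", "filter"],
    "C": ["leaf curing"],
}
community_of = {"A": 0, "B": 0, "C": 1}
topics = community_topics(keywords_by_author, community_of, top_n=2)
```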

The embodiment of the device is as follows:

The processor can obtain, through the data interface, document data from the tobacco field or from other technical fields, including the cited documents, authors and other information of each document, and executes the program stored in the memory to implement the above method for constructing similar communities of researchers.

Claims

1. A method for constructing similar communities of scientific research personnel is characterized by comprising the following steps:

5) generating a similar community map of researchers by applying a community discovery algorithm based on the author co-citation relationship network and the author influence.

2. The method for constructing the researcher similar community according to claim 1, wherein the influence set value N is:

3. The method for constructing similar communities of researchers according to claim 1 or 2, wherein in step 3), the screening further comprises deleting, from the citation relationship data, authors whose number of published research documents is lower than a set publication-count value.

4. The method for constructing similar communities of researchers according to claim 3, wherein the set publication-count value M is:

where M_max is the highest publication count of any author in the citation relationship data.

5. The method for constructing similar communities of researchers according to claim 4, wherein in step 3), the screening further comprises deleting, from the citation relationship data, authors whose published research documents have been cited fewer times than a set citation-count value.

6. The method for constructing similar communities of researchers according to claim 5, wherein the set citation-count value R is:

7. The method as claimed in claim 6, wherein the author influence model is determined by at least the number of scientific outputs of the author, and the scientific outputs include the following categories: articles and patents.

8. The method for constructing similar communities of researchers according to claim 7, wherein the author influence model is:

$$P = \sum_{i=1}^{n} S_i W_i$$

where P is the influence, n is the number of research outputs of the author, S_i is the set score of the category to which the author's i-th research output belongs, and W_i is the set weight of the author's i-th research output within that category of research outputs.

9. The method for constructing similar communities of researchers according to claim 8, wherein in step 4), when the co-citation relationship network is generated, the co-citation count between two researchers whose co-citation count is lower than a set value is recorded as 0.

10. The method for constructing similar communities of researchers according to claim 9, wherein in step 4), when the co-citation relationship network is generated, for a research document earlier than a set time, each co-citation that the document produces between the authors of any two research documents it cites is recorded in the co-citation relationship network as a, where 0 < a < 1.

11. The method for constructing similar communities of researchers according to claim 10, wherein a topic label is added to each community in the similar community map of researchers, the topic label being a set number of paper keywords with the highest word frequency in the papers published by the authors who are nodes of the corresponding community.

12. A device for constructing a similar community of researchers, comprising a processor and a memory, wherein the processor executes instructions stored in the memory to implement the method for constructing a similar community of researchers as claimed in any one of claims 1 to 11.