CN110362692A

CN110362692A - A method for constructing an academic circle based on a knowledge graph

Info

Publication number: CN110362692A
Application number: CN201910668329.8A
Authority: CN
Inventors: 龙军; 魏志; 黄文体; 唐柳
Original assignee: Central South University
Current assignee: Central South University
Priority date: 2019-07-23
Filing date: 2019-07-23
Publication date: 2019-10-22

Abstract

The invention discloses a method for constructing an academic circle based on a knowledge graph, comprising the following steps. Step 1: obtain all academic paper information and all academic journal information as the initial data source. Step 2: extract the information of three entity types (author, paper, and journal) from the initial data source to form an entity data set. Step 3: perform name disambiguation on the same-name author entities in the entity data set, based on the similarity between them. Step 4: store the entity data set obtained after name disambiguation in a Neo4j graph database to form entity nodes, and establish relationship edges between different entity nodes based on the attribute features they share, finally obtaining the academic circle based on the knowledge graph. The academic circle constructed by the invention carries logical relationships and has high data accuracy, which helps users quickly and effectively obtain the knowledge they need and the logical relationships within that knowledge.

Description

A method for constructing an academic circle based on a knowledge graph

Technical field

The present invention relates to the technical field of academic social networks, and in particular to a method for constructing an academic circle based on a knowledge graph.

Background art

With the development of computer network technology, academic social networks have rapidly become platform-based and networked, giving scholars a good venue for academic exchange. Well-known academic social networks at home and abroad currently include ResearchGate, Academia, ScienceNet, and Xiaomuchong. As academic activity and scientific research develop, new scholars and researchers join every day, so the number of scholars grows sharply and the types of scholar users diversify. A good academic social network will therefore become an important platform on which scholars in every field discuss and exchange scientific results. On an academic social network, researchers can collaborate, take part in peer review, share their research, and even share data. Such networks have consequently won the favor of many scholars, especially young ones, and can be said to be gradually changing the way research is done. Their research value has drawn close attention: researchers have studied academic social networks extensively and found that they play a positive role in promoting scientific exchange and cooperation and in carrying out alternative metrics.

As early as 2000, academia abroad began trying to set up professional social networks specifically for researchers, such as SciLinks, Scientist Solutions, and Nature Network, which provided basic services for online exchange among researchers. As general-purpose social networks kept developing, well-known social networking sites such as Facebook and Twitter also began trying to build academic exchange platforms for researchers, but some scholars questioned how professional their academic services were. It was not until 2008 that online academic exchange platforms represented by ResearchGate and Mendeley appeared abroad. They combined the ideas of open access and social networking: they not only help researchers find scholars in the same field and provide them with online services, but also give researchers a channel to a large body of valuable knowledge resources. A number of websites with similar functions then appeared in China; representative ones include Scholar Net, Phegda Academic Circle, Baidu Scholar, ScienceNet, and CNKI Scholar Circle. These websites, dedicated to promoting academic exchange and cooperation, have driven the rise and development of academic social networks. An academic social network is a service or platform whose purpose is to promote the exchange and diffusion of knowledge; it helps researchers establish and maintain their networks of personal relationships and supports them in carrying out activities in the course of their research.

Academic social networks currently have the following problem: existing academic social networks offer their users a good collaboration platform, yet very few cooperative relationships are actually established on them. The reason is that existing academic social networks provide scholars with many groups, and scholars join groups on different disciplines and topics according to their own professional background and interests. As a result, most groups are made up of members from different disciplinary backgrounds and overlap noticeably, the existing academic information is stored in a scattered way, and academic circle data built on such scattered information is inaccurate.

Summary of the invention

To address the problems that existing academic information data is stored in a scattered way and that the academic circle data built from it is inaccurate, the present invention proposes a method for constructing an academic circle based on a knowledge graph, which can improve the accuracy of academic circle data.

To achieve the above technical purpose, the present invention adopts the following technical solution:

A method for constructing an academic circle based on a knowledge graph, comprising the following steps:

Step 1, obtain all academic paper information and all academic journal information, and use them as the initial data source;

Step 2, extract the entity information of the preselected entity types from the initial data source to form an entity data set; the preselected entity types include author, paper, and journal;

Step 3, perform name disambiguation on the same-name author entities in the entity data set, based on the similarity between them;

Step 4, store the entity data set obtained after name disambiguation in a Neo4j graph database to form entity nodes; based on the attribute features shared by different entities, establish relationship edges between different entity nodes, finally obtaining the academic circle based on the knowledge graph.

The present invention extracts entities of three types (author, paper, and journal) from the initial data source and uses a Neo4j graph database to build entity nodes; it then uses the attribute features shared by different entities to establish relationship edges between the entity nodes in the Neo4j graph database, obtaining the academic circle based on the knowledge graph. In effect, the three different types of entities and the relationships between them are connected in a single relationship network that forms an interrelated academic circle. Through this academic circle with logical relationships, users can quickly and effectively obtain the knowledge they need and the logical relationships within it, and can gain a comprehensive view of a related field. The academic circle helps users find potential collaborators accurately and effectively, and can also support decisions such as the selection of experts for science and technology evaluation.

Meanwhile, extracting entities amounts to discarding the invalid information in the initial data source and retaining the valid information from which entities of each type are built. This improves the validity of the entity data and, in turn, the accuracy of the constructed academic circle data.

Moreover, performing name disambiguation on the same-name author entities in the entity data set further improves the accuracy of the entity data, and thus the accuracy of the academic circle data.

Further, the detailed process of step 3 is:

Step 3.1, represent each author entity as a feature vector composed of its attribute values;

Step 3.2, take all same-name author entities, compute the similarity between every two of them, and compare it with a similarity threshold; take the maximum similarity value that is greater than the similarity threshold, and cluster the two same-name author entities corresponding to that maximum value into one cluster, obtaining an author entity set;

Wherein, the similarity between any two same-name author entities is calculated as:

$$S_{ij} = sim\_attr(a_i, a_j)$$

where $S_{ij}$ denotes the similarity between two same-name author entities $a_i$ and $a_j$, and $sim\_attr(\cdot)$ denotes the similarity calculation function;

Step 3.3, take any other author entity with the same name as the author entity set obtained in the previous step; if its similarity to any author entity in that set is greater than the similarity threshold, add it to the author entity set obtained in the previous step;

Step 3.4, process the remaining same-name author entities again by steps 3.2 to 3.3, until every same-name author entity has been matched to its corresponding author entity set;

Step 3.5, merge all author entities in the same author entity set into a single author entity and assign an author id to it; author entities from different author entity sets are assigned different author ids.
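Steps 3.2 to 3.5 can be sketched in Python as follows. This is only a minimal sketch under stated assumptions: the source does not define the similarity function, so `sim_attr` below is a hypothetical stand-in (mean Jaccard overlap over research fields, affiliations, and co-authors), and the record layout, the function names, and the repeated pass in step 3.3 are illustrative choices rather than the method as claimed.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard overlap of two collections; 0.0 when both are empty."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0


def sim_attr(x, y):
    """Hypothetical similarity: mean Jaccard overlap over three attributes."""
    keys = ("fields", "affiliations", "coauthors")
    return sum(jaccard(x[k], y[k]) for k in keys) / len(keys)


def disambiguate(records, threshold):
    """Group same-name author records into clusters of record indices."""
    remaining = list(range(len(records)))
    clusters = []
    while remaining:
        # Step 3.2: find the most similar pair among the remaining records.
        best = max(
            combinations(remaining, 2),
            key=lambda p: sim_attr(records[p[0]], records[p[1]]),
            default=None,
        )
        if best is None or sim_attr(records[best[0]], records[best[1]]) <= threshold:
            # No pair exceeds the threshold: each record is a distinct author.
            clusters.extend([i] for i in remaining)
            break
        cluster = list(best)
        remaining = [i for i in remaining if i not in cluster]
        # Step 3.3: absorb records similar to any member of the cluster.
        grew = True
        while grew:
            grew = False
            for i in list(remaining):
                if any(sim_attr(records[i], records[m]) > threshold for m in cluster):
                    cluster.append(i)
                    remaining.remove(i)
                    grew = True
        clusters.append(cluster)
        # Step 3.4: the outer loop repeats on whatever is left.
    return clusters


def assign_ids(clusters):
    """Step 3.5: one author id per cluster, distinct across clusters."""
    return {i: author_id for author_id, cluster in enumerate(clusters) for i in cluster}
```

Calling `assign_ids(disambiguate(records, threshold))` returns a mapping from record index to author id. The threshold value is not given in the source and would need to be chosen empirically.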

Further, the author entity is represented as a feature vector composed of the following attribute values: author name, research field, affiliation, and co-authors.

Further, the academic paper information is collected from the Web of Science bibliographic database using a web crawler, the academic journal information is collected from the LetPub website using a web crawler, and the academic paper information and the academic journal information are stored in separate files in the same csv format.

A web crawler is used to obtain academic paper information and academic journal information that are widely distributed and only loosely associated; entities are then built from this information, and entity relationships are established based on shared attributes. This simplifies the data architecture of the academic circle and makes it more usable.

Further, the detailed process in step 2 of extracting the entity information of the preselected entity types from the initial data source to form the entity data set is:

Step 2.1, import the initial data source into a database;

Step 2.2, extract data from the initial data source:

In the database, extract the following from the academic paper information of the initial data source: paper title, paper keywords, research field, authors, time, journal name, and journal id; extract the following from the academic journal information of the initial data source: journal name, journal id, impact factor, and partition;

Step 2.3, extract all paper entities, author entities, and journal entities from the data extracted in step 2.2 to form the entity data set;

Wherein the resulting paper entity has the attributes paper title, paper id, authors, time, journal name, and journal id; the resulting author entity has the attributes author name, co-authors, research field, and affiliation; the resulting journal entity has the attributes journal name, journal id, impact factor, and partition; the co-authors are obtained by extracting the corresponding author and first author of the paper when author entities are extracted from the academic paper information;

Each attribute of each entity is saved as a triple of the form (entity, attribute name, attribute value).

Further, the detailed process of step 4 is:

Step 4.1, export all entities in the entity data set from the database as csv files, then import them into the Neo4j graph database, where the entity corresponding to each id forms one entity node;

Step 4.2, using the attribute features shared by different entities, extract the relationships between them: different author entities are linked by a cooperation relationship, author entities and paper entities by a publication relationship, and journal entities and paper entities by an inclusion relationship;

Step 4.3, in the Neo4j graph database, connect the entity nodes that are related to one another with edges of the corresponding relationship type.
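The relationship extraction of step 4.2 can be illustrated independently of any database. The sketch below assumes each paper record carries a paper id, a journal id, and a list of disambiguated author ids; the field names and the relationship labels (`INCLUDES`, `PUBLISHED`, `COOPERATES_WITH`) are hypothetical names chosen for illustration.

```python
from itertools import combinations


def derive_edges(papers):
    """Return (source, relation, target) edges derived from shared attributes."""
    edges = set()
    for paper in papers:
        # Inclusion: the journal and the paper share the journal id.
        edges.add((paper["journal_id"], "INCLUDES", paper["paper_id"]))
        # Publication: each listed author published the paper.
        for author in paper["author_ids"]:
            edges.add((author, "PUBLISHED", paper["paper_id"]))
        # Cooperation: every pair of authors on the same paper.
        for a, b in combinations(sorted(paper["author_ids"]), 2):
            edges.add((a, "COOPERATES_WITH", b))
    return edges
```

Sorting the author ids before pairing stores each cooperation edge once, in a canonical order, which is one reasonable design choice for an undirected relationship.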

Beneficial effects

This solution extracts entities of three types (author, paper, and journal) from the initial data source and uses a Neo4j graph database to build entity nodes; it then uses the attribute features shared by different entities to establish relationship edges between the entity nodes in the Neo4j graph database, obtaining the academic circle based on the knowledge graph. In effect, the three different types of entities and the relationships between them are connected in a single relationship network that forms an interrelated academic circle. Through this academic circle with logical relationships, users can quickly and effectively obtain the knowledge they need and the logical relationships within it, and can gain a comprehensive view of a related field. The academic circle helps users find potential collaborators accurately and effectively, and can also support decisions such as the selection of experts for science and technology evaluation.

Brief description of the drawings

Fig. 1 is a flow chart of the method of the present invention.

Specific embodiments

The embodiments of the present invention are described in detail below. This embodiment is carried out on the basis of the technical solution of the present invention, and gives a detailed implementation and specific operating procedure to further explain the technical solution.

The method for constructing an academic circle based on a knowledge graph provided by the invention extracts data, extracts entities, and establishes entity relationships, connecting the three different types of entities (paper, author, and journal) and the relationships between them in a single relationship network that forms an interrelated academic circle. Through this academic circle with logical relationships, users can quickly and effectively obtain the knowledge they need and the logical relationships within it, and can gain a comprehensive view of a related field.

As shown in Fig. 1, the method of the present invention for constructing an academic circle based on a knowledge graph comprises the following steps:

Step 1, obtain all academic paper information and all academic journal information as the initial data source;

To ensure that the data is authentic, this embodiment uses a web crawler to crawl the Web of Science bibliographic database for academic paper information and the LetPub website for academic journal information, and stores the academic paper information and the academic journal information in separate Excel tables.

Academic paper information includes the paper title, authors, journal name, research field, and so on. While crawling academic paper information, the paper txt file is read; if the file is read without problems, processing continues, and if anything is missing, the paper file is read again. The Web of Science bibliographic database only supports downloading 500 records at a time, so the download has to be repeated once for every 500 records. Each download, on clicking export, yields a list of academic paper information in csv format; the crawled data is written to a csv file, with tab-delimited key fields selected and the time field separated and refreshed. The crawled data is then placed in an Excel table, in which each row holds the information for one academic paper. The specific field information is analyzed, and fields that hold multiple values in one column are split, giving the final Excel file corresponding to the academic paper information.
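The 500-record download limit means the export has to be issued in consecutive batches. A minimal sketch of how the record ranges for such a loop might be computed is given below; the function name and the 1-based inclusive range convention are assumptions for illustration, and the actual export request is not shown because the source does not describe it.

```python
def batch_ranges(total, batch_size=500):
    """Yield inclusive 1-based (start, end) ranges that cover `total` records."""
    for start in range(1, total + 1, batch_size):
        yield start, min(start + batch_size - 1, total)


# Example: 1200 records need three export requests.
for first, last in batch_ranges(1200):
    print(f"export records {first} to {last}")
```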

Academic journal information includes the journal name, impact factor, partition, and so on, where the impact factor and the partition are indicators of a journal's level. The academic journal information is crawled and stored in the same way as the academic paper information, so the details are not repeated here.

Step 2, extract the entity information of the preselected entity types from the initial data source to form an entity data set;

In the initial data source, which is very large, much of the information has no practical value. Building the academic circle from it directly would not only mean a large construction workload, but the resulting clutter of academic circle data would also hamper use. The present invention therefore preprocesses and cleans the data in a targeted way, removing unneeded data and keeping the important data. For example, data such as article type, language, and special issue is discarded, while useful information such as author name, research field, and paper keywords is kept.

Step 2.1, using the database's management software, import the academic paper information and the academic journal information in the Excel tables into the database;

Step 2.2, extract data from the initial data source:

In the database, extract the following from the academic paper information of the initial data source: paper title, paper id, research field, authors, author affiliation, time, journal name, and journal id; extract the following from the academic journal information of the initial data source: journal name, journal id, impact factor, and partition;

Each attribute of each entity is saved as a triple of the form entity - attribute name - attribute value. For example, Zhang San - affiliation - Central South University constitutes one sample (entity, attribute name, attribute value) triple.
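Saving attributes as triples can be sketched as follows. The entity key `author-001`, the attribute names, and the choice to emit one triple per value for multi-valued attributes are hypothetical and only serve to illustrate the (entity, attribute name, attribute value) layout.

```python
def to_triples(entity_key, attributes):
    """Flatten an attribute mapping into (entity, attribute name, value) triples."""
    triples = []
    for name, value in attributes.items():
        # A multi-valued attribute yields one triple per value.
        values = value if isinstance(value, (list, tuple, set)) else [value]
        triples.extend((entity_key, name, v) for v in values)
    return triples


print(to_triples("author-001", {"affiliation": "Central South University"}))
```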

Step 3, perform name disambiguation on the same-name author entities in the entity data set based on the similarity between them, and set author ids;

The present invention implements author disambiguation by converting it into a clustering problem.

Step 3.1, represent each author entity as a feature vector composed of author name, research field, affiliation, and co-authors;

Using the Word2Vec tool, the four attribute features of the author entity (author name, research field, affiliation, and co-authors) are each trained into a word vector, and each word vector is normalized to a decimal in (0, 1); the four normalized decimals then form the feature vector that represents the author entity;

Step 3.3, take any other author entity with the same name as the author entity set; if its similarity to any author entity in the set is greater than the similarity threshold, add it to the author entity set;

Step 3.4, process the remaining same-name author entities by steps 3.2 to 3.3, until every same-name author entity has been matched to its corresponding author entity set;

In particular, for the similarity between two same-name author entity sets, the present invention defines the similarity function as follows: take one author entity from each of the two sets, compute the similarity for every such pair, and use the maximum similarity value as the similarity between the two same-name author entity sets. The formula is:

$$S_{pq} = \max_{a_i \in c_p,\; a_j \in c_q} S_{ij}$$

where $S_{pq}$ denotes the similarity between two same-name author entity sets $c_p$ and $c_q$, and $a_i$ and $a_j$ denote author entities in $c_p$ and $c_q$ respectively.
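The set-to-set similarity described above takes the maximum over all cross-set pairs. A minimal Python sketch is shown below; the pairwise function `sim` is passed in as a parameter because the source does not spell it out, and the toy numeric similarity in the usage lines is purely illustrative.

```python
def set_similarity(cluster_p, cluster_q, sim):
    """Similarity of two author entity sets: the largest pairwise similarity.

    Raises ValueError if either set is empty, since no pair exists.
    """
    return max(sim(a, b) for a in cluster_p for b in cluster_q)


# Toy usage: entities are numbers, similarity is one minus their distance.
toy_sim = lambda a, b: 1 - abs(a - b)
score = set_similarity([0.1, 0.2], [0.9, 0.4], toy_sim)
```

Taking the maximum means two sets are considered similar as soon as any one pair of their members is similar, which matches the membership test used in step 3.3.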

Step 4, construct the academic circle based on the knowledge graph;

The entity data set obtained after name disambiguation is stored in a Neo4j graph database to form entity nodes; based on the attribute features shared by different entities, relationship edges are established between different entity nodes, finally obtaining the academic circle based on the knowledge graph. Specifically:

Step 4.1, export all entities in the entity data set from the database as csv files, then import them into the Neo4j graph database, where the entity corresponding to each id forms one entity node.

Neo4j is a high-performance NoSQL graph database that stores structured data in a network rather than in tables. It is an embedded, disk-based Java persistence engine with full transactional properties. Neo4j can also be regarded as a high-performance graph engine that has all the features of a mature database. With Neo4j the academic circle can be visualized, thereby constructing the knowledge graph of the academic circle, and relationships between entities can be established very conveniently.

The step of importing the csv files into the Neo4j graph database specifically uses Neo4j's built-in create statement to import the entity data from the csv files into the Neo4j graph database, so that each entity forms an entity node.

Step 4.2, using the attribute features shared by different entities, extract the relationships between them: different author entities are linked by a cooperation relationship, author entities and paper entities by a publication relationship, and journal entities and paper entities by an inclusion relationship.

The authors of the same paper are in a cooperation relationship with one another; a paper is included in a journal, giving the including and included relationship; and a paper and its authors are in a publication relationship. For example, the attributes of a paper entity include the journal name and journal id, so these attributes can be used to build the inclusion relationship between the paper entity and the corresponding journal entity. The specific entity relationships can be created using the where clause of Neo4j.

The above embodiments are preferred embodiments of the present application. Those of ordinary skill in the art may make various changes or improvements on this basis, and such changes or improvements, provided they do not depart from the general concept of the present application, shall fall within the scope of protection claimed by the present application.
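One way the create statement and the where clause might be combined is sketched below in Python, which only assembles Cypher text. The node labels, property names, csv file names, and relationship type are assumptions for illustration, since the source does not list the statements actually used; the generated text relies on the standard Cypher forms `LOAD CSV WITH HEADERS ... CREATE` and `MATCH ... WHERE ... CREATE`.

```python
def load_nodes_query(label, csv_name, properties):
    """Build a LOAD CSV statement that creates one node per csv row."""
    props = ", ".join(f"{p}: row.{p}" for p in properties)
    return (
        f"LOAD CSV WITH HEADERS FROM 'file:///{csv_name}' AS row "
        f"CREATE (:{label} {{{props}}})"
    )


def link_query(src_label, dst_label, key, rel_type):
    """Build a statement that links nodes whose shared property `key` matches."""
    return (
        f"MATCH (s:{src_label}), (d:{dst_label}) "
        f"WHERE s.{key} = d.{key} "
        f"CREATE (s)-[:{rel_type}]->(d)"
    )


print(load_nodes_query("Journal", "journals.csv", ["journal_id", "name"]))
print(link_query("Journal", "Paper", "journal_id", "INCLUDES"))
```

Because the identifiers are interpolated directly into the query text, this sketch is only suitable for fixed, trusted label and property names.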

Claims

1. A method for constructing an academic circle based on a knowledge graph, characterized by comprising the following steps:

Step 1, obtain all academic paper information and all academic journal information, and use them as the initial data source;

Step 2, extract the entity information of the preselected entity types from the initial data source to form an entity data set; the preselected entity types include author, paper, and journal;

Step 3, perform name disambiguation on the same-name author entities in the entity data set, based on the similarity between them;

Step 4, store the entity data set obtained after name disambiguation in a Neo4j graph database to form entity nodes; based on the attribute features shared by different entities, establish relationship edges between different entity nodes, finally obtaining the academic circle based on the knowledge graph.

2. The method according to claim 1, characterized in that the detailed process of step 3 is:

Step 3.1, represent each author entity as a feature vector composed of its attribute values;

Step 3.2, take all same-name author entities, compute the similarity between every two of them, and compare it with a similarity threshold; take the maximum similarity value that is greater than the similarity threshold, and cluster the two same-name author entities corresponding to that maximum value into one cluster, obtaining an author entity set;

Wherein, the similarity between any two same-name author entities is calculated as:

$$S_{ij} = sim\_attr(a_i, a_j)$$

where $S_{ij}$ denotes the similarity between two same-name author entities $a_i$ and $a_j$, and $sim\_attr(\cdot)$ denotes the similarity calculation function;

Step 3.3, take any other author entity with the same name as the author entity set obtained in the previous step; if its similarity to any author entity in that set is greater than the similarity threshold, add it to the author entity set obtained in the previous step;

Step 3.4, process the remaining same-name author entities again by steps 3.2 to 3.3, until every same-name author entity has been matched to its corresponding author entity set;

Step 3.5, merge all author entities in the same author entity set into a single author entity and assign an author id to it; author entities from different author entity sets are assigned different author ids.

3. The method according to claim 2, characterized in that the author entity is represented as a feature vector composed of the following attribute values: author name, research field, affiliation, and co-authors.

4. The method according to claim 1, characterized in that the academic paper information is collected from the Web of Science bibliographic database using a web crawler, the academic journal information is collected from the LetPub website using a web crawler, and the academic paper information and the academic journal information are stored in separate files in the same csv format.

5. The method according to claim 1, characterized in that the detailed process in step 2 of extracting the entity information of the preselected entity types from the initial data source to form the entity data set is:

Step 2.1, import the initial data source into a database;

Step 2.2, extract data from the initial data source:

In the database, extract the following from the academic paper information of the initial data source: paper title, paper keywords, research field, authors, time, journal name, and journal id; extract the following from the academic journal information of the initial data source: journal name, journal id, impact factor, and partition;

Step 2.3, extract all paper entities, author entities, and journal entities from the data extracted in step 2.2 to form the entity data set;

Wherein the resulting paper entity has the attributes paper title, paper id, authors, time, journal name, and journal id; the resulting author entity has the attributes author name, co-authors, research field, and affiliation; the resulting journal entity has the attributes journal name, journal id, impact factor, and partition; the co-authors are obtained by extracting the corresponding author and first author of the paper when author entities are extracted from the academic paper information;

6. The method according to claim 1, characterized in that the detailed process of step 4 is:

Step 4.1, export all entities in the entity data set from the database as csv files, then import them into the Neo4j graph database, where the entity corresponding to each id forms one entity node;

Step 4.2, using the attribute features shared by different entities, extract the relationships between them: different author entities are linked by a cooperation relationship, author entities and paper entities by a publication relationship, and journal entities and paper entities by an inclusion relationship;

Step 4.3, in the Neo4j graph database, connect the entity nodes that are related to one another with edges of the corresponding relationship type.