CN112052411A - Crawler-based academic search result visualization scheme design method - Google Patents

Publication number
CN112052411A
CN112052411A CN202010805474.9A CN202010805474A CN112052411A CN 112052411 A CN112052411 A CN 112052411A CN 202010805474 A CN202010805474 A CN 202010805474A CN 112052411 A CN112052411 A CN 112052411A
Authority
CN
China
Prior art keywords
documents
document
displaying
search
author
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010805474.9A
Other languages
Chinese (zh)
Inventor
盛斌
黄一帅
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing University of Aeronautics and Astronautics
Original Assignee
Nanjing University of Aeronautics and Astronautics
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
2020-08-12
Filing date
2020-08-12
Publication date
2020-12-08
Application filed by Nanjing University of Aeronautics and Astronautics
Priority to CN202010805474.9A
Publication of CN112052411A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/953 Querying, e.g. by the use of web search engines
    • G06F16/9538 Presentation of query results
    • G06F16/951 Indexing; Web crawling techniques
    • G06F16/9535 Search customisation based on user profiles and personalisation

Abstract

The invention discloses a crawler-based academic search result visualization scheme design method, which comprises the following steps: according to a search term input by a user, crawling documents related to the search term using web crawler technology, including each document's title, authors, publication time, publication venue, citation count, and the list of documents citing it, and storing them in a database; crawling the documents in each list of citing documents in the same way until all documents related to the search term have been crawled, and storing them in the database; deduplicating the documents in the database; and displaying the deduplicated documents using a visualization method. The invention uses data visualization technology to present the crawled documents and the relationships among them as charts, animations, and the like. By showing the chronological order and the citation relationships among the documents, it lets the user choose important academic documents to read more intuitively.

Description

Crawler-based academic search result visualization scheme design method
Technical Field
The invention relates to a crawler-based academic search result visualization scheme design method, and belongs to the technical field of data visualization.
Background
At present, many excellent academic search engines exist in China and abroad, such as Baidu Scholar, Sogou Scholar, Wanfang, and CNKI. Most of them present results in the traditional way, listing document information item by item, and additionally offer sorting by attributes such as publication time, citation count, and relevance. CNKI and Baidu Scholar also provide an advanced search function that narrows the search with user-specified conditions such as publication time, author, and publication type, so that more accurate results are obtained. The large engines also offer some visualization. For example, Baidu Scholar provides, for a single article, a trend chart of its annual citation count to show how it has been cited in recent years. An author's collaboration map can also be displayed on the scholar page. All of these engines provide a scrolling display of hot search terms that shows the currently popular research fields.
However, these mainstream academic search engines rarely visualize the paper search results themselves. Microsoft Academic once offered some visualization of academic search, but that engine has stopped serving in recent years.
At present, a number of tools that scholars can try for free are available in China and abroad for compiling statistics on papers and scholars and for constructing knowledge maps. Examples include CitNetExplorer and VOSviewer, which can identify high-quality documents; Bibliometrix, which lets the user choose the layout of the visualization interface; and Sci2, which can visualize information along dimensions such as time and geographic distribution. However, most of these tools are desktop software developed in Java or R, cannot connect directly to a search engine, and may rely on incomplete paper and scholar databases.
When current academic search engines display search results, most of them use an item-by-item list, the results are often sorted by a single index only, and the citation relationships among documents cannot be shown visually. To overcome this drawback, a new visualization scheme for search results needs to be designed.
Disclosure of Invention
The technical problem to be solved by the invention is as follows: to provide a crawler-based academic search result visualization scheme design method that uses crawler technology to crawl papers (including their attribute data) and scholar information, and uses data visualization technology to display the results and the relationships among them as charts, animations, and the like.
The invention adopts the following technical scheme for solving the technical problems:
A crawler-based academic search result visualization scheme design method comprises the following steps:
step 1, according to a search term input by a user, crawling documents related to the search term using web crawler technology, where the information crawled for each document includes: the document title, the authors, the publication time, the publication venue, the citation count, and the list of documents citing the document; and storing the documents in a database;
step 2, crawling the documents in each list of citing documents using web crawler technology until all documents related to the search term have been crawled, and storing them in the database;
step 3, deduplicating the documents in the database using a deduplication technique;
step 4, displaying the documents deduplicated in step 3 using a visualization method.
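The patent does not say which deduplication technique step 3 uses. The following is a minimal sketch of one possible approach, assuming that two records describe the same document when their normalized titles and publication years match. The field names (title, year, citations) and the rule of keeping the copy with the larger citation count are illustrative assumptions, not part of the source.

```python
import re
import unicodedata

def dedup_key(record):
    """Build a normalized key from title and year (an assumed matching rule)."""
    title = unicodedata.normalize("NFKC", record["title"]).lower()
    # Collapse punctuation and whitespace so minor formatting differences vanish.
    title = re.sub(r"[^\w]+", " ", title).strip()
    return (title, record.get("year"))

def deduplicate(records):
    """Keep one record per key, preferring the copy with more citations."""
    best = {}
    for rec in records:
        key = dedup_key(rec)
        if key not in best or rec.get("citations", 0) > best[key].get("citations", 0):
            best[key] = rec
    return list(best.values())

records = [
    {"title": "Graph Drawing: A Survey", "year": 2015, "citations": 10},
    {"title": "graph drawing - a survey", "year": 2015, "citations": 12},
    {"title": "Web Crawling", "year": 2010, "citations": 5},
]
unique = deduplicate(records)
```

A stricter rule (for example, also comparing author lists or a DOI when one is available) would reduce false merges at the cost of missing some duplicates.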
As a preferred embodiment of the present invention, the information crawled in step 2 for each document in the list of citing documents includes: the document title, the authors, the publication time, the publication venue, the citation count, and the list of documents citing that document.
As a preferred embodiment of the present invention, the specific process of step 4 is:
when the search term input by the user is a keyword, displaying the documents deduplicated in step 3 as a two-dimensional scatter plot, where each point on the scatter plot represents one document, the abscissa is the publication time of the document, and the ordinate is the citation count of the document;
displaying the citation relationships among the documents with a circular diagram, where each point on the scatter plot is assigned to an arc segment of the circular diagram and the citation relationships are shown as directed edges;
counting the publication venue of each document on the scatter plot and displaying the result with a circle packing diagram, where each circle represents one publication venue and the size of the circle depends on the number of documents published in that venue;
when the search term input by the user is an author name, displaying the documents deduplicated in step 3 as bubbles of different sizes, where each bubble represents one author and the size of the bubble depends on the citation count of that author's documents; counting the number of collaborations between authors and showing the collaboration relationships between authors with a force-directed graph;
displaying the academic level of an author with an author publication quality chart;
obtaining the member composition of a research team, counting the team's total number of published documents and total citation count, and displaying them with a circle packing diagram.
Compared with the prior art, the invention adopting the technical scheme has the following technical effects:
1. To address the shortcomings of current academic search engines in presenting results, the invention designs a multi-dimensional visualization scheme. Crawler technology is used to crawl papers (including their attribute data) and scholar information, and data visualization technology is used to display the results and the relationships among them as charts, animations, and the like. By showing the chronological order and the citation relationships among the documents, the scheme lets the user choose important academic documents to read more intuitively.
2. The invention also provides user-defined thresholds, so that the user can specify the standard for a highly cited paper, the number of results returned, and so on. By visualizing the information retrieved for the user, the scheme makes it easy to identify highly cited papers, active scholars, research teams, main publication venues, and similar information.
Drawings
FIG. 1 is a flow chart of a crawler-based academic search result visualization scheme design method of the present invention.
Detailed Description
Reference will now be made in detail to embodiments of the present invention, examples of which are illustrated in the accompanying drawings. The embodiments described below with reference to the accompanying drawings are illustrative only for the purpose of explaining the present invention, and are not to be construed as limiting the present invention.
Fig. 1 is a flow chart of the crawler-based academic search result visualization scheme design method of the present invention. First, according to the search term input by the user, a web crawler collects the information of the related documents, including authors, publication time, publication venue, citation count, and the list of documents citing each of them, and stores it in a database. The same information is then crawled for the documents in each list of citing documents, until all documents relevant to the search term have been crawled. The crawled documents are deduplicated. Visualization software is then used to implement the visualization functions and display the crawled data.
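The crawl described above can be sketched as a breadth-first traversal over lists of citing documents. This is an illustration under stated assumptions, not the patent's implementation: the fetch callable stands in for the actual crawler request and page parsing, the cited_by field name is hypothetical, and the max_documents cap is added here because the source gives no stopping rule other than exhausting the related documents.

```python
from collections import deque

def crawl_citing_documents(seed_ids, fetch, max_documents=1000):
    """Follow 'cited by' lists breadth-first, storing each document once."""
    database = {}
    queue = deque(seed_ids)
    while queue and len(database) < max_documents:
        doc_id = queue.popleft()
        if doc_id in database:
            continue  # already crawled via another citation path
        record = fetch(doc_id)  # hypothetical: returns a dict or None
        if record is None:
            continue
        database[doc_id] = record
        for citing_id in record["cited_by"]:
            if citing_id not in database:
                queue.append(citing_id)
    return database

# An in-memory stand-in for the academic search site.
FAKE_INDEX = {
    "A": {"title": "Paper A", "cited_by": ["B", "C"]},
    "B": {"title": "Paper B", "cited_by": ["C"]},
    "C": {"title": "Paper C", "cited_by": []},
}
db = crawl_citing_documents(["A"], FAKE_INDEX.get)
```

Here the seed list would be the documents returned for the user's search term. Skipping identifiers that are already stored avoids repeated requests, but records of the same paper found under different identifiers still need the separate deduplication step.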
According to the analysis of the requirements of scientific researchers, the invention designs the following visualization functions.
1. Paper retrieval map
The user enters a keyword to search, and the related documents obtained by crawling are displayed on screen as a two-dimensional scatter plot. Each point in the scatter plot represents a paper; the abscissa is the publication year of the paper and the ordinate is its citation count. The initial ranges of the publication-year axis and the citation-count axis are set dynamically from the extreme values of the displayed points. In addition, a user-defined setting is provided so that the axes adjust as points are added to or removed from the plot.
Points are colored according to document type, such as academic papers, journal papers, books, conference papers, and patent documents.
In terms of interactivity, when the user's mouse hovers over a point, the specific information of that paper is shown. Clicking a point opens the paper's academic page, where it can be downloaded.
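The data preparation for the paper retrieval map can be sketched as follows. The sketch only builds the point list and the axis ranges and leaves rendering to whatever charting tool is used. The type names, the color values, and the min_citations filter are assumptions chosen for illustration.

```python
# Colors per document type are illustrative assumptions.
TYPE_COLORS = {
    "journal": "tab:blue",
    "conference": "tab:orange",
    "book": "tab:green",
    "patent": "tab:red",
}

def scatter_points(papers, min_citations=0):
    """Map papers to plot points, keeping those at or above a citation floor."""
    return [
        {"x": p["year"], "y": p["citations"],
         "color": TYPE_COLORS.get(p["type"], "gray"), "label": p["title"]}
        for p in papers if p["citations"] >= min_citations
    ]

def axis_ranges(points):
    """Derive both axis ranges from the extreme values of the displayed points."""
    if not points:
        return None
    xs = [pt["x"] for pt in points]
    ys = [pt["y"] for pt in points]
    return (min(xs), max(xs)), (min(ys), max(ys))

papers = [
    {"title": "P1", "year": 2010, "citations": 50, "type": "journal"},
    {"title": "P2", "year": 2020, "citations": 3, "type": "conference"},
    {"title": "P3", "year": 2015, "citations": 20, "type": "thesis"},
]
all_points = scatter_points(papers)
filtered = scatter_points(papers, min_citations=10)
```

Recomputing axis_ranges after each filter change gives the behavior described above, where the axes follow the points currently shown.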
2. Scholar retrieval map
The user can search by a scholar's name, or enter a field keyword to find the important scholars in that research field. Scholars are represented by bubbles of different sizes, and the size of a bubble depends on the citation count of the scholar's papers. When the user clicks a bubble, more information about that scholar is shown.
To make the retrieval results more precise, a user-defined setting lets the user set lower limits on a scholar's number of publications and total citation count.
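A minimal sketch of the aggregation behind the scholar retrieval map is shown below. The source says only that bubble size depends on citation count. Scaling the radius with the square root of the count, so that bubble area is proportional to citations, is an assumption, as are the field and parameter names.

```python
import math
from collections import defaultdict

def scholar_bubbles(papers, min_papers=1, min_citations=0, scale=1.0):
    """Return {author: bubble radius} for authors above both user-defined floors."""
    stats = defaultdict(lambda: {"papers": 0, "citations": 0})
    for paper in papers:
        for author in paper["authors"]:
            stats[author]["papers"] += 1
            stats[author]["citations"] += paper["citations"]
    bubbles = {}
    for author, s in stats.items():
        if s["papers"] >= min_papers and s["citations"] >= min_citations:
            # Assumed scaling: area (not radius) proportional to citations.
            bubbles[author] = scale * math.sqrt(s["citations"])
    return bubbles

papers = [
    {"authors": ["Li", "Wang"], "citations": 16},
    {"authors": ["Li"], "citations": 9},
]
bubbles = scholar_bubbles(papers, min_papers=2)
```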
3. Paper citation relationship diagram
To show the citation relationships among papers more clearly, a circular diagram is used to display the citation dependencies between them. The papers in the scatter plot are assigned to arc segments of the circle, and citation relationships are shown as directed edges.
In terms of interactivity, when the user's mouse hovers over a point on the arc, the specific information of that paper is shown.
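The layout of the circular diagram can be sketched as below, assuming each paper gets an equal arc segment and is drawn at the start angle of its segment. The patent does not specify the segment sizes or ordering, so both are assumptions here, as are the id and cited_by field names.

```python
import math

def circular_layout(paper_ids, radius=1.0):
    """Place each paper at the start of its equal-sized arc segment."""
    n = len(paper_ids)
    positions = {}
    for i, pid in enumerate(paper_ids):
        angle = 2 * math.pi * i / n
        positions[pid] = (radius * math.cos(angle), radius * math.sin(angle))
    return positions

def citation_edges(papers):
    """Return directed edges (citing paper, cited paper) within the result set."""
    known = {p["id"] for p in papers}
    return [
        (citing, p["id"])
        for p in papers
        for citing in p["cited_by"]
        if citing in known  # ignore citing papers that are not displayed
    ]

papers = [
    {"id": "A", "cited_by": ["B", "C", "X"]},
    {"id": "B", "cited_by": ["C"]},
    {"id": "C", "cited_by": []},
]
positions = circular_layout([p["id"] for p in papers])
edges = citation_edges(papers)
```

Ordering the papers by publication year before calling circular_layout would make the direction of the edges easier to read, since citations then tend to point one way around the circle.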
4. Scholar collaboration relationship diagram
The scholar retrieval map cannot show the collaboration relationships among scholars. The scholar collaboration relationship diagram shows this information by counting the number of collaborations between scholars and drawing them as a force-directed graph.
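Counting collaborations can be sketched as counting co-authorship pairs, assuming that one jointly authored paper counts as one collaboration between each pair of its authors. The authors field name is illustrative.

```python
from collections import Counter
from itertools import combinations

def collaboration_counts(papers):
    """Count, for every unordered author pair, the papers they co-authored."""
    counts = Counter()
    for paper in papers:
        # Sort so that (a, b) and (b, a) map to the same key.
        for a, b in combinations(sorted(set(paper["authors"])), 2):
            counts[(a, b)] += 1
    return counts

papers = [
    {"authors": ["Li", "Wang", "Zhao"]},
    {"authors": ["Wang", "Li"]},
]
counts = collaboration_counts(papers)
```

The resulting pairs and counts can serve as weighted edges for a force-directed layout in a graph or charting library, with the count used as the edge weight.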
5. Publication source map
Newcomers to academia need to know which journals and conferences are the authoritative publication venues in their field. This design counts the publication venue of each paper and uses a circle packing diagram to show the important venues in the field covered by the search, where each circle represents one venue and the size of the circle depends on the number of papers in the search results that were published in that venue.
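The counting step behind the publication source map is small enough to sketch directly. The venue field name and the optional top_n limit are assumptions, and the circle packing itself is left to the visualization tool.

```python
from collections import Counter

def venue_counts(papers, top_n=None):
    """Count papers per venue, most frequent first; papers without a venue are skipped."""
    counts = Counter(p["venue"] for p in papers if p.get("venue"))
    return counts.most_common(top_n)

papers = [
    {"title": "P1", "venue": "TVCG"},
    {"title": "P2", "venue": "TVCG"},
    {"title": "P3", "venue": "CHI"},
    {"title": "P4", "venue": None},
]
circles = venue_counts(papers)
```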
6. Scholar publication quality chart
Neither the total number of papers nor the total citation count accurately reflects a scholar's academic level. Inspired by the h-index and the g-index, a scholar publication quality chart is provided to show a scholar's academic level. Because citation norms differ between fields, the design lets the user define a lower limit on citation count and counts, for each scholar, the number of papers that meet it as high-level papers.
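For reference, the two indices mentioned above and the user-threshold count can be computed as follows. The h-index and g-index follow their standard definitions, with the g-index capped at the number of papers. How the patent combines these values into the chart is not stated, so only the calculations are shown.

```python
def h_index(citations):
    """Largest h such that h papers have at least h citations each."""
    ranked = sorted(citations, reverse=True)
    h = 0
    for rank, c in enumerate(ranked, start=1):
        if c >= rank:
            h = rank
        else:
            break
    return h

def g_index(citations):
    """Largest g such that the top g papers together have at least g*g citations."""
    ranked = sorted(citations, reverse=True)
    total, g = 0, 0
    for rank, c in enumerate(ranked, start=1):
        total += c
        if total >= rank * rank:
            g = rank
    return g

def high_level_paper_count(citations, threshold):
    """Number of papers whose citation count meets the user-defined lower limit."""
    return sum(1 for c in citations if c >= threshold)

scholar_citations = [10, 8, 5, 4, 3]
```

For this sample list, the h-index is 4, the g-index is 5, and three papers meet a lower limit of 5 citations.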
7. Research team map
Collaboration circles among scholars are becoming ever wider, and the members of a research team may be spread all over the world. For researchers who want to quickly understand how a field has developed and what its pioneering results are, looking up the publication records of well-known research teams in that field can be a shortcut. The method sets a lower limit on the number of collaborations between scholars, clusters the scholars to obtain the member composition of each research team, counts each team's total number of results and total citation count, and displays this information with a circle packing diagram.
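The patent does not name a clustering algorithm. One simple reading, sketched here as an assumption, is to keep only author pairs whose collaboration count reaches the lower limit and to treat each connected component of the remaining graph as a team, found with a union-find structure. Counting a paper toward a team when at least one member is an author is also an assumption.

```python
def research_teams(collab_counts, min_collaborations=2):
    """Group authors linked by enough collaborations into teams (connected components)."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for (a, b), n in collab_counts.items():
        if n >= min_collaborations:
            parent[find(a)] = find(b)

    teams = {}
    for author in list(parent):
        teams.setdefault(find(author), set()).add(author)
    return list(teams.values())

def team_totals(team, papers):
    """Total papers and citations for papers with at least one team member as author."""
    team_papers = [p for p in papers if team & set(p["authors"])]
    return len(team_papers), sum(p["citations"] for p in team_papers)

collab = {("Li", "Wang"): 3, ("Wang", "Zhao"): 2, ("Chen", "Li"): 1}
papers = [
    {"authors": ["Li", "Wang"], "citations": 10},
    {"authors": ["Zhao"], "citations": 5},
    {"authors": ["Chen"], "citations": 7},
]
teams = research_teams(collab)
totals = team_totals(teams[0], papers)
```

Raising min_collaborations splits loosely connected groups into smaller teams, which corresponds to the lower limit on collaborations described above.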
The above embodiments are only for illustrating the technical idea of the present invention, and the protection scope of the present invention is not limited thereby, and any modifications made on the basis of the technical scheme according to the technical idea of the present invention fall within the protection scope of the present invention.

Claims (3)

1. A crawler-based academic search result visualization scheme design method, characterized by comprising the following steps:
step 1, according to a search term input by a user, crawling documents related to the search term using web crawler technology, where the information crawled for each document includes: the document title, the authors, the publication time, the publication venue, the citation count, and the list of documents citing the document; and storing the documents in a database;
step 2, crawling the documents in each list of citing documents using web crawler technology until all documents related to the search term have been crawled, and storing them in the database;
step 3, deduplicating the documents in the database using a deduplication technique;
step 4, displaying the documents deduplicated in step 3 using a visualization method.
2. The crawler-based academic search result visualization scheme design method according to claim 1, wherein the information crawled in step 2 for each document in the list of citing documents includes: the document title, the authors, the publication time, the publication venue, the citation count, and the list of documents citing that document.
3. The crawler-based academic search result visualization scheme design method according to claim 1, wherein the specific process of step 4 is as follows:
when the search term input by the user is a keyword, displaying the documents deduplicated in step 3 as a two-dimensional scatter plot, where each point on the scatter plot represents one document, the abscissa is the publication time of the document, and the ordinate is the citation count of the document;
displaying the citation relationships among the documents with a circular diagram, where each point on the scatter plot is assigned to an arc segment of the circular diagram and the citation relationships are shown as directed edges;
counting the publication venue of each document on the scatter plot and displaying the result with a circle packing diagram, where each circle represents one publication venue and the size of the circle depends on the number of documents published in that venue;
when the search term input by the user is an author name, displaying the documents deduplicated in step 3 as bubbles of different sizes, where each bubble represents one author and the size of the bubble depends on the citation count of that author's documents; counting the number of collaborations between authors and showing the collaboration relationships between authors with a force-directed graph;
displaying the academic level of an author with an author publication quality chart;
obtaining the member composition of a research team, counting the team's total number of published documents and total citation count, and displaying them with a circle packing diagram.
CN202010805474.9A 2020-08-12 2020-08-12 Crawler-based academic search result visualization scheme design method Pending CN112052411A (en)

Publications (1)

Publication Number Publication Date
CN112052411A true CN112052411A (en) 2020-12-08

Family

ID=73601575

Country Status (1)

Country Link
CN (1) CN112052411A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113722472A (en) * 2021-09-16 2021-11-30 北京市科学技术情报研究所 Technical literature information extraction method, system and storage medium
US11630870B2 (en) 2020-01-06 2023-04-18 Tarek A. M. Abdunabi Academic search and analytics system and method therefor

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103064837A (en) * 2011-10-19 2013-04-24 西安邮电学院 Retrieval of leading figures in academic fields and visualized navigation system
CN103136337A (en) * 2013-02-01 2013-06-05 北京邮电大学 Distributed knowledge data mining device and mining method used for complex network
CN105718528A (en) * 2016-01-15 2016-06-29 上海交通大学 Academic map display method based on reference relationship among thesises
CN110502618A (en) * 2018-05-16 2019-11-26 北京理工大学 A kind of method for visualizing of document big data

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Zhao Rongying et al.: "A Comparative Study of the Academic Search Engines Google Scholar and Microsoft Academic Search", 《情报科学》 (Information Science) *

Similar Documents

Publication Publication Date Title
Baeza-Yates Visualization of large answers in text databases
US7953732B2 (en) Searching by using spatial document and spatial keyword document indexes
CN109074383B (en) Document search with visualization within the context of a document
US20060152755A1 (en) Method, system and program product for managing document summary information
US20030115176A1 (en) Information system
US20150026159A1 (en) Digital Resource Set Integration Methods, Interfaces and Outputs
CN112052411A (en) Crawler-based academic search result visualization scheme design method
Terveen et al. Finding and visualizing inter-site clan graphs
WO2018226255A1 (en) Functional equivalence of tuples and edges in graph databases
Julien et al. Capitalizing on information organization and information visualization for a new-generation catalogue
Larson et al. The Sequoia 2000 electronic repository
Mamoon et al. Visualization for information retrieval based on fast search technology
Stelmaszewska et al. Electronic resource discovery systems: from user behaviour to design
Dontcheva et al. Collecting and organizing web content
Mass et al. Knowledge management for keyword search over data graphs
Keenan Geographic Information Systems: Their contribution to the IS mainstream
Feeney What Can Text Mining Reveal about the Use of Newspapers in Research?
Chu et al. A treemap-based result interface for search engine users
Landbeck Access to editorial cartoons: the state of the art
Eckstein et al. Visual browsing in product development processes
Rontu et al. System for enhanced exploration and querying
Chakravarty Web search behaviour by the faculty members & students: A case study of Swami Devi Dyal Institute of Engineering (SDDIE)
Goldenberg Exploratory Search in WorkTop
YOSHIMURA Forming Wisdom of Crowds by Visualizing Web Pages
Nasharuddin et al. MetaVis: Metadata visualization using JUNG’S library

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20201208