CN115269848A - Scientific and technical literature data classification method - Google Patents

Scientific and technical literature data classification method

Info

Publication number
CN115269848A
Authority
CN
China
Prior art keywords
literature
document
scientific
documents
information
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210927468.XA
Other languages
Chinese (zh)
Inventor
李小英
郑浩
黎超平
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhanjiang Zhizhikang Space Planning Consulting Co ltd
Original Assignee
Zhanjiang Zhizhikang Space Planning Consulting Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
2022-08-03
Filing date
2022-08-03
Publication date
2022-11-01
Application filed by Zhanjiang Zhizhikang Space Planning Consulting Co ltd
Priority to CN202210927468.XA
Publication of CN115269848A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/953 Querying, e.g. by the use of web search engines
    • G06F16/9532 Query formulation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/953 Querying, e.g. by the use of web search engines
    • G06F16/9538 Presentation of query results

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a scientific and technical literature data classification method and relates to the technical field of literature classification. It aims to solve the problem that a knowledge base classified only by keywords cannot meet the user's needs when the user has no specific keywords to search for, or when the knowledge base finds no related content after the corresponding keywords are entered. The method specifically comprises the following steps: the scientific and technical literature to be classified is transmitted through a network to a database dedicated to storing literature, and the literature is analyzed to obtain its title, authors, creation organization, publication date and word count. With this method, when the user has no specific keywords to search for, or the knowledge base finds no related content after the corresponding keywords are entered, the user can search for documents through related descriptive sentences, so that a document library using the classification method can better provide search services to the user.

Description

Scientific and technical literature data classification method
Technical Field
The invention relates to the technical field of literature classification, in particular to a scientific and technical literature data classification method.
Background
Big data mining of scientific and technical literature is currently a hot research topic in the field of data mining, and how to classify such data accurately and efficiently is a key problem in that research. With the rapid development of science and technology, large numbers of scientific documents such as papers and patents continue to emerge. Some companies and enterprises need to search across multiple online libraries, so searching for documents on the internet no longer meets their needs; faced with massive volumes of documents, more and more companies, enterprises and groups have begun to build their own scientific and technical literature knowledge bases.
A search of the prior art shows that the Chinese patent with application number CN202110554334.3 discloses a knowledge-graph-based scientific and technical literature classification method, which comprises a document acquisition step (acquiring the scientific and technical documents to be classified) and a text preprocessing step. That technical scheme classifies the documents only by extracting relevant keywords from them, so a user can retrieve documents from the knowledge base only with relatively definite keywords. It therefore also has the problem that a knowledge base classified only by keywords cannot meet the user's needs when the user has no definite keywords to search for, or when the knowledge base finds no related content after the corresponding keywords are entered.
Disclosure of Invention
The invention aims to overcome the shortcomings of the prior art and provides a scientific and technical literature data classification method.
To achieve this purpose, the invention adopts the following technical scheme:
A scientific and technical literature data classification method comprises the following steps:
S1: transmitting the scientific and technical literature to be classified through a network to a database dedicated to storing literature;
S2: analyzing the scientific and technical literature to obtain its title, authors, creation organization, publication date and document word count, and then storing this information through a network, with the literature title as the tag name, into a database for storing basic classification information;
S3: further analyzing the scientific and technical literature to obtain its keywords, key sentences, demonstration problem description and demonstration conclusion description, and then storing this information through a network, with the literature title as the tag name, into a database for storing detailed classification information;
S4: further analyzing the scientific and technical literature, capturing the first drawing in the literature, and then storing the captured picture through a network, with the literature title as the tag name, into a database for storing picture information;
S5: compiling the title, keywords, captured picture, key sentences, demonstration problem description and demonstration conclusion description of the literature into corresponding tag information, and storing the tag information through a network into a tag retrieval database used for retrieving information;
S6: classifying the documents in the database for storing literature according to the extracted keywords to form a basic document classification library;
S7: classifying the documents in the database again according to the extracted key sentences, demonstration problem description and demonstration conclusion description to form an auxiliary document classification library, thereby completing the classification of the scientific and technical literature.
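Purely as an illustration, steps S1 to S7 can be pictured as one ingestion routine that writes to several stores keyed by the document title. The Python sketch below is not part of the disclosed method: the `ParsedDocument` and `Stores` classes, the `ingest` function, and the use of in-memory dictionaries in place of networked databases are all assumptions made for the example, and the parsing of the raw file into fields is taken as already done.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Set


@dataclass
class ParsedDocument:
    """Fields that S2-S4 are assumed to have extracted already (hypothetical)."""
    title: str
    authors: List[str]
    organization: str
    published: str                      # e.g. "2022-08-03"
    text: str
    keywords: List[str]
    key_sentences: List[str]
    problem: str                        # demonstration problem description
    conclusion: str                     # demonstration conclusion description
    first_picture: Optional[bytes] = None


@dataclass
class Stores:
    """In-memory stand-ins for the databases named in S1-S7 (assumption)."""
    documents: Dict[str, str] = field(default_factory=dict)            # S1
    basic_info: Dict[str, dict] = field(default_factory=dict)          # S2
    detail_info: Dict[str, dict] = field(default_factory=dict)         # S3
    pictures: Dict[str, bytes] = field(default_factory=dict)           # S4
    tags: Dict[str, dict] = field(default_factory=dict)                # S5
    basic_library: Dict[str, Set[str]] = field(default_factory=dict)   # S6
    auxiliary_library: Dict[str, str] = field(default_factory=dict)    # S7


def ingest(doc: ParsedDocument, stores: Stores) -> None:
    title = doc.title                   # the title serves as the tag name
    stores.documents[title] = doc.text
    stores.basic_info[title] = {
        "authors": doc.authors,
        "organization": doc.organization,
        "published": doc.published,
        # Whitespace word count; Chinese text would need a segmenter instead.
        "word_count": len(doc.text.split()),
    }
    detail = {
        "keywords": doc.keywords,
        "key_sentences": doc.key_sentences,
        "problem": doc.problem,
        "conclusion": doc.conclusion,
    }
    stores.detail_info[title] = detail
    if doc.first_picture is not None:   # nothing is stored when no picture exists
        stores.pictures[title] = doc.first_picture
    stores.tags[title] = {"title": title, "picture": doc.first_picture, **detail}
    for keyword in doc.keywords:        # S6: keyword -> titles
        stores.basic_library.setdefault(keyword.lower(), set()).add(title)
    # S7: one searchable text per title, built from the descriptive fields
    stores.auxiliary_library[title] = " ".join(
        doc.key_sentences + [doc.problem, doc.conclusion]
    )
```

Keeping the keyword index and the descriptive-text index in separate structures mirrors the separation between the basic and auxiliary document classification libraries.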
Preferably: in step S1, the database for storing the literature stores the literature data and the tag information using cloud storage technology.
Further: in step S2, the basic document classification library and the auxiliary document classification library classify the retrieved documents by author and creation organization, and sort the retrieved documents by publication date and word count.
Further preferably: in step S3, when no keywords are explicitly marked in the document, the 3-5 words that occur most frequently in the document are stored as its keywords.
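The keyword fallback can be sketched with a simple frequency count. The following Python example is only one possible reading: the function name `choose_keywords`, the small stop-word list, and the restriction to Latin-alphabet words are assumptions (the text does not mention stop words, and Chinese text would first need word segmentation).

```python
import re
from collections import Counter
from typing import List, Optional

# Tiny illustrative stop-word list (assumption); a real list would be longer.
STOP_WORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "for", "on"}


def choose_keywords(text: str, marked: Optional[List[str]] = None, k: int = 5) -> List[str]:
    """Return the marked keywords, or the k most frequent words if none are marked."""
    if marked:
        return list(marked)
    if not 3 <= k <= 5:
        raise ValueError("k must be between 3 and 5")
    words = [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOP_WORDS]
    return [word for word, _ in Counter(words).most_common(k)]
```

Without some filtering of function words, the most frequent words in most texts would be articles and prepositions, which is why the sketch adds the stop-word step.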
As a preferable aspect of the invention: in step S3, the first and last sentences of each paragraph in the body text of the document are extracted and stored as key sentences; 1-10 sentences are extracted, and when the body text has many paragraphs, sentences are extracted only from every second or every several paragraphs, depending on the number of paragraphs.
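A minimal Python sketch of this key-sentence rule follows. The names `split_sentences` and `key_sentences`, the punctuation-based sentence splitter, the blank-line paragraph separator, and the formula used to pick the paragraph interval are assumptions; the text only states that the interval depends on the number of paragraphs and that at most 10 sentences are kept.

```python
import math
import re
from typing import List


def split_sentences(paragraph: str) -> List[str]:
    # Split after Latin end punctuation followed by whitespace, or directly
    # after full-width Chinese end punctuation.
    parts = re.split(r"(?<=[.!?])\s+|(?<=[。！？])", paragraph.strip())
    return [p for p in parts if p]


def key_sentences(text: str, max_sentences: int = 10) -> List[str]:
    paragraphs = [p for p in re.split(r"\n\s*\n", text) if p.strip()]
    if not paragraphs:
        return []
    # Each sampled paragraph yields up to two sentences, so choose the
    # interval that keeps the total within max_sentences.
    stride = max(1, math.ceil(len(paragraphs) * 2 / max_sentences))
    picked: List[str] = []
    for paragraph in paragraphs[::stride]:
        sentences = split_sentences(paragraph)
        if not sentences:
            continue
        picked.append(sentences[0])
        if len(sentences) > 1:          # avoid storing a lone sentence twice
            picked.append(sentences[-1])
    return picked[:max_sentences]
```

With 12 paragraphs, for example, this choice samples every third paragraph and so stores at most 8 sentences.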
Further preferably: in step S3, the introduction of the document is extracted and stored as the demonstration problem description, and the conclusion section of the document is extracted and stored as the demonstration conclusion description.
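One way to locate the introduction and conclusion is to split the text at heading lines. The Python sketch below assumes that a heading is a short line without closing punctuation and that the sections are titled "Introduction" and "Conclusion" (or "Conclusions"); these heuristics and the function names are illustrative assumptions, and the text does not say how unmarked sections are found.

```python
import re
from typing import Dict, List, Tuple


def is_heading(line: str) -> bool:
    stripped = line.strip()
    # Heuristic (assumption): at most six words and no closing punctuation.
    return bool(stripped) and len(stripped.split()) <= 6 and stripped[-1] not in ".!?;:,"


def split_sections(text: str) -> Dict[str, str]:
    sections: Dict[str, List[str]] = {}
    current = None
    for line in text.splitlines():
        if is_heading(line):
            # Drop numbering such as "1." or "2.3" before the heading name.
            current = re.sub(r"^\d+(\.\d+)*[.)]?\s*", "", line.strip()).lower()
            sections[current] = []
        elif current is not None:
            sections[current].append(line)
    return {name: "\n".join(body).strip() for name, body in sections.items()}


def problem_and_conclusion(text: str) -> Tuple[str, str]:
    sections = split_sections(text)
    problem = sections.get("introduction", "")
    conclusion = sections.get("conclusion") or sections.get("conclusions", "")
    return problem, conclusion
```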
As a still further scheme of the invention: in step S4, if no picture is detected in the document, the capture is skipped; the size of a captured picture is kept within 40-100kb.
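The size control can be sketched as a search over an encoder's quality setting. The Python example below reads "kb" as kilobytes and assumes a caller-supplied `encode(quality)` function whose output grows with the quality value; both are assumptions, as is the name `fit_picture`. The sketch enforces the upper bound and only reports whether the lower bound is met, since a small picture cannot be enlarged by re-encoding alone.

```python
from typing import Callable, Optional

MIN_BYTES = 40 * 1024    # lower bound, reading "kb" as kilobytes (assumption)
MAX_BYTES = 100 * 1024   # upper bound


def fit_picture(encode: Optional[Callable[[int], bytes]],
                lo: int = 1, hi: int = 95) -> Optional[bytes]:
    """Return the highest-quality encoding that stays within MAX_BYTES."""
    if encode is None:
        return None                      # no picture detected: skip the capture
    best = None
    while lo <= hi:                      # binary search over the quality setting
        quality = (lo + hi) // 2
        data = encode(quality)
        if len(data) > MAX_BYTES:
            hi = quality - 1             # too large, lower the quality
        else:
            best = data                  # fits, try a higher quality
            lo = quality + 1
    return best


def in_target_range(data: bytes) -> bool:
    return MIN_BYTES <= len(data) <= MAX_BYTES
```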
On the basis of the above scheme: in step S5, after the user retrieves a document through the basic document classification library, the tag retrieval database displays the title, keywords and captured picture of that document.
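As a rough illustration, a keyword lookup in the basic document classification library followed by a read from the tag retrieval database could look like the Python sketch below. The dictionary layouts (`basic_library` mapping a lower-cased keyword to a set of titles, `tags` mapping a title to its tag information) and the function name `search_basic` are assumptions made for the example.

```python
from typing import Dict, List, Set


def search_basic(query_keyword: str,
                 basic_library: Dict[str, Set[str]],
                 tags: Dict[str, dict]) -> List[dict]:
    """Look up titles by keyword, then build what is displayed for each hit."""
    titles = sorted(basic_library.get(query_keyword.strip().lower(), set()))
    return [
        {
            "title": title,
            "keywords": tags[title]["keywords"],
            "picture": tags[title].get("picture"),   # None if nothing was captured
        }
        for title in titles
    ]
```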
On the basis of the foregoing scheme, preferably: in step S5, after the user retrieves a document through the auxiliary document classification library, the tag retrieval database displays the title, key sentences, demonstration problem description and demonstration conclusion description of that document; once the displayed information shows a satisfactory result, the user clicks the title of the document to view it.
Further preferably on the basis of the foregoing scheme: in step S2, the document is analyzed to determine whether it is written in Chinese characters, and documents written in foreign languages are then classified by language.
The beneficial effects of the invention are as follows:
1. When the user is not satisfied with the documents retrieved through the basic document classification library, or cannot retrieve any, the auxiliary document classification library can be used to search again. The user only needs to enter descriptive sentences about the desired document in the search bar; the auxiliary document classification library then compares those sentences with the key sentences, demonstration problem descriptions and demonstration conclusion descriptions stored in the tag retrieval database and displays the results in order of sentence similarity. The user can thus search for documents through related descriptive sentences, and a document library using the classification method can better provide search services to the user.
2. The method can sort the retrieved documents by word count and publication date while displaying their titles, keywords and related pictures to the user, so that the user can view the results clearly, quickly grasp their content and quickly find the required document, which improves search efficiency.
3. Even when the keywords, abstract, introduction and conclusion are not marked in a document, the method can automatically find the keywords, key sentences, demonstration problem description and demonstration conclusion description of that document. The document library can therefore record scientific and technical literature in different formats, users can accurately retrieve documents of different formats and file structures, and the recording and search performance of the document library is improved.
4. The tag information of each document is stored in the basic document classification library according to the extracted keywords, and in the auxiliary document classification library according to the extracted key sentences, demonstration problem description and demonstration conclusion description. The basic tag information and the auxiliary tag information are thus stored separately, and users can choose either library for retrieval according to their own needs, which improves the user experience.
5. Documents are classified according to the language in which they are written, so people in different countries can use the document library for classified retrieval, and staff can sort the retrieved documents by language, which makes the results more accurate and improves user experience and working efficiency. Documents in different languages can also be classified and stored in the document library, which enriches its holdings.
Drawings
Fig. 1 is a schematic flow chart of a scientific and technical literature data classification method according to the present invention.
Detailed Description
The technical solution of the present patent will be described in further detail with reference to the following embodiments.
Example 1:
A scientific and technical literature data classification method, as shown in FIG. 1, comprises the following steps:
S1: transmitting the scientific and technical literature to be classified through a network to a database dedicated to storing literature;
S2: analyzing the scientific and technical literature to obtain its title, authors, creation organization, publication date and document word count, and then storing this information through a network, with the literature title as the tag name, into a database for storing basic classification information;
S3: further analyzing the scientific and technical literature to obtain its keywords, key sentences, demonstration problem description and demonstration conclusion description, and then storing this information through a network, with the literature title as the tag name, into a database for storing detailed classification information;
S4: further analyzing the scientific and technical literature, capturing the first drawing in the literature, and then storing the captured picture through a network, with the literature title as the tag name, into a database for storing picture information;
S5: compiling the title, keywords, captured picture, key sentences, demonstration problem description and demonstration conclusion description of the literature into corresponding tag information, and storing the tag information through a network into a tag retrieval database used for retrieving information;
S6: classifying the documents in the database for storing literature according to the extracted keywords to form a basic document classification library;
S7: classifying the documents in the database again according to the extracted key sentences, demonstration problem description and demonstration conclusion description to form an auxiliary document classification library, thereby completing the classification of the scientific and technical literature.
In step S1, the database for storing the literature stores the literature data and the tag information using cloud storage technology.
In step S2, the basic document classification library and the auxiliary document classification library classify the retrieved documents by author and creation organization, and sort the retrieved documents by publication date and word count.
In step S3, when no keywords are explicitly marked in the document, the 3-5 words that occur most frequently in the document are stored as its keywords.
In step S3, the first and last sentences of each paragraph in the body text of the document are extracted and stored as key sentences; 1-10 sentences are extracted, and when the body text has many paragraphs, sentences are extracted only from every second or every several paragraphs, depending on the number of paragraphs.
In step S3, the introduction of the document is extracted and stored as the demonstration problem description, and the conclusion section of the document is extracted and stored as the demonstration conclusion description.
In step S4, if no picture is detected in the document, the capture is skipped; the size of a captured picture is kept within 40-100kb.
In step S5, after the user retrieves a document through the basic document classification library, the tag retrieval database displays the title, keywords and captured picture of that document.
In step S5, after the user retrieves a document through the auxiliary document classification library, the tag retrieval database displays the title, key sentences, demonstration problem description and demonstration conclusion description of that document; once the displayed information shows a satisfactory result, the user clicks the title of the document to view it.
When the user is not satisfied with the documents retrieved through the basic document classification library, or cannot retrieve any, the auxiliary document classification library can be used to search again. The user only needs to enter descriptive sentences about the desired document in the search bar; the auxiliary document classification library then compares those sentences with the key sentences, demonstration problem descriptions and demonstration conclusion descriptions stored in the tag retrieval database and displays the results in order of sentence similarity. The user can thus search for documents through related descriptive sentences, and a document library using the classification method can better provide search services to the user.
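The description does not specify how sentence similarity is measured. As one hedged possibility, the Python sketch below ranks documents by cosine similarity between bag-of-words vectors of the query and of the stored descriptive text; the similarity measure, the tokenizer, the layout of `tag_records` and the function names are all assumptions, and Chinese text would need word segmentation before counting.

```python
import math
import re
from collections import Counter
from typing import Dict, List, Tuple


def bag(text: str) -> Counter:
    return Counter(re.findall(r"\w+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def search_auxiliary(query: str, tag_records: Dict[str, dict],
                     top_k: int = 5) -> List[Tuple[str, float]]:
    """Rank titles by similarity between the query and the stored descriptions."""
    query_bag = bag(query)
    scored = []
    for title, tag in tag_records.items():
        text = " ".join(tag["key_sentences"] + [tag["problem"], tag["conclusion"]])
        scored.append((cosine(query_bag, bag(text)), title))
    scored.sort(key=lambda pair: pair[0], reverse=True)   # most similar first
    return [(title, score) for score, title in scored[:top_k] if score > 0]
```

A production system would more likely use sentence embeddings or a full-text search engine; the bag-of-words measure is chosen here only because it is short and self-contained.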
The method can sort the retrieved documents by word count and publication date while displaying their titles, keywords and related pictures to the user, so that the user can view the results clearly, quickly grasp their content and quickly find the required document, which improves search efficiency.
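The ordering and grouping of results can be sketched in a few lines of Python. The record fields, the function names, and the choice of newest-first then longest-first ordering are assumptions; the description does not state a sort direction or how the two criteria are combined.

```python
from collections import defaultdict
from datetime import date
from typing import Dict, List


def sort_results(results: List[dict]) -> List[dict]:
    # Newest first; among documents published on the same date, longest first.
    return sorted(results, key=lambda r: (r["published"], r["word_count"]), reverse=True)


def group_by(results: List[dict], field_name: str) -> Dict[str, List[str]]:
    """Group result titles by a field such as "organization"."""
    groups = defaultdict(list)
    for record in results:
        groups[record[field_name]].append(record["title"])
    return dict(groups)
```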
Even when the keywords, abstract, introduction and conclusion are not marked in a document, the method can automatically find the keywords, key sentences, demonstration problem description and demonstration conclusion description of that document. The document library can therefore record scientific and technical literature in different formats, users can accurately retrieve documents of different formats and file structures, and the recording and search performance of the document library is improved.
The tag information of each document is stored in the basic document classification library according to the extracted keywords, and in the auxiliary document classification library according to the extracted key sentences, demonstration problem description and demonstration conclusion description. The basic tag information and the auxiliary tag information are thus stored separately, and users can choose either library for retrieval according to their own needs, which improves the user experience.
Example 2:
This embodiment of the scientific and technical literature data classification method improves on Example 1 as follows: in step S2, the document is analyzed to determine whether it is written in Chinese characters, and documents written in foreign languages are then classified by language.
Documents are classified according to the language in which they are written, so people in different countries can use the document library for classified retrieval, and staff can sort the retrieved documents by language, which makes the results more accurate and improves user experience and working efficiency. Documents in different languages can also be classified and stored in the document library, which enriches its holdings.
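A simple way to test whether a document is written in Chinese is to measure the share of Han characters among its letters. The Python sketch below does this with Unicode code-point ranges and, for other documents, falls back to a coarse script bucket as a stand-in for language classification. The 0.5 threshold, the ranges chosen, the bucket names and the function names are assumptions; telling apart languages that share a script (for example English and French) would need a proper language-identification model.

```python
from collections import Counter


def script_of(ch: str) -> str:
    """Map one alphabetic character to a coarse script bucket."""
    code = ord(ch)
    if 0x4E00 <= code <= 0x9FFF:
        return "han"                     # CJK Unified Ideographs
    if 0x3040 <= code <= 0x30FF:
        return "kana"                    # Hiragana and Katakana
    if 0xAC00 <= code <= 0xD7A3:
        return "hangul"
    if 0x0400 <= code <= 0x04FF:
        return "cyrillic"
    if ch.isascii() or 0x00C0 <= code <= 0x024F:
        return "latin"
    return "other"


def classify_language(text: str, han_threshold: float = 0.5) -> str:
    counts = Counter(script_of(ch) for ch in text if ch.isalpha())
    total = sum(counts.values())
    if total == 0:
        return "unknown"
    # Han characters without kana are treated as Chinese; kana suggests Japanese.
    if counts["han"] / total >= han_threshold and counts["kana"] == 0:
        return "chinese"
    foreign = {script: n for script, n in counts.items() if script != "han"}
    if not foreign:
        return "chinese"
    return max(foreign, key=foreign.get)
```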
The above description covers only the preferred embodiments of the present invention, but the scope of protection of the present invention is not limited thereto; whatever a person skilled in the art arrives at within the technical scope of the present invention, according to its technical solutions and inventive concept, should be considered to fall within that scope.

Claims (10)

1. A scientific and technical literature data classification method, characterized by comprising the following steps:
S1: transmitting the scientific and technical literature to be classified through a network to a database dedicated to storing literature;
S2: analyzing the scientific and technical literature to obtain its title, authors, creation organization, publication date and document word count, and then storing this information through a network, with the literature title as the tag name, into a database for storing basic classification information;
S3: further analyzing the scientific and technical literature to obtain its keywords, key sentences, demonstration problem description and demonstration conclusion description, and then storing this information through a network, with the literature title as the tag name, into a database for storing detailed classification information;
S4: further analyzing the scientific and technical literature, capturing the first drawing in the literature, and then storing the captured picture through a network, with the literature title as the tag name, into a database for storing picture information;
S5: compiling the title, keywords, captured picture, key sentences, demonstration problem description and demonstration conclusion description of the literature into corresponding tag information, and storing the tag information through a network into a tag retrieval database used for retrieving information;
S6: classifying the documents in the database for storing literature according to the extracted keywords to form a basic document classification library;
S7: classifying the documents in the database again according to the extracted key sentences, demonstration problem description and demonstration conclusion description to form an auxiliary document classification library, thereby completing the classification of the scientific and technical literature.
2. The scientific and technical literature data classification method according to claim 1, wherein in step S1, the database for storing the literature stores the literature data and the tag information using cloud storage technology.
3. The scientific and technical literature data classification method according to claim 1, wherein in step S2, the basic document classification library and the auxiliary document classification library classify the retrieved documents by author and creation organization, and sort the retrieved documents by publication date and word count.
4. The scientific and technical literature data classification method according to claim 1, wherein in step S3, when no keywords are explicitly marked in the document, the 3-5 words that occur most frequently in the document are stored as its keywords.
5. The scientific and technical literature data classification method according to claim 1, wherein in step S3, the first and last sentences of each paragraph in the body text of the document are extracted and stored as key sentences; 1-10 sentences are extracted, and when the body text has many paragraphs, sentences are extracted only from every second or every several paragraphs.
6. The scientific and technical literature data classification method according to claim 5, wherein in step S3, the introduction of the document is extracted and stored as the demonstration problem description, and the conclusion section of the document is extracted and stored as the demonstration conclusion description.
7. The scientific and technical literature data classification method according to claim 1, wherein in step S4, the capture is skipped if no picture is detected in the document, and the size of a captured picture is kept within 40-100kb.
8. The scientific and technical literature data classification method according to claim 1, wherein in step S5, after the user retrieves a document through the basic document classification library, the tag retrieval database displays the title, keywords and captured picture of that document.
9. The scientific and technical literature data classification method according to claim 8, wherein in step S5, after the user retrieves a document through the auxiliary document classification library, the tag retrieval database displays the title, key sentences, demonstration problem description and demonstration conclusion description of that document, and once the displayed information shows a satisfactory result, the user clicks the title of the document to view it.
10. The scientific and technical literature data classification method according to claim 3, wherein in step S2, the document is analyzed to determine whether it is written in Chinese characters, and documents written in foreign languages are then classified by language.
CN202210927468.XA 2022-08-03 2022-08-03 Scientific and technical literature data classification method Pending CN115269848A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210927468.XA CN115269848A (en) 2022-08-03 2022-08-03 Scientific and technical literature data classification method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210927468.XA CN115269848A (en) 2022-08-03 2022-08-03 Scientific and technical literature data classification method

Publications (1)

Publication Number Publication Date
CN115269848A (en) 2022-11-01

Family

ID=83748871

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210927468.XA Pending CN115269848A (en) 2022-08-03 2022-08-03 Scientific and technical literature data classification method

Country Status (1)

Country Link
CN (1) CN115269848A (en)

Similar Documents

Publication Publication Date Title
CN109992645B (en) Data management system and method based on text data
CN107085583B (en) Electronic document management method and device based on content
CN107844493B (en) File association method and system
US10372718B2 (en) Systems and methods for enterprise data search and analysis
Andersen et al. Building a large corpus based on newspapers from the web
CN103473369A (en) Semantic-based information acquisition method and semantic-based information acquisition system
CN110795932B (en) Geological report text information extraction method based on geological ontology
CN109871424B (en) Chinese academic research hotspot area information automatic extraction and map making method
CN110941702A (en) Retrieval method and device for laws and regulations and laws and readable storage medium
CN111078839A (en) Structured processing method and processing device for referee document
CN114356967A (en) Professional information collection and analysis application platform
CN116561295A (en) Internet data extraction system
CN111881695A (en) Audit knowledge retrieval method and device
CN101894158A (en) Intelligent retrieval system
CN115269848A (en) Scientific and technical literature data classification method
WO2021241601A1 (en) Information retrieval system
JPH01304575A (en) Document processing device
CN112241463A (en) Search method based on fusion of text semantics and picture information
Hast et al. Making large collections of handwritten material easily accessible and searchable
Bolatbek et al. Creating the dataset of keywords for detecting an extremist orientation in web-resources in the Kazakh language
Tanaka et al. Constructing a public meeting corpus
Asfoor Applying Data Science Techniques to Improve Information Discovery in Oil And Gas Unstructured Data
JP7004123B1 (en) Information retrieval system
WO2021241602A1 (en) Information search system
Asfoor et al. Unleash the Potential of Upstream Data Using Search, AI and Computer Vision

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination