CN112131394A - Scientific and technological achievement keyword network construction method and device - Google Patents

Info

Publication number
CN112131394A
Authority
CN
China
Prior art keywords
keywords
scientific
semantic
words
similarity
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010832606.7A
Other languages
Chinese (zh)
Inventor
刘俊
郝翔宇
贺长昊
宋文乐
王磊
苏嘉成
米芝昌
张顺
孙朋朋
陈宇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
State Grid Corp of China SGCC
State Grid Hebei Electric Power Co Ltd
Cangzhou Power Supply Co of State Grid Hebei Electric Power Co Ltd
Original Assignee
State Grid Corp of China SGCC
State Grid Hebei Electric Power Co Ltd
Cangzhou Power Supply Co of State Grid Hebei Electric Power Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by State Grid Corp of China SGCC, State Grid Hebei Electric Power Co Ltd, Cangzhou Power Supply Co of State Grid Hebei Electric Power Co Ltd filed Critical State Grid Corp of China SGCC
Priority to CN202010832606.7A priority Critical patent/CN112131394A/en
Publication of CN112131394A publication Critical patent/CN112131394A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36 Creation of semantic tools, e.g. ontology or thesauri
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/31 Indexing; Data structures therefor; Storage structures
    • G06F16/316 Indexing structures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Computational Linguistics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

One or more embodiments of the present disclosure provide a method and an apparatus for constructing a keyword network of scientific and technological achievements. Scientific and technological achievement information is acquired, keywords are extracted from it, the degree of semantic association among the keywords is measured, and a keyword network is established using a predetermined semantic vector space model according to a general word bank, a special word bank, the keywords, and the association relationships among the keywords. The embodiments thereby make it possible to construct a keyword network.

Description

Scientific and technological achievement keyword network construction method and device
Technical Field
One or more embodiments of the present disclosure relate to the field of information processing technologies, and in particular, to a method and an apparatus for building a keyword network of scientific and technological achievements.
Background
As the science and technology development strategy advances, a large number of scientific and technological achievements are being produced. Evaluating these achievements scientifically and correctly, and transferring and converting the innovative ones, further stimulates innovation and promotes both technological and economic progress.
Extracting keywords from complex and diverse scientific and technological achievement information and building a keyword library provides the data support needed for analyzing scientific and technological projects and evaluating achievements.
Disclosure of Invention
In view of this, one or more embodiments of the present disclosure aim to provide a method and an apparatus for constructing a keyword network of scientific and technological achievements.
To this end, one or more embodiments of the present specification provide a method for constructing a keyword network of scientific and technological achievements, including:
acquiring scientific and technological achievement information;
extracting keywords from the scientific and technological achievement information;
measuring the degree of semantic association among the keywords;
and establishing a keyword network using a preset semantic vector space model according to the general word bank, the special word bank, and the association relationships among the keywords.
Optionally, the association relationships among the keywords include a semantic similarity relationship, a semantic correlation relationship, and a hypernym-hyponym (superordinate-subordinate) relationship.
Optionally, measuring the degree of semantic association among the keywords includes: calculating the semantic similarity between every pair of keywords, and grouping similar words into one class.
Optionally, the similarity is calculated from the word distance, using the following formula:
$$\mathrm{Sim}(W_1, W_2) = \frac{\alpha}{\mathrm{Dis}(W_1, W_2) + \alpha}$$
where W1 and W2 are the two words, Sim(W1, W2) is their similarity, Dis(W1, W2) is the word distance between them, and α is the word distance at which the similarity equals 0.5.
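As a quick check on the role of α, the short derivation below assumes the commonly used word-distance form Sim = α / (Dis + α), which is consistent with the stated property that α is the distance at which the similarity is 0.5. The numeric values α = 2 and Dis = 6 are illustrative assumptions only.

```latex
\begin{aligned}
\mathrm{Sim}(W_1, W_2) &= \frac{\alpha}{\mathrm{Dis}(W_1, W_2) + \alpha}, \qquad \alpha > 0,\ \mathrm{Dis} \ge 0 \\
\mathrm{Dis} = 0 &\;\Rightarrow\; \mathrm{Sim} = \frac{\alpha}{\alpha} = 1 \\
\mathrm{Dis} = \alpha &\;\Rightarrow\; \mathrm{Sim} = \frac{\alpha}{2\alpha} = 0.5 \\
\mathrm{Dis} \to \infty &\;\Rightarrow\; \mathrm{Sim} \to 0 \\
\alpha = 2,\ \mathrm{Dis} = 6 &\;\Rightarrow\; \mathrm{Sim} = \frac{2}{6 + 2} = 0.25
\end{aligned}
```

Under this form the similarity falls monotonically from 1 to 0 as the distance grows, and a larger α makes the decay slower.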
Optionally, the general word bank includes all words that may appear in the documents or database records in the electric power scientific and technological achievement resource base; the special word bank is constructed from a list of professional technical terms in the field of electric power science and technology.
This embodiment further provides a scientific and technological achievement keyword network construction device, including:
the acquisition module is used for acquiring scientific and technological achievement information;
the extraction module is used for extracting keywords from the scientific and technological achievement information;
the measuring module is used for measuring the semantic association degree between the keywords;
and the building module is used for building a keyword network using a preset semantic vector space model according to the general word bank, the special word bank, and the association relationships among the keywords.
Optionally, the association relationships among the keywords include a semantic similarity relationship, a semantic correlation relationship, and a hypernym-hyponym (superordinate-subordinate) relationship.
Optionally, the measuring module is configured to calculate the semantic similarity between every pair of keywords and to group similar words into one class.
Optionally, the similarity is calculated from the word distance, using the following formula:
$$\mathrm{Sim}(W_1, W_2) = \frac{\alpha}{\mathrm{Dis}(W_1, W_2) + \alpha}$$
where W1 and W2 are the two words, Sim(W1, W2) is their similarity, Dis(W1, W2) is the word distance between them, and α is the word distance at which the similarity equals 0.5.
Optionally, the general word bank includes all words that may appear in the documents or database records in the electric power scientific and technological achievement resource base; the special word bank is constructed from a list of professional technical terms in the field of electric power science and technology.
As can be seen from the above, the method and apparatus provided in one or more embodiments of the present specification acquire scientific and technological achievement information, extract keywords from it, measure the degree of semantic association among the keywords, and establish a keyword network using a predetermined semantic vector space model according to a general word bank, a special word bank, the keywords, and the association relationships among the keywords. The embodiments thereby make it possible to construct a keyword network.
Drawings
In order to more clearly illustrate one or more embodiments or prior art solutions of the present specification, the drawings that are needed in the description of the embodiments or prior art will be briefly described below, and it is obvious that the drawings in the following description are only one or more embodiments of the present specification, and that other drawings may be obtained by those skilled in the art without inventive effort from these drawings.
FIG. 1 is a schematic flow chart of a method according to one or more embodiments of the present disclosure;
FIG. 2 is a schematic structural diagram of an apparatus according to one or more embodiments of the present disclosure.
Detailed Description
For the purpose of promoting a better understanding of the objects, aspects and advantages of the present disclosure, reference is made to the following detailed description taken in conjunction with the accompanying drawings.
It is to be noted that unless otherwise defined, technical or scientific terms used in one or more embodiments of the present specification should have the ordinary meaning as understood by those of ordinary skill in the art to which this disclosure belongs. The use of "first," "second," and similar terms in one or more embodiments of the specification is not intended to indicate any order, quantity, or importance, but rather is used to distinguish one element from another. The word "comprising" or "comprises", and the like, means that the element or item listed before the word covers the element or item listed after the word and its equivalents, but does not exclude other elements or items. The terms "connected" or "coupled" and the like are not restricted to physical or mechanical connections, but may include electrical connections, whether direct or indirect. "upper", "lower", "left", "right", and the like are used merely to indicate relative positional relationships, and when the absolute position of the object being described is changed, the relative positional relationships may also be changed accordingly.
One or more embodiments of the present specification provide a scientific and technological achievement keyword network construction method, including:
S101: acquiring scientific and technological achievement information;
S102: extracting keywords from the scientific and technological achievement information;
S103: measuring the degree of semantic association among the keywords;
S104: establishing a keyword network using a preset semantic vector space model according to the general word bank, the special word bank, the keywords, and the association relationships among the keywords.
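Steps S101 to S104 can be sketched in miniature as follows. This is an illustrative sketch, not the claimed implementation: whitespace tokenization, lexicon lookup as the extraction rule, and co-occurrence counts as edge weights are all assumptions, and the function names `extract_keywords` and `build_keyword_network` are hypothetical.

```python
from itertools import combinations


def extract_keywords(record, general_lexicon, special_lexicon):
    """S102 (assumed rule): keep whitespace tokens found in either word bank."""
    vocab = general_lexicon | special_lexicon
    return [tok for tok in record.split() if tok in vocab]


def build_keyword_network(records, general_lexicon, special_lexicon):
    """S101-S104 in miniature: returns {(kw_a, kw_b): co-occurrence count}."""
    edges = {}
    for record in records:  # S101: records already acquired
        kws = sorted(set(extract_keywords(record, general_lexicon, special_lexicon)))
        # S103/S104 (assumed): link every pair of keywords found in one record
        for a, b in combinations(kws, 2):
            edges[(a, b)] = edges.get((a, b), 0) + 1
    return edges


records = ["transformer fault diagnosis", "transformer insulation fault"]
network = build_keyword_network(
    records,
    general_lexicon={"fault", "diagnosis"},
    special_lexicon={"transformer", "insulation"},
)
print(network[("fault", "transformer")])  # 2
```

Here the pair ("fault", "transformer") appears in both records, so its edge weight is 2, while every other pair has weight 1.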
In some embodiments, the association relationships among the keywords include semantic similarity relationships, semantic correlation relationships, hypernym-hyponym relationships, and the like.
Semantically similar keywords are near-synonyms: the meanings of the two words are not fully equivalent but are close. Semantic similarity is a special case of semantic correlation. This embodiment builds a near-synonym table, with a quantified degree of similarity, by constructing a semantic model. The near-synonym table is used when documents are searched and updated, and for analyzing an expert's professional field, analyzing the field of a project, or checking projects for similarity and duplication. For example, "extra-high voltage" and "extra-high voltage" are near-synonyms: when a scientific and technological achievement contains one of the words, a search for the other word also hits it, and the overall results are ranked by degree of semantic closeness.
Semantic correlation means that different words frequently co-occur or share attributes such as similar contexts, which indicates a strong semantic link between them. For example, "apple" and "apple mobile phone", or "extra-high voltage" and "1000 kV", are semantically correlated words. Users, however, usually enter only the most common term, "extra-high voltage", and a plain keyword search then fails to retrieve documents that mention "1000 kV". Semantic correlation makes the scientific and technological achievement analysis system more flexible: it can be used for query recommendation or query expansion, to guide the user in resolving ambiguity, or to guide the user to related documents or scientific and technological data.
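Query expansion from a table of correlated terms can be sketched as below. The table layout, the correlation scores, the term "transmission line", and the function name `expand_query` are all hypothetical illustrations; the text does not specify how the correlation table is stored.

```python
def expand_query(term, related_terms, max_extra=3):
    """Return the query term followed by its strongest correlated terms."""
    candidates = related_terms.get(term, {})
    # Highest score first; ties are broken alphabetically for a stable order.
    ranked = sorted(candidates, key=lambda t: (-candidates[t], t))
    return [term] + ranked[:max_extra]


# Illustrative scores only; real values would come from corpus statistics.
related_terms = {
    "extra-high voltage": {"1000 kV": 0.9, "transmission line": 0.6},
}

expanded = expand_query("extra-high voltage", related_terms)
print(expanded)  # ['extra-high voltage', '1000 kV', 'transmission line']
```

A search backend could then match documents against every term in the expanded list, so that a query for "extra-high voltage" also reaches documents that mention only "1000 kV".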
The hypernym-hyponym relationship expresses that one of two word concepts includes the other, as with "power transmission and distribution" and "transformer substation". By discovering the hypernym-hyponym relationships among words in the electric power technology field, a hierarchy and classification of the vocabulary can be constructed, which is important for applications such as multi-dimensional analysis of scientific and technological achievements. Establishing these relationships relies, on the one hand, on existing subject classifications, related industry standards, and national standards, and, on the other hand, on statistical machine learning for mining and analysis, supplemented by manual expert review, so as to construct a complete and accurate classification system.
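One simple way to hold such a hierarchy is a child-to-parent map that can be walked upward to find all broader terms. This is a sketch under assumptions: the single-parent representation, the root term "electric power", and the function name `broader_terms` are hypothetical and not taken from the text.

```python
def broader_terms(term, parent_of):
    """Walk upward from a term, returning its broader terms, nearest first."""
    chain = []
    seen = {term}
    while term in parent_of:
        term = parent_of[term]
        if term in seen:  # guard against cycles in hand-edited data
            break
        chain.append(term)
        seen.add(term)
    return chain


# Child -> parent map; "electric power" is an assumed root for illustration.
parent_of = {
    "transformer substation": "power transmission and distribution",
    "power transmission and distribution": "electric power",
}

print(broader_terms("transformer substation", parent_of))
# ['power transmission and distribution', 'electric power']
```

Rolling achievements up to their broader terms in this way is one possible basis for the multi-dimensional analysis mentioned above.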
In some embodiments, measuring the degree of semantic association among the keywords includes: calculating the semantic similarity between every pair of keywords, and grouping similar words into one class. In some approaches, the similarity is calculated from the word distance. For two words W1 and W2 with similarity Sim(W1, W2) and word distance Dis(W1, W2), the formula is:
$$\mathrm{Sim}(W_1, W_2) = \frac{\alpha}{\mathrm{Dis}(W_1, W_2) + \alpha}$$
where α is an adjustable parameter equal to the word distance at which the similarity is 0.5. In this embodiment, the semantic distance between two keywords is calculated statistically: the distance is measured from the ratio of the number of times the two keywords appear together to the number of times a single keyword appears. In some approaches, synonyms and near-synonyms among the keywords may be identified through human intervention.
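The statistical distance, the similarity, and the grouping of similar words can be sketched together as follows. Several details are assumptions because the text does not fix them: the distance is taken as 1 minus the co-occurrence ratio, the ratio's denominator is the larger of the two single-word counts, the similarity uses the form α / (Dis + α), and grouping is single-link with a hypothetical threshold of 0.55.

```python
from itertools import combinations


def word_distance(docs, w1, w2):
    """Distance in [0, 1] from co-occurrence counts (assumed: 1 - ratio)."""
    n1 = sum(1 for d in docs if w1 in d)
    n2 = sum(1 for d in docs if w2 in d)
    both = sum(1 for d in docs if w1 in d and w2 in d)
    if n1 == 0 or n2 == 0:
        return 1.0  # a word that never occurs is treated as maximally distant
    return 1.0 - both / max(n1, n2)


def similarity(distance, alpha=0.5):
    """Assumed form Sim = alpha / (Dis + alpha); equals 0.5 when Dis == alpha."""
    return alpha / (distance + alpha)


def group_similar(words, sim, threshold):
    """Single-link grouping: merge words whose similarity reaches the threshold."""
    parent = {w: w for w in words}

    def find(w):
        while parent[w] != w:
            parent[w] = parent[parent[w]]  # path halving
            w = parent[w]
        return w

    for a, b in combinations(words, 2):
        if sim(a, b) >= threshold:
            parent[find(a)] = find(b)
    groups = {}
    for w in words:
        groups.setdefault(find(w), []).append(w)
    return sorted(sorted(g) for g in groups.values())


# Each document is represented by its set of keywords (illustrative data).
docs = [{"relay", "protection"}, {"relay", "protection"}, {"relay"}, {"insulator"}]


def sim(a, b):
    return similarity(word_distance(docs, a, b))


groups = group_similar(["relay", "protection", "insulator"], sim, threshold=0.55)
print(groups)  # [['insulator'], ['protection', 'relay']]
```

In this toy corpus "relay" and "protection" co-occur in two of the three documents containing "relay", giving a distance of 1/3 and a similarity of 0.6, so they fall into one class while "insulator" stays alone.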
In some embodiments, the keyword network is constructed by extracting the keyword fields of scientific and technological projects and electric power scientific and technological documents and establishing an association between keywords that appear in the same project or document; hot keywords in a research field are then identified using a centrality measure based on random walks.
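The text does not name a specific random-walk centrality. As one possible instance, the sketch below uses a PageRank-style power iteration on an undirected, weighted co-occurrence graph; the damping factor, iteration count, edge data, and the function name `random_walk_centrality` are assumptions for illustration.

```python
def random_walk_centrality(edges, damping=0.85, iterations=50):
    """PageRank-style scores for an undirected, weighted keyword graph.

    edges maps (kw_a, kw_b) to a positive co-occurrence weight.
    """
    neighbors = {}
    for (a, b), w in edges.items():
        neighbors.setdefault(a, {})[b] = w
        neighbors.setdefault(b, {})[a] = w
    nodes = sorted(neighbors)
    score = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        new = {}
        for n in nodes:
            # Each neighbor m passes on its score in proportion to edge weight.
            incoming = sum(
                score[m] * w / sum(neighbors[m].values())
                for m, w in neighbors[n].items()
            )
            new[n] = (1 - damping) / len(nodes) + damping * incoming
        score = new
    return score


# Illustrative co-occurrence counts between keywords.
edges = {
    ("fault", "transformer"): 2,
    ("diagnosis", "fault"): 1,
    ("insulation", "transformer"): 1,
}
scores = random_walk_centrality(edges)
hot_keywords = sorted(scores, key=lambda k: (-scores[k], k))
```

Keywords that co-occur with many others, and with strongly connected others, receive higher scores, so the head of `hot_keywords` can be read as the candidate hot keywords.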
In some embodiments, the general word bank includes all words that may appear in the documents or database records in the electric power scientific and technological achievement resource base; the comprehensiveness of this word list determines the accuracy of searching and analyzing scientific and technological achievement texts. The general word list is constructed by learning from the existing scientific and technological achievement corpus and electric power domain dictionaries, supplemented by manual verification and proofreading.
The special word bank addresses the need of scientific and technological analysis functions for accurate professional keyword lists, such as lists of professional technical terms in the field of electric power science and technology. An initial version is obtained by statistical mining based on hidden Markov model and conditional random field (CRF) algorithms, and is then refined through expert review and repeated discussion.
As shown in FIG. 2, this embodiment further provides a scientific and technological achievement keyword network construction device, including:
the acquisition module is used for acquiring scientific and technological achievement information;
the extraction module is used for extracting keywords from the scientific and technological achievement information;
the measuring module is used for measuring the semantic association degree between the keywords;
and the building module is used for building a keyword network using a preset semantic vector space model according to the general word bank, the special word bank, the keywords, and the association relationships among the keywords.
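The module structure of the device can be sketched as a small class that composes the acquisition, extraction, and measuring modules and exposes the building step. The class name `KeywordNetworkDevice`, its fields, the threshold, and the toy measure are hypothetical; this shows one way to wire the modules together, not the device's actual design.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class KeywordNetworkDevice:
    acquire: Callable[[], List[str]]       # acquisition module
    extract: Callable[[str], List[str]]    # extraction module
    measure: Callable[[str, str], float]   # measuring module
    threshold: float = 0.5                 # assumed cut-off for keeping an edge

    def build(self) -> Dict[Tuple[str, str], float]:
        """Building module: keep keyword pairs whose association is strong enough."""
        keywords = sorted({kw for rec in self.acquire() for kw in self.extract(rec)})
        edges = {}
        for i, a in enumerate(keywords):
            for b in keywords[i + 1:]:
                degree = self.measure(a, b)
                if degree >= self.threshold:
                    edges[(a, b)] = degree
        return edges


device = KeywordNetworkDevice(
    acquire=lambda: ["grid fault", "grid load"],
    extract=str.split,
    # Toy measure: any pair involving "grid" is fully associated.
    measure=lambda a, b: 1.0 if "grid" in (a, b) else 0.0,
)
edges = device.build()
print(edges)  # {('fault', 'grid'): 1.0, ('grid', 'load'): 1.0}
```

Passing the modules in as callables keeps each one replaceable, for example swapping the toy measure for a word-distance similarity.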
Those of ordinary skill in the art will understand that the discussion of any embodiment above is exemplary only and is not intended to imply that the scope of the disclosure, including the claims, is limited to these examples; within the spirit of the present disclosure, features of the above embodiments or of different embodiments may also be combined, steps may be implemented in any order, and there are many other variations of the different aspects of one or more embodiments of the present specification as described above, which are not provided in detail for the sake of brevity.
In addition, well-known power/ground connections to Integrated Circuit (IC) chips and other components may or may not be shown in the provided figures, for simplicity of illustration and discussion, and so as not to obscure one or more embodiments of the disclosure. Furthermore, devices may be shown in block diagram form in order to avoid obscuring the understanding of one or more embodiments of the present description, and this also takes into account the fact that specifics with respect to implementation of such block diagram devices are highly dependent upon the platform within which the one or more embodiments of the present description are to be implemented (i.e., specifics should be well within purview of one skilled in the art). Where specific details (e.g., circuits) are set forth in order to describe example embodiments of the disclosure, it should be apparent to one skilled in the art that one or more embodiments of the disclosure can be practiced without, or with variation of, these specific details. Accordingly, the description is to be regarded as illustrative instead of restrictive.
While the present disclosure has been described in conjunction with specific embodiments thereof, many alternatives, modifications, and variations of these embodiments will be apparent to those of ordinary skill in the art in light of the foregoing description. For example, other memory architectures (e.g., dynamic RAM (DRAM)) may use the discussed embodiments.
It is intended that the one or more embodiments of the present specification embrace all such alternatives, modifications and variations as fall within the broad scope of the appended claims. Therefore, any omissions, modifications, substitutions, improvements, and the like that may be made without departing from the spirit and principles of one or more embodiments of the present disclosure are intended to be included within the scope of the present disclosure.

Claims (10)

1. A scientific and technological achievement keyword network construction method is characterized by comprising the following steps:
acquiring scientific and technological achievement information;
extracting keywords from the scientific and technological achievement information;
measuring the degree of semantic association among the keywords;
and establishing a keyword network using a preset semantic vector space model according to the general word bank, the special word bank, and the association relationships among the keywords.
2. The method of claim 1, wherein the association relationships among the keywords comprise a semantic similarity relationship, a semantic correlation relationship, and a hypernym-hyponym relationship.
3. The method of claim 1, wherein measuring the degree of semantic association among the keywords comprises: calculating the semantic similarity between every pair of keywords, and grouping similar words into one class.
4. The method of claim 3, wherein the similarity is calculated from the word distance, using the following formula:
$$\mathrm{Sim}(W_1, W_2) = \frac{\alpha}{\mathrm{Dis}(W_1, W_2) + \alpha}$$
where W1 and W2 are the two words, Sim(W1, W2) is their similarity, Dis(W1, W2) is the word distance between them, and α is the word distance at which the similarity equals 0.5.
5. The method of claim 1, wherein the general word bank comprises all words that may appear in the documents or database records in an electric power scientific and technological achievement resource base; the special word bank is constructed from a list of professional technical terms in the field of electric power science and technology.
6. A scientific and technological achievement keyword network construction device, characterized by comprising:
the acquisition module is used for acquiring scientific and technological achievement information;
the extraction module is used for extracting keywords from the scientific and technological achievement information;
the measuring module is used for measuring the semantic association degree between the keywords;
and the building module is used for building a keyword network using a preset semantic vector space model according to the general word bank, the special word bank, and the association relationships among the keywords.
7. The apparatus of claim 6, wherein the association relationships among the keywords comprise a semantic similarity relationship, a semantic correlation relationship, and a hypernym-hyponym relationship.
8. The apparatus of claim 6, wherein the measuring module is used for calculating the semantic similarity between every pair of keywords and grouping similar words into one class.
9. The apparatus of claim 8, wherein the similarity is calculated from the word distance, using the following formula:
$$\mathrm{Sim}(W_1, W_2) = \frac{\alpha}{\mathrm{Dis}(W_1, W_2) + \alpha}$$
where W1 and W2 are the two words, Sim(W1, W2) is their similarity, Dis(W1, W2) is the word distance between them, and α is the word distance at which the similarity equals 0.5.
10. The apparatus of claim 6, wherein the general word bank comprises all words that may appear in the documents or database records in an electric power scientific and technological achievement resource base; the special word bank is constructed from a list of professional technical terms in the field of electric power science and technology.
CN202010832606.7A 2020-08-18 2020-08-18 Scientific and technological achievement keyword network construction method and device Pending CN112131394A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010832606.7A CN112131394A (en) 2020-08-18 2020-08-18 Scientific and technological achievement keyword network construction method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010832606.7A CN112131394A (en) 2020-08-18 2020-08-18 Scientific and technological achievement keyword network construction method and device

Publications (1)

Publication Number Publication Date
CN112131394A true CN112131394A (en) 2020-12-25

Family

ID=73850980

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010832606.7A Pending CN112131394A (en) 2020-08-18 2020-08-18 Scientific and technological achievement keyword network construction method and device

Country Status (1)

Country Link
CN (1) CN112131394A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113641785A (en) * 2021-06-28 2021-11-12 北京邮电大学 Multi-dimension-based scientific and technological resource similar word retrieval method and electronic equipment
CN113641785B (en) * 2021-06-28 2023-08-01 北京邮电大学 Multi-dimensional technology resource similar word retrieval method and electronic equipment
CN114780673A (en) * 2022-03-28 2022-07-22 西安远诺技术转移有限公司 Scientific and technological achievement management method and scientific and technological achievement management platform based on field matching
CN114780673B (en) * 2022-03-28 2024-04-30 西安远诺技术转移有限公司 Scientific and technological achievement management method and platform based on field matching

Similar Documents

Publication Publication Date Title
CN110069610B (en) Solr-based retrieval method, solr-based retrieval device, solr-based retrieval equipment and storage medium
CN110019732B (en) Intelligent question answering method and related device
CN102156711B (en) Cloud storage based power full text retrieval method and system
CN104866572A (en) Method for clustering network-based short texts
CN106708929B (en) Video program searching method and device
CN107844493B (en) File association method and system
CN107665217A (en) A kind of vocabulary processing method and system for searching service
CN112988980B (en) Target product query method and device, computer equipment and storage medium
US20220261545A1 (en) Systems and methods for producing a semantic representation of a document
CN112507109A (en) Retrieval method and device based on semantic analysis and keyword recognition
CN107301195A (en) Generate disaggregated model method, device and the data handling system for searching for content
CN112131394A (en) Scientific and technological achievement keyword network construction method and device
KR20140075428A (en) Method and system for semantic search keyword recommendation
CN109815390B (en) Method, device, computer equipment and computer storage medium for retrieving multilingual information
CN111078842A (en) Method, device, server and storage medium for determining query result
CA2817136A1 (en) Related-word registration and information processing device, method, recording medium and system
CN111611452A (en) Method, system, device and storage medium for ambiguity recognition of search text
CN103927339B (en) Knowledge Reorganizing system and method for knowledge realignment
CN108345694B (en) Document retrieval method and system based on theme database
CN113434636A (en) Semantic-based approximate text search method and device, computer equipment and medium
GB2568575A (en) Document search using grammatical units
CN104111942B (en) Uighur medicine ancient books resource network searching platform
CN116108181A (en) Client information processing method and device and electronic equipment
CN107368525B (en) Method and device for searching related words, storage medium and terminal equipment
CN105512270A (en) Method and device for determining related objects

Legal Events

Date Code Title Description
PB01 Publication
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20201225