CN112131394A - Scientific and technological achievement keyword network construction method and device - Google Patents
Scientific and technological achievement keyword network construction method and device
- Publication number
- CN112131394A (application CN202010832606.7A)
- Authority
- CN
- China
- Prior art keywords
- keywords
- scientific
- semantic
- words
- similarity
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/36—Creation of semantic tools, e.g. ontology or thesauri
- G06F16/31—Indexing; Data structures therefor; Storage structures
- G06F16/316—Indexing structures
- G06F16/35—Clustering; Classification
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Databases & Information Systems (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Software Systems (AREA)
- Computational Linguistics (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
One or more embodiments of the present disclosure provide a scientific and technological achievement keyword network construction method and apparatus, in which scientific and technological achievement information is acquired, keywords are extracted from the information, the semantic association degrees among the keywords are measured, and a keyword network is established by using a predetermined semantic vector space model according to a general word bank, a special word bank, the keywords, and the association relationships among the keywords. The embodiments can thereby construct a keyword network.
Description
Technical Field
One or more embodiments of the present disclosure relate to the field of information processing technologies, and in particular, to a method and an apparatus for building a keyword network of scientific and technological achievements.
Background
As science and technology development strategies advance, a large number of scientific and technological achievements are being produced. Evaluating these achievements scientifically and accurately, and transferring and converting the innovative ones, can further stimulate innovation and promote technological and economic progress.
Extracting keywords from complex and diverse scientific and technological achievement information and constructing a keyword library can provide the necessary data support for the analysis of science and technology projects and the evaluation of achievements.
Disclosure of Invention
In view of this, one or more embodiments of the present disclosure are directed to a method and an apparatus for constructing a keyword network of scientific and technological achievements.
Based on the above purpose, one or more embodiments of the present specification provide a method for constructing a keyword network of scientific and technological achievements, including:
acquiring scientific and technological achievement information;
extracting keywords from the scientific and technological achievement information;
measuring the semantic association degree among the keywords;
and establishing a keyword network by using a predetermined semantic vector space model according to a general word bank, a special word bank, and the association relationships among the keywords.
Optionally, the association relationships among the keywords include a semantic similarity relationship, a semantic relatedness relationship, and a hypernym-hyponym (superordinate-subordinate) relationship.
Optionally, measuring the semantic association degree among the keywords includes: calculating the pairwise semantic similarity of all the keywords, and grouping similar words into one class.
Optionally, the similarity is calculated from the word distance by a formula in which W1 and W2 are the two words, Sim(W1, W2) is their similarity, Dis(W1, W2) is their word distance, and α is the word distance at which the similarity equals 0.5.
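The formula itself is not reproduced in the text above. One widely used word-distance similarity that is consistent with the stated definitions, given here as an assumption rather than as the claimed formula, is:

```latex
\[
\mathrm{Sim}(W_1, W_2) = \frac{\alpha}{\mathrm{Dis}(W_1, W_2) + \alpha}
\]
```

Under this assumed form, Dis(W1, W2) = α gives Sim(W1, W2) = α / (2α) = 0.5, a distance of 0 gives a similarity of 1, and the similarity tends to 0 as the distance grows.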
Optionally, the general word bank includes all words that may appear in the documents or database records of the electric power science and technology achievement resource library; the special word bank is constructed according to a list of professional technical terms in the field of electric power science and technology.
This embodiment further provides a scientific and technological achievement keyword network construction device, including:
the acquisition module is used for acquiring scientific and technological achievement information;
the extraction module is used for extracting keywords from the scientific and technological achievement information;
the measuring module is used for measuring the semantic association degree among the keywords;
and the building module is used for building a keyword network by using a predetermined semantic vector space model according to a general word bank, a special word bank, and the association relationships among the keywords.
Optionally, the association relationships among the keywords include a semantic similarity relationship, a semantic relatedness relationship, and a hypernym-hyponym (superordinate-subordinate) relationship.
Optionally, the measuring module is configured to calculate the pairwise semantic similarity of all the keywords and to group similar words into one class.
Optionally, the similarity is calculated from the word distance by a formula in which W1 and W2 are the two words, Sim(W1, W2) is their similarity, Dis(W1, W2) is their word distance, and α is the word distance at which the similarity equals 0.5.
Optionally, the general word bank includes all words that may appear in the documents or database records of the electric power science and technology achievement resource library; the special word bank is constructed according to a list of professional technical terms in the field of electric power science and technology.
As can be seen from the above description, the scientific and technological achievement keyword network construction method and apparatus provided in one or more embodiments of the present specification acquire scientific and technological achievement information, extract keywords from it, measure the semantic association degree among the keywords, and establish a keyword network by using a predetermined semantic vector space model according to a general word bank, a special word bank, the keywords, and the association relationships among the keywords. The embodiments can thereby construct a keyword network.
Drawings
In order to more clearly illustrate one or more embodiments or prior art solutions of the present specification, the drawings that are needed in the description of the embodiments or prior art will be briefly described below, and it is obvious that the drawings in the following description are only one or more embodiments of the present specification, and that other drawings may be obtained by those skilled in the art without inventive effort from these drawings.
FIG. 1 is a schematic flow chart of a method according to one or more embodiments of the present disclosure;
FIG. 2 is a schematic structural diagram of an apparatus according to one or more embodiments of the present disclosure.
Detailed Description
For the purpose of promoting a better understanding of the objects, aspects and advantages of the present disclosure, reference is made to the following detailed description taken in conjunction with the accompanying drawings.
It is to be noted that unless otherwise defined, technical or scientific terms used in one or more embodiments of the present specification should have the ordinary meaning as understood by those of ordinary skill in the art to which this disclosure belongs. The use of "first," "second," and similar terms in one or more embodiments of the specification is not intended to indicate any order, quantity, or importance, but rather is used to distinguish one element from another. The word "comprising" or "comprises", and the like, means that the element or item listed before the word covers the element or item listed after the word and its equivalents, but does not exclude other elements or items. The terms "connected" or "coupled" and the like are not restricted to physical or mechanical connections, but may include electrical connections, whether direct or indirect. "upper", "lower", "left", "right", and the like are used merely to indicate relative positional relationships, and when the absolute position of the object being described is changed, the relative positional relationships may also be changed accordingly.
One or more embodiments of the present specification provide a scientific and technological achievement keyword network construction method, including:
S101: acquiring scientific and technological achievement information;
S102: extracting keywords from the scientific and technological achievement information;
S103: measuring the semantic association degree among the keywords;
S104: establishing a keyword network by using a predetermined semantic vector space model according to a general word bank, a special word bank, the keywords, and the association relationships among the keywords.
In some embodiments, the association relationships among the keywords include different kinds of relationship, such as a semantic similarity relationship, a semantic relatedness relationship, and a hypernym-hyponym (superordinate-subordinate) relationship.
Semantically similar keywords are near-synonyms: the meanings of the two words are not completely equivalent but are close to each other. Semantic similarity is a special case of semantic relatedness. This embodiment builds a near-synonym table, together with a quantified degree of similarity, by constructing a semantic model. The near-synonym table is used when documents are searched and updated, for example to analyze an expert's professional field or a project's field, or to check projects for duplication. For example, when "extra-high voltage" and a near-synonym of it are recorded as similar words and a scientific and technological achievement contains one of them, a search for the other also hits that achievement, and the overall results are ranked by degree of semantic similarity.
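The near-synonym lookup and similarity-ranked retrieval described above can be sketched as follows. This is a minimal illustration under stated assumptions: the table contents, the similarity values, the substring matching, and the names `NEAR_SYNONYMS`, `expand_query`, and `search` are all hypothetical and do not come from the specification.

```python
from typing import Dict, List, Tuple

# Hypothetical near-synonym table: term -> {near-synonym: similarity in (0, 1]}.
# The entries and scores are illustrative assumptions only.
NEAR_SYNONYMS: Dict[str, Dict[str, float]] = {
    "transformer substation": {"substation": 0.9, "converter station": 0.6},
}


def expand_query(term: str, table: Dict[str, Dict[str, float]]) -> Dict[str, float]:
    """Return the query term (weight 1.0) plus its near-synonyms and their weights."""
    expansions = {term: 1.0}
    expansions.update(table.get(term, {}))
    return expansions


def search(
    term: str,
    documents: Dict[str, str],
    table: Dict[str, Dict[str, float]],
) -> List[Tuple[str, float]]:
    """Return (doc_id, score) pairs for documents hit by the term or a near-synonym.

    A document's score is the highest similarity among the expansion words it
    contains; results are sorted by descending score, then by document id.
    """
    expansions = expand_query(term, table)
    scored = []
    for doc_id, text in documents.items():
        lowered = text.lower()
        hits = [sim for word, sim in expansions.items() if word in lowered]
        if hits:
            scored.append((doc_id, max(hits)))
    return sorted(scored, key=lambda pair: (-pair[1], pair[0]))


docs = {
    "a": "Design of a transformer substation",
    "b": "Substation automation",
    "c": "Converter station control",
    "d": "Wind turbine blades",
}
results = search("transformer substation", docs, NEAR_SYNONYMS)
```

Taking the maximum similarity as the document score is one simple design choice; a real system would more likely combine it with an ordinary relevance score.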
Semantic relatedness means that different words frequently co-occur or share similar attributes such as context, which indicates a strong semantic connection between them. For example, "apple" and "apple mobile phone", or "extra-high voltage" and "1000 kV", are semantically related words. Users generally enter the more common term "extra-high voltage", however, and without relatedness information the documents that mention only "1000 kV" would not be retrieved. Semantic relatedness makes the scientific and technological achievement analysis system more flexible: it can be used for query recommendation or query expansion, to guide a user in resolving ambiguity, or to guide the user toward related documents or scientific and technological data.
The hypernym-hyponym relationship expresses that one vocabulary concept includes the other, as with "power transmission and distribution" and "transformer substation". Discovering the hypernym-hyponym relationships of the vocabulary in the electric power technology field makes it possible to construct the hierarchy and classification of that vocabulary, which is important for applications such as the multi-dimensional analysis of scientific and technological achievements. Establishing these relationships relies on existing subject classifications, related industry standards, and national standards on the one hand, and on mining and analysis with statistical machine learning on the other, supplemented by manual expert review, so that a complete and accurate classification system is built.
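A hierarchy of this kind can be held as a simple child-to-parent mapping. The sketch below is illustrative only: the "power transmission and distribution" and "transformer substation" pair comes from the text, while the other entries and the function names `ancestors`, `is_hyponym_of`, and `children` are assumptions.

```python
from typing import Dict, List

# Hyponym -> hypernym links. Only the first pair is taken from the text;
# the remaining entries are hypothetical examples.
HYPERNYM_OF: Dict[str, str] = {
    "transformer substation": "power transmission and distribution",
    "overhead line": "power transmission and distribution",
    "power transmission and distribution": "electric power technology",
}


def ancestors(term: str, parent_of: Dict[str, str]) -> List[str]:
    """Return the chain of hypernyms of a term, nearest first."""
    chain: List[str] = []
    seen = {term}  # guards against accidental cycles in the table
    current = parent_of.get(term)
    while current is not None and current not in seen:
        chain.append(current)
        seen.add(current)
        current = parent_of.get(current)
    return chain


def is_hyponym_of(term: str, candidate: str, parent_of: Dict[str, str]) -> bool:
    """True if candidate is a direct or indirect hypernym of term."""
    return candidate in ancestors(term, parent_of)


def children(term: str, parent_of: Dict[str, str]) -> List[str]:
    """Return the direct hyponyms of a term in alphabetical order."""
    return sorted(child for child, parent in parent_of.items() if parent == term)
```

With such a structure, achievements tagged with "transformer substation" can be rolled up under "power transmission and distribution" for multi-dimensional statistics.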
In some embodiments, measuring the semantic association degree among the keywords comprises calculating the pairwise semantic similarity of all the keywords and grouping similar words into one class. In some approaches, the similarity is calculated from the word distance: for two words W1 and W2 with similarity Sim(W1, W2) and word distance Dis(W1, W2), the calculation formula uses an adjustable parameter α, which is the word distance at which the similarity equals 0.5.
In this embodiment, the semantic distance between two keywords is calculated statistically, by taking the ratio of the number of times the two keywords appear together to the number of times a single keyword appears. In some approaches, synonyms and near-synonyms among the keywords may also be identified through human intervention.
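The statistical measurement and the grouping of similar words can be sketched as below. Several details are assumptions, because the text does not fix them: the distance is taken as 1 minus the co-occurrence ratio, the ratio uses the count of the rarer keyword as its denominator, the similarity uses the common form α / (Dis + α), which yields 0.5 when Dis = α, and words are grouped by a hypothetical similarity threshold. All function names are illustrative.

```python
from collections import Counter
from itertools import combinations
from typing import Dict, Iterable, List, Sequence, Tuple


def cooccurrence_counts(records: Iterable[Sequence[str]]) -> Tuple[Counter, Counter]:
    """Count, per keyword and per keyword pair, the records they appear in."""
    single: Counter = Counter()
    pair: Counter = Counter()
    for keywords in records:
        unique = sorted(set(keywords))
        single.update(unique)
        pair.update(frozenset(p) for p in combinations(unique, 2))
    return single, pair


def word_distance(w1: str, w2: str, single: Counter, pair: Counter) -> float:
    """Assumed distance: 1 - (co-occurrences / occurrences of the rarer word)."""
    if w1 == w2:
        return 0.0
    denominator = min(single[w1], single[w2])
    if denominator == 0:
        return 1.0
    return 1.0 - pair[frozenset((w1, w2))] / denominator


def similarity(distance: float, alpha: float = 0.5) -> float:
    """Assumed form Sim = alpha / (Dis + alpha); equals 0.5 when Dis == alpha."""
    return alpha / (distance + alpha)


def group_similar(
    words: Sequence[str],
    single: Counter,
    pair: Counter,
    alpha: float = 0.5,
    threshold: float = 0.6,
) -> List[List[str]]:
    """Merge words whose pairwise similarity reaches the threshold (union-find)."""
    parent: Dict[str, str] = {w: w for w in words}

    def find(w: str) -> str:
        while parent[w] != w:
            parent[w] = parent[parent[w]]  # path halving
            w = parent[w]
        return w

    for a, b in combinations(words, 2):
        if similarity(word_distance(a, b, single, pair), alpha) >= threshold:
            parent[find(a)] = find(b)

    groups: Dict[str, List[str]] = {}
    for w in words:
        groups.setdefault(find(w), []).append(w)
    return sorted(sorted(g) for g in groups.values())


records = [
    ["extra-high voltage", "1000 kV", "substation"],
    ["extra-high voltage", "1000 kV"],
    ["substation", "relay"],
    ["extra-high voltage", "1000 kV", "relay"],
]
single, pair = cooccurrence_counts(records)
classes = group_similar(sorted(single), single, pair)
```

In this toy data "extra-high voltage" and "1000 kV" always appear together, so their assumed distance is 0 and they fall into one class, while the other keywords remain separate.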
In some embodiments, a keyword network is constructed by extracting keyword parts in scientific and technical projects and electric power scientific and technical documents and establishing an association relationship between keywords appearing in the same project or document, and hot keywords in the research field are identified by using a centrality measure based on random walks.
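A co-occurrence network with a random-walk centrality can be sketched as follows. The text does not name a specific centrality measure; weighted PageRank computed by power iteration is used here as one well-known random-walk-based choice, and the damping factor, iteration count, sample records, and function names are all assumptions.

```python
from collections import defaultdict
from itertools import combinations
from typing import Dict, Iterable, List, Sequence

Graph = Dict[str, Dict[str, float]]


def build_network(records: Iterable[Sequence[str]]) -> Graph:
    """Link keywords that appear in the same project or document.

    Edge weights count the number of records in which both keywords occur.
    """
    graph: Dict[str, Dict[str, float]] = defaultdict(lambda: defaultdict(float))
    for keywords in records:
        unique = sorted(set(keywords))
        for word in unique:
            graph[word]  # make sure isolated keywords still become nodes
        for a, b in combinations(unique, 2):
            graph[a][b] += 1.0
            graph[b][a] += 1.0
    return {node: dict(neighbors) for node, neighbors in graph.items()}


def random_walk_centrality(
    graph: Graph, damping: float = 0.85, iterations: int = 100
) -> Dict[str, float]:
    """Weighted PageRank by power iteration (an assumed choice of measure)."""
    nodes = sorted(graph)
    n = len(nodes)
    if n == 0:
        return {}
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Rank held by nodes without edges is redistributed uniformly.
        dangling = sum(rank[v] for v in nodes if not graph[v])
        new_rank = {v: (1.0 - damping) / n + damping * dangling / n for v in nodes}
        for v in nodes:
            total = sum(graph[v].values())
            for u, weight in graph[v].items():
                new_rank[u] += damping * rank[v] * weight / total
        rank = new_rank
    return rank


def hot_keywords(rank: Dict[str, float], k: int) -> List[str]:
    """Return the k highest-ranked keywords, ties broken alphabetically."""
    return sorted(rank, key=lambda w: (-rank[w], w))[:k]


records = [
    ["power grid", "relay protection"],
    ["power grid", "substation"],
    ["power grid", "smart meter"],
    ["relay protection", "substation"],
]
network = build_network(records)
rank = random_walk_centrality(network)
top = hot_keywords(rank, 1)
```

Here "power grid" co-occurs with every other keyword, so the random walk visits it most often and it ranks as the hottest keyword.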
In some embodiments, the general word bank includes all words that may appear in the documents or database records of the electric power science and technology achievement resource library; how comprehensive this word list is determines the accuracy of searching and analyzing scientific and technological achievement text. The general word list is constructed by learning from the existing scientific and technological achievement corpus and an electric power field dictionary, supplemented by manual verification and proofreading.
The special word bank serves scientific and technological analysis functions that require accurate professional keyword lists, such as a list of professional technical terms in the field of electric power science and technology. An initial version is obtained by statistical mining based on a hidden Markov model and a conditional random field (CRF) algorithm, and it is then refined through expert review and repeated discussion.
As shown in fig. 2, this embodiment further provides a scientific and technological achievement keyword network construction device, including:
the acquisition module is used for acquiring scientific and technological achievement information;
the extraction module is used for extracting keywords from the scientific and technological achievement information;
the measuring module is used for measuring the semantic association degree between the keywords;
and the building module is used for building a keyword network by using a predetermined semantic vector space model according to a general word bank, a special word bank, the keywords, and the association relationships among the keywords.
Those of ordinary skill in the art will understand that: the discussion of any embodiment above is meant to be exemplary only, and is not intended to intimate that the scope of the disclosure, including the claims, is limited to these examples; within the spirit of the present disclosure, features from the above embodiments or from different embodiments may also be combined, steps may be implemented in any order, and there are many other variations of different aspects of one or more embodiments of the present description as described above, which are not provided in detail for the sake of brevity.
In addition, well-known power/ground connections to Integrated Circuit (IC) chips and other components may or may not be shown in the provided figures, for simplicity of illustration and discussion, and so as not to obscure one or more embodiments of the disclosure. Furthermore, devices may be shown in block diagram form in order to avoid obscuring the understanding of one or more embodiments of the present description, and this also takes into account the fact that specifics with respect to implementation of such block diagram devices are highly dependent upon the platform within which the one or more embodiments of the present description are to be implemented (i.e., specifics should be well within purview of one skilled in the art). Where specific details (e.g., circuits) are set forth in order to describe example embodiments of the disclosure, it should be apparent to one skilled in the art that one or more embodiments of the disclosure can be practiced without, or with variation of, these specific details. Accordingly, the description is to be regarded as illustrative instead of restrictive.
While the present disclosure has been described in conjunction with specific embodiments thereof, many alternatives, modifications, and variations of these embodiments will be apparent to those of ordinary skill in the art in light of the foregoing description. For example, other memory architectures (e.g., dynamic RAM (DRAM)) may use the discussed embodiments.
It is intended that the one or more embodiments of the present specification embrace all such alternatives, modifications and variations as fall within the broad scope of the appended claims. Therefore, any omissions, modifications, substitutions, improvements, and the like that may be made without departing from the spirit and principles of one or more embodiments of the present disclosure are intended to be included within the scope of the present disclosure.
Claims (10)
1. A scientific and technological achievement keyword network construction method, characterized by comprising the following steps:
acquiring scientific and technological achievement information;
extracting keywords from the scientific and technological achievement information;
measuring the semantic association degree among the keywords;
and establishing a keyword network by using a predetermined semantic vector space model according to a general word bank, a special word bank, and the association relationships among the keywords.
2. The method of claim 1, wherein the association relationships among the keywords comprise a semantic similarity relationship, a semantic relatedness relationship, and a hypernym-hyponym relationship.
3. The method of claim 1, wherein measuring the semantic association degree among the keywords comprises: calculating the pairwise semantic similarity of all the keywords, and grouping similar words into one class.
4. The method of claim 3, wherein the similarity is calculated from the word distance by a formula in which W1 and W2 are the two words, Sim(W1, W2) is their similarity, Dis(W1, W2) is their word distance, and α is the word distance at which the similarity equals 0.5.
5. The method of claim 1, wherein the general word bank comprises all words that may appear in the documents or database records of an electric power science and technology achievement resource library; the special word bank is constructed according to a list of professional technical terms in the field of electric power science and technology.
6. A scientific and technological achievement keyword network construction device, characterized by comprising:
the acquisition module is used for acquiring scientific and technological achievement information;
the extraction module is used for extracting keywords from the scientific and technological achievement information;
the measuring module is used for measuring the semantic association degree among the keywords;
and the building module is used for building a keyword network by using a predetermined semantic vector space model according to a general word bank, a special word bank, and the association relationships among the keywords.
7. The apparatus of claim 6, wherein the association relationships among the keywords comprise a semantic similarity relationship, a semantic relatedness relationship, and a hypernym-hyponym relationship.
8. The apparatus of claim 6, wherein the measuring module is used for calculating the pairwise semantic similarity of all the keywords and grouping similar words into one class.
9. The apparatus of claim 8, wherein the similarity is calculated from the word distance by a formula in which W1 and W2 are the two words, Sim(W1, W2) is their similarity, Dis(W1, W2) is their word distance, and α is the word distance at which the similarity equals 0.5.
10. The apparatus of claim 6, wherein the general word bank comprises all words that may appear in the documents or database records of an electric power science and technology achievement resource library; the special word bank is constructed according to a list of professional technical terms in the field of electric power science and technology.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010832606.7A CN112131394A (en) | 2020-08-18 | 2020-08-18 | Scientific and technological achievement keyword network construction method and device |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010832606.7A CN112131394A (en) | 2020-08-18 | 2020-08-18 | Scientific and technological achievement keyword network construction method and device |
Publications (1)
Publication Number | Publication Date |
---|---|
CN112131394A (en) | 2020-12-25 |
Family
ID=73850980
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010832606.7A Pending CN112131394A (en) | 2020-08-18 | 2020-08-18 | Scientific and technological achievement keyword network construction method and device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112131394A (en) |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113641785A (en) * | 2021-06-28 | 2021-11-12 | 北京邮电大学 | Multi-dimension-based scientific and technological resource similar word retrieval method and electronic equipment |
CN113641785B (en) * | 2021-06-28 | 2023-08-01 | 北京邮电大学 | Multi-dimensional technology resource similar word retrieval method and electronic equipment |
CN114780673A (en) * | 2022-03-28 | 2022-07-22 | 西安远诺技术转移有限公司 | Scientific and technological achievement management method and scientific and technological achievement management platform based on field matching |
CN114780673B (en) * | 2022-03-28 | 2024-04-30 | 西安远诺技术转移有限公司 | Scientific and technological achievement management method and platform based on field matching |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
WD01 | Invention patent application deemed withdrawn after publication | Application publication date: 20201225 |