CN110928985A - Scientific and technological project duplicate checking method for automatically extracting near-meaning words based on deep learning algorithm - Google Patents

Publication number
CN110928985A
Authority
CN
China
Prior art keywords
scientific
project
technological
word
words
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910972646.9A
Other languages
Chinese (zh)
Inventor
谢积鉴
陈旭红
粟月萍
钟雪梅
胡婷婷
玉泉
陈金平
李�荣
陈怡玲
卢琳玲
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute Of Scientific And Technical Information Of Guangxi Autonomous Region
Original Assignee
Institute Of Scientific And Technical Information Of Guangxi Autonomous Region
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute Of Scientific And Technical Information Of Guangxi Autonomous Region
Priority to CN201910972646.9A
Publication of CN110928985A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/38 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/383 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Library & Information Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to the technical field of data duplicate checking, and in particular to a duplicate checking method for scientific and technological projects that automatically extracts near-synonyms (near-meaning words) based on a deep learning algorithm. The method comprises the following steps: establishing a near-synonym database and a project database, and training a search term network and a scientific and technological project network; acquiring the information of the scientific and technological project to be compared, extracting its search terms, pre-comparing the search terms against the project database, and extracting the search terms that cannot be identified; extracting a near-synonym to replace each unidentified search term; and cascading the trained search term network and the trained scientific and technological project network, and screening out, by similarity matching, the candidate scientific and technological projects whose similarity exceeds a similarity judgment threshold, thereby realizing duplicate checking. The invention adopts a deep learning algorithm and offers high operating speed and high precision; because near-synonyms are extracted and retrieved automatically, the search terms are more comprehensive, omissions are avoided, and the recall ratio is ensured.

Description

Scientific and technological project duplicate checking method for automatically extracting near-meaning words based on deep learning algorithm
Technical Field
The invention relates to the technical field of data duplicate checking, and in particular to a duplicate checking method for scientific and technological projects that automatically extracts near-synonyms (near-meaning words) based on a deep learning algorithm.
Background
To promote scientific and technological innovation, both the number of scientific research projects and the scale of their funding in China have increased markedly, and a multi-level national funding system for science and technology programs has formed. According to incomplete statistics, the national science fund alone supports about 4 thousand projects every year and the national social fund about 4 thousand more; beyond these, the science and technology programs and research and development topics at the national, ministry and commission, and provincial levels are difficult to count, and foreign scientific and technological projects are countless.
However, submitting the same project to multiple funding bodies and repeatedly establishing the same project have become one of the outstanding problems in the management of scientific research projects. According to statistics, the repetition rate of scientific research projects in China reaches 40%, and the repetition rate of scientific research projects in China is about 30% or more, wherein the repetition rate of scientific research projects in China is about 60%. Repeated establishment not only wastes a great deal of scientific and technological resources, but also leads to disorderly development and extensive low-level repetition of scientific research activities, seriously damages the pioneering and innovative spirit of scientific research, greatly harms the development of scientific and technological innovation, and hinders the pace of national scientific and technological development. Therefore, establishing an effective and feasible duplicate checking mechanism for scientific and technological projects has become one of the important tasks of the departments that manage such projects.
At present, the commonly used method is manual review, or screening repeatedly declared projects out of a large number of submitted projects by simply comparing the keywords of a scientific and technological project application with a project database. However, an applicant can easily evade such a duplicate checking system by deliberately swapping synonyms in the title or slightly changing the content of the application; synonyms and near-synonyms are difficult to identify without case-by-case manual analysis, so reliability is poor.
In addition, a novelty search report issued by a novelty search organization is often used as a reference in the evaluation and review of scientific and technological projects. The novelty and inventiveness of a project are judged by checking the novelty search points analyzed in the report, which examines the project content in greater depth. However, the qualifications of novelty search organizations vary widely, the reports are written manually and involve personal subjective judgment, and report quality depends heavily on the professional competence of the individual searcher, so objective and fair comparison results are difficult to guarantee.
Disclosure of Invention
The invention provides a duplicate checking method for scientific and technological projects that automatically extracts near-synonyms based on a deep learning algorithm.
The technical scheme of the invention is as follows: a duplicate checking method for scientific and technological projects that automatically extracts near-synonyms based on a deep learning algorithm, comprising the following steps:
step 1, collecting historical data, and establishing a near-synonym database and a project database, wherein the near-synonym database comprises a large number of near-synonym groups, and each near-synonym group stores words in the same language; establishing a training set from the search terms in the near-synonym database, and training a search term network; training a scientific and technological project network using the scientific and technological project information in the project database as a training set;
step 2, acquiring the information of the scientific and technological project to be compared, extracting the search terms from that information, pre-comparing the search terms against the project database, and extracting the search terms of the project to be compared that cannot be identified;
step 3, searching the near-synonym database for a near-synonym of each unidentified search term, and extracting the near-synonym to replace the corresponding unidentified search term;
step 4, cascading the trained search term network and the trained scientific and technological project network, inputting the near-synonyms into the trained search term network, screening out the candidate scientific and technological projects whose similarity exceeds a similarity judgment threshold, and determining those candidates to be similar texts of the scientific and technological project to be compared, thereby realizing duplicate checking.
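Steps 2 and 3 can be illustrated with a minimal sketch. It assumes the project database exposes a plain set of known terms and that each near-synonym group is a set of strings; the function name `replace_unrecognized_terms`, the alphabetical tie-break, and the sample data are all hypothetical and not taken from the patent, which does not specify how the pre-comparison is implemented.

```python
from typing import List, Set, Tuple


def replace_unrecognized_terms(
    search_terms: List[str],
    project_vocabulary: Set[str],
    near_synonym_groups: List[Set[str]],
) -> Tuple[List[str], List[str]]:
    """Return (terms usable for retrieval, terms that stayed unrecognized)."""
    resolved: List[str] = []
    unresolved: List[str] = []
    for term in search_terms:
        if term in project_vocabulary:
            # Step 2: the pre-comparison recognizes the term as-is.
            resolved.append(term)
            continue
        # Step 3: find groups containing the term and collect the members
        # that the project database does recognize.
        candidates = sorted(
            member
            for group in near_synonym_groups
            if term in group
            for member in group
            if member in project_vocabulary
        )
        if candidates:
            # Assumption: take the alphabetically first recognized member.
            resolved.append(candidates[0])
        else:
            unresolved.append(term)
    return resolved, unresolved


# Hypothetical sample data for illustration only.
vocabulary = {"duplicate checking", "deep learning", "project"}
groups = [{"duplicate checking", "plagiarism detection", "duplication check"}]
terms = ["plagiarism detection", "deep learning", "quantum widget"]
print(replace_unrecognized_terms(terms, vocabulary, groups))
# (['duplicate checking', 'deep learning'], ['quantum widget'])
```

Terms that remain unresolved are the ones a real system would have to handle separately, for instance by adding them to a near-synonym group.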
The invention adopts a deep learning algorithm and carries out self-learning training based on big data, giving it a high degree of intelligence, high operating speed and high precision.
Because near-synonyms are extracted and retrieved automatically, the search terms are more comprehensive, omissions are avoided, and the recall ratio is ensured.
Preferably, the search terms are extracted from the title or keyword fields of the scientific and technological project, which further improves retrieval accuracy and facilitates quick hits.
Preferably, the extraction of the search terms in step 1 comprises the following steps: a. performing word segmentation on each piece of scientific and technological project information, splitting it into a plurality of keywords; b. performing stop-word filtering, removing punctuation marks, special characters, repeated elements and stop words from the keyword array, to finally obtain the search terms of the scientific and technological project. Word segmentation and denoising improve the accuracy of search term extraction and ensure the reliability of duplicate checking.
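The two sub-steps above can be sketched as follows. This is an illustrative simplification: it splits on non-alphanumeric characters, which is adequate for English text only, whereas Chinese project information would need a dedicated word segmenter that the patent does not name. The stop-word list and the function name `extract_search_terms` are hypothetical.

```python
import re
from typing import Iterable, List

# Illustrative stop-word list; a real deployment would use a curated one.
STOP_WORDS = {"a", "an", "and", "based", "for", "of", "on", "the"}


def extract_search_terms(text: str, stop_words: Iterable[str] = STOP_WORDS) -> List[str]:
    stop = {word.lower() for word in stop_words}
    # Step a: segment into keywords. Matching runs of letters and digits
    # also discards punctuation marks and special characters.
    tokens = re.findall(r"[^\W_]+", text.lower())
    # Step b: drop stop words and repeated elements, keeping first-seen order.
    seen = set()
    terms: List[str] = []
    for token in tokens:
        if token in stop or token in seen:
            continue
        seen.add(token)
        terms.append(token)
    return terms


title = "A Duplicate-Checking Method for Projects, based on Deep Learning (projects)!"
print(extract_search_terms(title))
# ['duplicate', 'checking', 'method', 'projects', 'deep', 'learning']
```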
Preferably, step 2 further comprises adding each unrecognized search term to the corresponding near-synonym group in the near-synonym database, and continuously updating the project database and the near-synonym database, so that the historical data become richer and more extensive, the volume of data available for machine learning is ensured, and the deep learning algorithm is easier to realize.
Preferably, near-synonyms are defined as follows: words with similar semantics are treated as near-synonyms; different tenses and the singular and plural forms of the same English word are treated as near-synonyms; upper-case and lower-case forms of the same English word are treated as near-synonyms; and the abbreviations, aliases and full names of the same term are treated as near-synonyms. Adding all possible near-synonyms and synonyms to the near-synonym groups further avoids omissions and improves the reliability of duplicate checking.
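The form-based rules above (case, tense and number, abbreviations and aliases) can be approximated by mapping each word to a shared group key. The sketch below is an assumption-laden illustration: the suffix stripping is deliberately crude and English-only, the alias table is hypothetical, and the first rule (words with similar semantics) is not covered because it needs a semantic resource that the patent does not specify.

```python
from typing import Dict, Iterable, Mapping, Set

# Hypothetical alias table mapping abbreviations to full names.
ALIASES = {"dl": "deep learning"}


def normalize(word: str) -> str:
    """Fold case and strip a few common English suffixes (crude heuristic)."""
    lowered = word.lower()
    for suffix in ("ing", "ed", "es", "s"):
        if lowered.endswith(suffix) and len(lowered) - len(suffix) >= 3:
            return lowered[: -len(suffix)]
    return lowered


def build_groups(
    words: Iterable[str], aliases: Mapping[str, str] = ALIASES
) -> Dict[str, Set[str]]:
    """Group words whose normalized forms coincide."""
    groups: Dict[str, Set[str]] = {}
    for word in words:
        # Expand abbreviations and aliases to their full name first.
        canonical = aliases.get(word.lower(), word)
        key = " ".join(normalize(part) for part in canonical.split())
        groups.setdefault(key, set()).add(word)
    return groups


sample = ["Network", "networks", "DL", "deep learning", "trained", "training"]
for key, members in build_groups(sample).items():
    print(key, sorted(members))
```

A production system would more likely rely on a proper lemmatizer than on suffix stripping, which mishandles irregular forms.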
The beneficial effects of the invention are:
1. The invention adopts a deep learning algorithm and carries out self-learning training based on big data, giving it a high degree of intelligence, high operating speed and high precision.
2. Because near-synonyms are extracted and retrieved automatically, the search terms are more comprehensive, omissions are avoided, and the recall ratio is ensured.
3. The project database and the near-synonym database are continuously updated, so that the historical data become richer and more extensive, the volume of data available for machine learning is ensured, and the running speed of the deep learning algorithm is improved.
Drawings
Fig. 1 is a flow chart of the duplicate checking method for scientific and technological projects that automatically extracts near-synonyms based on a deep learning algorithm.
Detailed Description
To make the foregoing and other features and advantages of the invention easier to understand, embodiments are described in detail below with reference to the accompanying drawing.
As shown in Fig. 1, a duplicate checking method for scientific and technological projects that automatically extracts near-synonyms based on a deep learning algorithm comprises the following steps:
step 1, collecting historical data, and establishing a near-synonym database and a project database; establishing a training set from the search terms in the near-synonym database, wherein the near-synonym database comprises a plurality of near-synonym groups, so that the training set file comprises a plurality of near-synonym group training sets, equal in number to the near-synonym groups, and then training a search term network; using the scientific and technological project information in the project database as a training set, establishing a training folder for each scientific and technological project, and training a scientific and technological project network;
step 2, acquiring the information of the scientific and technological project to be compared, extracting the search terms from that information, pre-comparing the search terms against the project database, and extracting the search terms of the project to be compared that cannot be identified;
step 3, searching the near-synonym database for a near-synonym of each unidentified search term, and extracting the near-synonym to replace the corresponding unidentified search term;
step 4, cascading the trained search term network and the trained scientific and technological project network, associating near-synonyms with each scientific and technological project, and inputting the near-synonyms into the trained search term network. For example, a SimHash algorithm can be used to reduce the dimensionality of the text: a SimHash value is generated for each text and serves as the fingerprint of the scientific and technological project, and the Hamming distance between the SimHash values of different texts is compared. Similar texts yield very similar hash strings under SimHash, so the degree of similarity between the information of two scientific and technological projects can be judged. The candidate scientific and technological projects whose similarity exceeds a similarity judgment threshold, for example 80%, are screened out and determined to be similar texts of the scientific and technological project to be compared, thereby realizing duplicate checking. The duplicate checking result is then output for technical experts to consult when judging whether the scientific and technological project to be compared is a repeated project.
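The SimHash comparison mentioned in step 4 can be sketched as below. This is a generic, unweighted 64-bit SimHash over tokens, not the applicant's implementation; the choice of MD5 as the per-token hash, the conversion of Hamming distance into a similarity ratio, and the constant `THRESHOLD` are assumptions made here so that the 80% example has something concrete to be compared against.

```python
import hashlib
from typing import Iterable

BITS = 64          # fingerprint width (assumption)
THRESHOLD = 0.80   # similarity judgment threshold, following the 80% example


def simhash(tokens: Iterable[str]) -> int:
    """Compute an unweighted 64-bit SimHash fingerprint of a token sequence."""
    weights = [0] * BITS
    for token in tokens:
        digest = hashlib.md5(token.encode("utf-8")).digest()
        token_hash = int.from_bytes(digest[:8], "big")
        for i in range(BITS):
            # Each token votes +1 or -1 on every bit position.
            weights[i] += 1 if (token_hash >> i) & 1 else -1
    fingerprint = 0
    for i, weight in enumerate(weights):
        if weight > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming_distance(a: int, b: int) -> int:
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


def similarity(a: int, b: int) -> float:
    """Assumed mapping from Hamming distance to a similarity in [0, 1]."""
    return 1.0 - hamming_distance(a, b) / BITS


doc_a = "deep learning duplicate checking method for projects".split()
doc_b = "deep learning duplicate detection method for projects".split()
score = similarity(simhash(doc_a), simhash(doc_b))
print(f"similarity={score:.2f}, candidate={score > THRESHOLD}")
```

In practice tokens are often weighted (for example by term frequency) before voting, which this sketch omits for brevity.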
As a preferred solution of this embodiment, the search term is extracted from a title or a keyword field of a scientific and technological project.
As a preferred scheme of this embodiment, the extraction of the search term in step 1 is calculated using a Euclidean distance algorithm: two terms are represented as points whose coordinates reflect their numerical differences in each dimension, similarity matching is performed according to the spatial distance between the two points, and the near-synonym extracted is the one with the smallest semantic distance value. The processing of the scientific and technological project information to be compared further comprises the following steps: a. performing word segmentation on each piece of scientific and technological project information, splitting it into a plurality of keywords; b. performing stop-word filtering, removing punctuation marks, special characters, repeated elements and stop words from the keyword array, to finally obtain the search terms of the scientific and technological project.
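Selecting the near-synonym with the smallest Euclidean distance can be sketched as follows. The patent does not say how terms are turned into vectors, so the two-dimensional vectors here are made-up placeholders standing in for whatever representation the trained search term network would produce; the function names are likewise hypothetical.

```python
import math
from typing import Mapping, Sequence


def euclidean_distance(u: Sequence[float], v: Sequence[float]) -> float:
    """Straight-line distance between two points of equal dimension."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same dimension")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def nearest_near_synonym(
    query: Sequence[float], candidates: Mapping[str, Sequence[float]]
) -> str:
    """Return the candidate term whose vector is closest to the query."""
    return min(candidates, key=lambda term: euclidean_distance(query, candidates[term]))


# Placeholder vectors for illustration only.
term_vectors = {
    "duplicate checking": (1.0, 0.0),
    "image recognition": (0.0, 1.0),
}
print(euclidean_distance((0.0, 0.0), (3.0, 4.0)))        # 5.0
print(nearest_near_synonym((0.9, 0.1), term_vectors))     # duplicate checking
```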
As a preferred scheme of this embodiment, step 2 further comprises adding each unrecognized search term to the corresponding near-synonym group in the near-synonym database.
As a preferred scheme of this embodiment, near-synonyms are defined as follows: words with similar semantics are treated as near-synonyms; different tenses and the singular and plural forms of the same English word are treated as near-synonyms; upper-case and lower-case forms of the same English word are treated as near-synonyms; and the abbreviations, aliases and full names of the same term are treated as near-synonyms.
The embodiments above represent only some embodiments of the invention. Although they are described specifically and in detail, they should not be construed as limiting the scope of the invention. It should be noted that a person skilled in the art can make several variations and modifications without departing from the inventive concept, and these fall within the scope of protection of the invention. The scope of protection of the invention is therefore defined by the claims.

Claims (5)

1. A duplicate checking method for scientific and technological projects that automatically extracts near-synonyms based on a deep learning algorithm, characterized in that the method comprises the following steps:
step 1, collecting historical data, establishing a near-synonym database and a project database, establishing a training set from the search terms in the near-synonym database, and training a search term network; training a scientific and technological project network using the scientific and technological project information in the project database as a training set;
step 2, acquiring the information of the scientific and technological project to be compared, extracting the search terms from that information, pre-comparing the search terms against the project database, and extracting the search terms of the project to be compared that cannot be identified;
step 3, searching the near-synonym database for a near-synonym of each unidentified search term, and extracting the near-synonym to replace the corresponding unidentified search term;
step 4, cascading the trained search term network and the trained scientific and technological project network, inputting the near-synonyms into the trained search term network, screening out the candidate scientific and technological projects whose similarity exceeds a similarity judgment threshold, and determining those candidates to be similar texts of the scientific and technological project to be compared, thereby realizing duplicate checking.
2. The duplicate checking method for scientific and technological projects that automatically extracts near-synonyms based on a deep learning algorithm as claimed in claim 1, characterized in that: the search terms are extracted from the title or keyword fields of the scientific and technological project.
3. The duplicate checking method for scientific and technological projects that automatically extracts near-synonyms based on a deep learning algorithm as claimed in claim 1, characterized in that the extraction of the search terms in step 1 comprises the following steps: a. performing word segmentation on each piece of scientific and technological project information, splitting it into a plurality of keywords; b. performing stop-word filtering, removing punctuation marks, special characters, repeated elements and stop words from the keyword array, to finally obtain the search terms of the scientific and technological project.
4. The duplicate checking method for scientific and technological projects that automatically extracts near-synonyms based on a deep learning algorithm as claimed in claim 1, characterized in that: step 2 further comprises adding each unrecognized search term to the corresponding near-synonym group in the near-synonym database.
5. The duplicate checking method for scientific and technological projects that automatically extracts near-synonyms based on a deep learning algorithm as claimed in claim 2, characterized in that the near-synonyms in step 2 are defined as follows: words with similar semantics are treated as near-synonyms; different tenses and the singular and plural forms of the same English word are treated as near-synonyms; upper-case and lower-case forms of the same English word are treated as near-synonyms; and the abbreviations, aliases and full names of the same term are treated as near-synonyms.
CN201910972646.9A 2019-10-14 2019-10-14 Scientific and technological project duplicate checking method for automatically extracting near-meaning words based on deep learning algorithm Pending CN110928985A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910972646.9A CN110928985A (en) 2019-10-14 2019-10-14 Scientific and technological project duplicate checking method for automatically extracting near-meaning words based on deep learning algorithm

Publications (1)

Publication Number Publication Date
CN110928985A true CN110928985A (en) 2020-03-27

Family

ID=69848928

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910972646.9A Pending CN110928985A (en) 2019-10-14 2019-10-14 Scientific and technological project duplicate checking method for automatically extracting near-meaning words based on deep learning algorithm

Country Status (1)

Country Link
CN (1) CN110928985A (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103761222A (en) * 2013-12-31 2014-04-30 上海兵飞软件有限公司 Semantic-analysis-algorithm pseudo-original identification method
CN105446954A (en) * 2015-11-18 2016-03-30 广东省科技基础条件平台中心 Project duplicate checking method for science and technology big data
CN107122340A (en) * 2017-03-30 2017-09-01 浙江省科技信息研究院 A kind of similarity detection method for the science and technology item return analyzed based on synonym
CN108829663A (en) * 2018-05-21 2018-11-16 宁波薄言信息技术有限公司 A kind of article appraisal procedure and system

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112199936A (en) * 2020-11-12 2021-01-08 深圳供电局有限公司 Intelligent analysis method and storage medium for repeated declaration of scientific research projects
CN112199936B (en) * 2020-11-12 2024-01-23 深圳供电局有限公司 Intelligent analysis method and storage medium for repeated declaration of scientific research projects

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (Application publication date: 20200327)