CN110928985A - Scientific and technological project duplicate checking method for automatically extracting near-meaning words based on deep learning algorithm - Google Patents
Scientific and technological project duplicate checking method for automatically extracting near-meaning words based on deep learning algorithm
- Publication number
- CN110928985A (application CN201910972646.9A)
- Authority
- CN
- China
- Prior art keywords
- scientific
- project
- technological
- word
- words
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/38—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
- G06F16/383—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Library & Information Science (AREA)
- Data Mining & Analysis (AREA)
- Databases & Information Systems (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Computational Linguistics (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention relates to the technical field of data duplicate checking, and in particular to a duplicate-checking method for scientific and technological projects that automatically extracts near-synonyms based on a deep learning algorithm. The method comprises the following steps: establishing a near-synonym database and a project database, training a search term network, and training a scientific and technological project network; acquiring the information of the project to be compared, extracting the search terms from that information, pre-comparing the search terms against the project database, and extracting the search terms that cannot be identified; extracting a near-synonym to replace each corresponding unidentified search term; and cascading the trained search term network and the trained project network and, by similarity matching, screening out candidate projects whose similarity exceeds a similarity judgment threshold, thereby realizing duplicate checking. The invention adopts a computer deep learning algorithm and offers high operation speed and high precision; because near-synonyms are extracted and retrieved automatically, the search terms are more comprehensive, omissions are avoided, and the recall ratio is ensured.
Description
Technical Field
The invention relates to the technical field of data duplicate checking, and in particular to a duplicate-checking method for scientific and technological projects that automatically extracts near-synonyms based on a deep learning algorithm.
Background
To promote scientific and technological innovation, China has markedly increased both the number of scientific research projects and the scale of their funding, and a multi-level national subsidy system for science and technology plans has formed. According to incomplete statistics, the national science fund alone has about 4 thousand projects every year and the national social fund about 4 thousand; in addition, the science and technology plans and research and development topics at the national, ministry and commission, and provincial levels are difficult to count, and foreign scientific and technological projects are countless.
However, multi-channel declaration and repeated establishment of projects have become one of the outstanding problems in the management of scientific research projects. According to statistics, the repetition rate of scientific research projects in China reaches 40%, and the repetition rate of scientific research projects in China is about 30% or more, wherein the repetition rate of scientific research projects in China is about 60%. Repeated establishment not only wastes scientific and technological resources on a large scale, but also leads to disorderly development and a great deal of low-level repetition in research activities. It seriously damages the pioneering and innovative spirit of scientific research, does great harm to scientific and technological innovation, and slows the pace of national scientific and technological development. How to establish an effective and feasible duplicate-checking mechanism for scientific and technological projects has therefore become one of the important tasks of project management departments.
At present, the common approach is manual review, or screening repeatedly declared projects out of a large number of submissions by simply comparing the keywords of a project declaration against a project database. However, an applicant can easily evade such a duplicate-checking system by deliberately substituting synonyms in the title or slightly changing the content of the declaration. Synonyms and near-synonyms are hard to identify without case-by-case manual analysis, so reliability is poor.
In addition, a novelty retrieval report from a novelty retrieval organization is often used as a reference when scientific and technological projects are evaluated and reviewed. The novelty and creativity of a project are judged by examining the novelty retrieval points analysed in the report, which goes deeper into the project content. However, the qualification levels of novelty retrieval organizations differ greatly, the reports are written manually and involve personal subjective judgment, and report quality depends heavily on the professional competence of the novelty searcher, so objective and fair comparison results are difficult to guarantee.
Disclosure of Invention
The invention provides a duplicate-checking method for scientific and technological projects that automatically extracts near-synonyms based on a deep learning algorithm.
The technical scheme of the invention is as follows: a duplicate-checking method for scientific and technological projects that automatically extracts near-synonyms based on a deep learning algorithm, comprising the following steps:
step 1, collecting historical data and establishing a near-synonym database and a project database, wherein the near-synonym database comprises a large number of near-synonym groups, and each near-synonym group stores words in the same language; establishing a training set from the search terms in the near-synonym database and training a search term network; training a scientific and technological project network with the project information in the project database as a training set;
step 2, acquiring the information of the project to be compared, extracting the search terms from that information, pre-comparing the search terms against the project database, and extracting the search terms of the project to be compared that cannot be identified;
step 3, searching the near-synonym database for a near-synonym of each unidentified search term, and extracting the near-synonym to replace the corresponding unidentified search term;
step 4, cascading the trained search term network and the trained project network, inputting the near-synonyms into the trained search term network, screening out candidate projects whose similarity exceeds a similarity judgment threshold, and determining those candidates to be similar texts of the project to be compared, thereby realizing duplicate checking.
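Steps 2 and 3 above can be illustrated with a minimal sketch. It is not the patented implementation: the function names, the representation of the project database as a plain set of known terms, and the rule of picking the alphabetically first group member that the project database already knows are all assumptions made for illustration.

```python
from typing import List, Set, Tuple


def pre_compare(search_terms: List[str], known_terms: Set[str]) -> Tuple[List[str], List[str]]:
    """Step 2 (sketch): split search terms into recognized and unrecognized ones.

    `known_terms` stands in for the terms already present in the project database.
    """
    recognized = [t for t in search_terms if t in known_terms]
    unrecognized = [t for t in search_terms if t not in known_terms]
    return recognized, unrecognized


def replace_with_synonyms(
    search_terms: List[str],
    known_terms: Set[str],
    synonym_groups: List[Set[str]],
) -> List[str]:
    """Step 3 (sketch): swap each unrecognized term for a near-synonym.

    Assumption: the replacement is the alphabetically first member of the
    term's group that the project database knows. Terms with no usable
    near-synonym are kept unchanged.
    """
    result = []
    for term in search_terms:
        replacement = term
        if term not in known_terms:
            for group in synonym_groups:
                if term in group:
                    candidates = sorted(m for m in group if m in known_terms)
                    if candidates:
                        replacement = candidates[0]
                    break
        result.append(replacement)
    return result


if __name__ == "__main__":
    known = {"neural network", "image recognition"}
    groups = [{"neural network", "neural net", "NN"}]
    terms = ["neural net", "image recognition", "quantum widget"]
    print(pre_compare(terms, known))
    print(replace_with_synonyms(terms, known, groups))
```

Keeping a term unchanged when no near-synonym is available is a design choice of this sketch; the source does not say how such terms are handled.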
The invention adopts a computer deep learning algorithm and performs self-learning training on big data, giving a high degree of intelligence, high operation speed and high precision.
Because near-synonyms are extracted and retrieved automatically, the search terms are more comprehensive, omissions are avoided, and the recall ratio is ensured.
Preferably, the search terms are extracted from the title or keyword fields of the project, which can further improve retrieval accuracy and facilitates quick hits.
Preferably, the extraction of the search terms in step 1 comprises the following steps: a. word segmentation: segmenting each piece of project information into a plurality of keywords; b. stop-word filtering: removing punctuation marks, special characters, repeated elements and filter words from the keyword array to obtain the final search terms of the project. Word segmentation and denoising improve the accuracy of search term extraction and ensure the reliability of duplicate checking.
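The two sub-steps, word segmentation followed by stop-word filtering, can be sketched as follows. This is a simplified illustration under stated assumptions: the regular-expression tokenizer only suits space-delimited text (Chinese text would need a dedicated word segmenter), the `STOP_WORDS` list is invented for the example, and lowercasing is an added normalization not mentioned in the source.

```python
import re
from typing import FrozenSet, List

# Hypothetical stop-word list; a real system would load a curated one.
STOP_WORDS: FrozenSet[str] = frozenset(
    {"a", "an", "the", "of", "and", "for", "on", "in", "based"}
)


def extract_search_terms(text: str, stop_words: FrozenSet[str] = STOP_WORDS) -> List[str]:
    """Return the de-duplicated, filtered search terms of one project text."""
    # a. Word segmentation: keep runs of word characters, which also drops
    #    punctuation marks and special characters.
    tokens = re.findall(r"\w+", text.lower())

    # b. Stop-word filtering: remove stop words and repeated elements while
    #    preserving the order of first appearance.
    seen = set()
    terms = []
    for token in tokens:
        if token in stop_words or token in seen:
            continue
        seen.add(token)
        terms.append(token)
    return terms


if __name__ == "__main__":
    print(extract_search_terms("A Study of the Deep-Learning Method, and the method!"))
```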
Preferably, step 2 further includes adding each unrecognized search term to the corresponding near-synonym group in the near-synonym database. Continuously updating the project database and the near-synonym database makes the historical data richer, ensures a sufficient volume of data for machine learning, and better supports the deep learning algorithm.
Preferably, near-synonyms are defined as follows: words with similar semantics are treated as near-synonyms; different tenses and the singular and plural forms of the same English word are treated as near-synonyms; upper-case and lower-case forms of the same English word are treated as near-synonyms; and abbreviations, aliases and full names of the same term are treated as near-synonyms. Adding all possible near-synonyms and synonyms to the near-synonym groups further avoids omissions and improves the reliability of duplicate checking.
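The four rules above could be approximated as in the sketch below. It is illustrative only: the `SynonymDatabase` class, its method names and the crude suffix-stripping `normalize` function are assumptions, and a production system would use a proper lemmatizer rather than this naive rule for tenses and plurals.

```python
from typing import Dict, Iterable, List, Set


def normalize(word: str) -> str:
    """Fold case and strip a few English inflection suffixes (naive sketch)."""
    w = word.lower()
    for suffix in ("ing", "ed", "es", "s"):
        # Keep at least three characters so very short words are not mangled.
        if w.endswith(suffix) and len(w) - len(suffix) >= 3:
            return w[: -len(suffix)]
    return w


class SynonymDatabase:
    """Hypothetical in-memory store of near-synonym groups."""

    def __init__(self) -> None:
        self._groups: List[Set[str]] = []
        self._group_of: Dict[str, int] = {}

    def add_group(self, words: Iterable[str]) -> int:
        """Register semantically similar words, abbreviations and aliases."""
        index = len(self._groups)
        self._groups.append(set())
        for word in words:
            self.add_to_group(index, word)
        return index

    def add_to_group(self, index: int, word: str) -> None:
        """Add one word, for example a newly seen unrecognized search term."""
        self._groups[index].add(word)
        self._group_of[normalize(word)] = index

    def are_synonyms(self, a: str, b: str) -> bool:
        na, nb = normalize(a), normalize(b)
        if na == nb:  # same word up to case, tense or number
            return True
        ga, gb = self._group_of.get(na), self._group_of.get(nb)
        return ga is not None and ga == gb


if __name__ == "__main__":
    db = SynonymDatabase()
    db.add_group(["convolutional neural network", "CNN", "ConvNet"])
    print(db.are_synonyms("cnn", "ConvNets"))
    print(db.are_synonyms("Detected", "detecting"))
```

`add_to_group` mirrors the preferred update in step 2, where an unrecognized search term is added to its corresponding group.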
The invention has the beneficial effects that:
1. The invention adopts a computer deep learning algorithm and performs self-learning training on big data, giving a high degree of intelligence, high operation speed and high precision.
2. Because near-synonyms are extracted and retrieved automatically, the search terms are more comprehensive, omissions are avoided, and the recall ratio is ensured.
3. The project database and the near-synonym database are continuously updated, so the historical data become richer, a sufficient volume of data for machine learning is ensured, and the running speed of the deep learning algorithm is improved.
Drawings
Fig. 1 is a workflow chart of the duplicate-checking method for scientific and technological projects that automatically extracts near-synonyms based on a deep learning algorithm.
Detailed Description
To make the foregoing and other features and advantages of the present invention easier to understand, embodiments are described in detail below with reference to the accompanying figure.
As shown in fig. 1, a duplicate-checking method for scientific and technological projects that automatically extracts near-synonyms based on a deep learning algorithm includes the following steps:
step 1, collecting historical data and establishing a near-synonym database and a project database; establishing a training set from the search terms in the near-synonym database, wherein the near-synonym database comprises a plurality of near-synonym groups, so that the training set file comprises a plurality of near-synonym group training sets, one for each near-synonym group, and then training a search term network; taking the project information in the project database as a training set, establishing a training folder for each project, and training a scientific and technological project network;
step 2, acquiring the information of the project to be compared, extracting the search terms from that information, pre-comparing the search terms against the project database, and extracting the search terms of the project to be compared that cannot be identified;
step 3, searching the near-synonym database for a near-synonym of each unidentified search term, and extracting the near-synonym to replace the corresponding unidentified search term;
step 4, cascading the trained search term network and the trained project network, associating a near-synonym with each project, and inputting the near-synonyms into the trained search term network. For example, the SimHash algorithm may be used to reduce the dimensionality of the text: a SimHash value is generated as the fingerprint of each project's text, and the SimHash values of different texts are compared by Hamming distance. Because similar texts yield very similar hash strings under SimHash, the degree of similarity between the information of two projects can be judged. Candidate projects whose similarity exceeds the similarity judgment threshold, for example 80%, are screened out and determined to be similar texts of the project to be compared, thereby realizing duplicate checking. The duplicate-checking result is then output for technical experts to consult when judging whether the project to be compared is a repeated project.
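The SimHash comparison mentioned in step 4 can be sketched as follows. This is a generic, unweighted 64-bit SimHash over search terms, not the patent's exact procedure: the use of MD5 as the per-term hash, equal term weights, and the conversion of Hamming distance into a percentage via `1 - distance / 64` are all assumptions. The 0.8 default mirrors the 80% example in the text.

```python
import hashlib
from typing import Iterable

BITS = 64  # fingerprint width used in this sketch


def simhash(terms: Iterable[str]) -> int:
    """Compute a 64-bit SimHash fingerprint from a project's search terms."""
    weights = [0] * BITS
    for term in terms:
        # Take the first 8 bytes of MD5 as a stable 64-bit hash of the term.
        digest = hashlib.md5(term.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        for i in range(BITS):
            weights[i] += 1 if (h >> i) & 1 else -1
    fingerprint = 0
    for i, weight in enumerate(weights):
        if weight > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming_distance(a: int, b: int) -> int:
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


def similarity(a: int, b: int) -> float:
    """Map Hamming distance to a score in [0, 1] (assumed conversion)."""
    return 1.0 - hamming_distance(a, b) / BITS


def is_candidate_duplicate(a: int, b: int, threshold: float = 0.8) -> bool:
    """True when the similarity exceeds the similarity judgment threshold."""
    return similarity(a, b) > threshold


if __name__ == "__main__":
    fp1 = simhash(["deep", "learning", "duplicate", "checking"])
    fp2 = simhash(["deep", "learning", "duplicate", "detection"])
    print(hamming_distance(fp1, fp2), similarity(fp1, fp2))
```

Identical term lists always produce identical fingerprints, so their similarity is 1.0. How closely two merely similar term lists score depends on the hash function and on how many terms they share.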
As a preferred solution of this embodiment, the search terms are extracted from the title or keyword fields of the project.
As a preferred solution of this embodiment, the extraction in step 1 is computed with the Euclidean distance algorithm, which in essence measures the numerical difference between two terms across their dimensions as the spatial distance between two points and matches similarity accordingly; the near-synonym extracted is the one with the smallest semantic distance. The processing of the project information to be compared further comprises the following steps: a. word segmentation: segmenting each piece of project information into a plurality of keywords; b. stop-word filtering: removing punctuation marks, special characters, repeated elements and filter words from the keyword array to obtain the final search terms of the project.
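Selecting the near-synonym with the smallest Euclidean distance can be sketched as below. The sketch assumes each term already has a numeric vector, for instance one produced by the trained search term network. The vectors, function names and example values are hypothetical.

```python
import math
from typing import Dict, Sequence


def euclidean_distance(u: Sequence[float], v: Sequence[float]) -> float:
    """Spatial distance between two points: sqrt(sum_i (u_i - v_i)^2)."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same dimension")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def nearest_synonym(term_vector: Sequence[float],
                    candidates: Dict[str, Sequence[float]]) -> str:
    """Return the candidate whose vector has the smallest distance to the term."""
    if not candidates:
        raise ValueError("no candidate near-synonyms supplied")
    return min(candidates, key=lambda w: euclidean_distance(term_vector, candidates[w]))


if __name__ == "__main__":
    # Hypothetical 2-dimensional term vectors, for illustration only.
    group = {"neural network": [1.0, 2.0], "decision tree": [5.0, 5.0]}
    print(nearest_synonym([1.0, 1.0], group))
```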
As a preferred solution of this embodiment, step 2 further includes adding each unrecognized search term to the corresponding near-synonym group in the near-synonym database.
As a preferred solution of this embodiment, near-synonyms are defined as follows: words with similar semantics are treated as near-synonyms; different tenses and the singular and plural forms of the same English word are treated as near-synonyms; upper-case and lower-case forms of the same English word are treated as near-synonyms; and abbreviations, aliases and full names of the same term are treated as near-synonyms.
The embodiments above show only some embodiments of the present invention. Although they are described specifically and in detail, they should not be construed as limiting the scope of the invention. A person skilled in the art can make several variations and modifications without departing from the inventive concept, and these fall within the scope of the present invention. The protection scope of the present invention is therefore subject to the claims.
Claims (5)
1. A duplicate-checking method for scientific and technological projects that automatically extracts near-synonyms based on a deep learning algorithm, characterized in that the method comprises the following steps:
step 1, collecting historical data, establishing a near-synonym database and a project database, establishing a training set from the search terms in the near-synonym database, and training a search term network; training a scientific and technological project network with the project information in the project database as a training set;
step 2, acquiring the information of the project to be compared, extracting the search terms from that information, pre-comparing the search terms against the project database, and extracting the search terms of the project to be compared that cannot be identified;
step 3, searching the near-synonym database for a near-synonym of each unidentified search term, and extracting the near-synonym to replace the corresponding unidentified search term;
step 4, cascading the trained search term network and the trained project network, inputting the near-synonyms into the trained search term network, screening out candidate projects whose similarity exceeds a similarity judgment threshold, and determining those candidates to be similar texts of the project to be compared, thereby realizing duplicate checking.
2. The duplicate-checking method for scientific and technological projects that automatically extracts near-synonyms based on a deep learning algorithm as claimed in claim 1, characterized in that: the search terms are extracted from the title or keyword fields of the project.
3. The duplicate-checking method for scientific and technological projects that automatically extracts near-synonyms based on a deep learning algorithm as claimed in claim 1, characterized in that: the extraction of the search terms in step 1 comprises the following steps: a. word segmentation: segmenting each piece of project information into a plurality of keywords; b. stop-word filtering: removing punctuation marks, special characters, repeated elements and filter words from the keyword array to obtain the final search terms of the project.
4. The duplicate-checking method for scientific and technological projects that automatically extracts near-synonyms based on a deep learning algorithm as claimed in claim 1, characterized in that: step 2 further comprises adding each unrecognized search term to the corresponding near-synonym group in the near-synonym database.
5. The duplicate-checking method for scientific and technological projects that automatically extracts near-synonyms based on a deep learning algorithm as claimed in claim 2, characterized in that: the near-synonyms in step 2 are defined as follows: words with similar semantics are treated as near-synonyms; different tenses and the singular and plural forms of the same English word are treated as near-synonyms; upper-case and lower-case forms of the same English word are treated as near-synonyms; and abbreviations, aliases and full names of the same term are treated as near-synonyms.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910972646.9A CN110928985A (en) | 2019-10-14 | 2019-10-14 | Scientific and technological project duplicate checking method for automatically extracting near-meaning words based on deep learning algorithm |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910972646.9A CN110928985A (en) | 2019-10-14 | 2019-10-14 | Scientific and technological project duplicate checking method for automatically extracting near-meaning words based on deep learning algorithm |
Publications (1)
Publication Number | Publication Date |
---|---|
CN110928985A (en) | 2020-03-27 |
Family
ID=69848928
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910972646.9A (Pending; published as CN110928985A) | Scientific and technological project duplicate checking method for automatically extracting near-meaning words based on deep learning algorithm | 2019-10-14 | 2019-10-14 |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110928985A (en) |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103761222A (en) * | 2013-12-31 | 2014-04-30 | 上海兵飞软件有限公司 | Semantic-analysis-algorithm pseudo-original identification method |
CN105446954A (en) * | 2015-11-18 | 2016-03-30 | 广东省科技基础条件平台中心 | Project duplicate checking method for science and technology big data |
CN107122340A (en) * | 2017-03-30 | 2017-09-01 | 浙江省科技信息研究院 | A kind of similarity detection method for the science and technology item return analyzed based on synonym |
CN108829663A (en) * | 2018-05-21 | 2018-11-16 | 宁波薄言信息技术有限公司 | A kind of article appraisal procedure and system |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112199936A (en) * | 2020-11-12 | 2021-01-08 | 深圳供电局有限公司 | Intelligent analysis method and storage medium for repeated declaration of scientific research projects |
CN112199936B (en) * | 2020-11-12 | 2024-01-23 | 深圳供电局有限公司 | Intelligent analysis method and storage medium for repeated declaration of scientific research projects |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN111723215B (en) | Device and method for establishing biotechnological information knowledge graph based on text mining | |
AU2019263758B2 (en) | Systems and methods for generating a contextually and conversationally correct response to a query | |
CN110968699B (en) | Logic map construction and early warning method and device based on fact recommendation | |
CN106649260B (en) | Product characteristic structure tree construction method based on comment text mining | |
US10163063B2 (en) | Automatically mining patterns for rule based data standardization systems | |
US20180253416A1 (en) | Automatic Human-emulative Document Analysis Enhancements | |
CN106933800A (en) | A kind of event sentence abstracting method of financial field | |
CN113806563A (en) | Architect knowledge graph construction method for multi-source heterogeneous building humanistic historical material | |
CN114153978A (en) | Model training method, information extraction method, device, equipment and storage medium | |
Owen et al. | Towards a scientific workflow featuring Natural Language Processing for the digitisation of natural history collections. | |
CN112597768B (en) | Text auditing method, device, electronic equipment, storage medium and program product | |
CN112380848B (en) | Text generation method, device, equipment and storage medium | |
CN113806483A (en) | Data processing method and device, electronic equipment and computer program product | |
CN110705285B (en) | Government affair text subject word library construction method, device, server and readable storage medium | |
CN117216275A (en) | Text processing method, device, equipment and storage medium | |
CN110928985A (en) | Scientific and technological project duplicate checking method for automatically extracting near-meaning words based on deep learning algorithm | |
Tran et al. | Context-aware detection of sneaky vandalism on wikipedia across multiple languages | |
Sreejith et al. | N-gram based algorithm for distinguishing between Hindi and Sanskrit texts | |
CN110807096A (en) | Information pair matching method and system on small sample set | |
Suhasini et al. | A Hybrid TF-IDF and N-Grams Based Feature Extraction Approach for Accurate Detection of Fake News on Twitter Data | |
Maheswari et al. | Rule based morphological variation removable stemming algorithm | |
CN114265931A (en) | Big data text mining-based consumer policy perception analysis method and system | |
CN110472243B (en) | Chinese spelling checking method | |
CN113221538A (en) | Event library construction method and device, electronic equipment and computer readable medium | |
CN109597879B (en) | Service behavior relation extraction method and device based on 'citation relation' data |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | Application publication date: 20200327 |