CN110928985A

CN110928985A - Scientific and technological project duplicate checking method for automatically extracting near-meaning words based on deep learning algorithm

Info

Publication number: CN110928985A
Application number: CN201910972646.9A
Authority: CN
Inventors: 谢积鉴; 陈旭红; 粟月萍; 钟雪梅; 胡婷婷; 玉泉; 陈金平; 李�荣; 陈怡玲; 卢琳玲
Original assignee: Institute Of Scientific And Technical Information Of Guangxi Autonomous Region
Current assignee: Institute Of Scientific And Technical Information Of Guangxi Autonomous Region
Priority date: 2019-10-14
Filing date: 2019-10-14
Publication date: 2020-03-27

Abstract

The invention relates to the technical field of data duplication checking, in particular to a scientific and technological project duplication checking method for automatically extracting near-meaning words based on a deep learning algorithm, which comprises the following steps: establishing a near-meaning word database and a project database, training a search term network and training a scientific and technological project network; acquiring information of the scientific and technological project to be compared, extracting the search terms in that information, pre-comparing the search terms against the project database, and extracting the search terms that cannot be identified; extracting a near-meaning word to replace each corresponding unidentifiable search term; and cascading the trained search term network and the trained scientific and technological project network, and screening out, by similarity matching, candidate scientific and technological projects whose similarity exceeds a similarity judgment threshold, so as to realize duplicate checking. The invention adopts a computer deep learning algorithm and has high operation speed and high precision; because near-meaning words are automatically extracted and retrieved, the search terms are more comprehensive, omissions are avoided, and the recall ratio is ensured.

Description

Scientific and technological project duplicate checking method for automatically extracting near-meaning words based on deep learning algorithm

Technical Field

The invention relates to the technical field of data duplication checking, in particular to a scientific and technological project duplication checking method for automatically extracting near meaning words based on a deep learning algorithm.

Background

To promote scientific and technological innovation, the number of scientific research projects in China and the scale of their funding have increased markedly, and a multi-level national system of science and technology plan funding has formed. According to incomplete statistics, the national science fund alone has about 4 thousand projects every year and the national social fund about 4 thousand; in addition, the science and technology plans and research topics at the national level, the ministry and commission level and the provincial level are difficult to count, and foreign scientific and technological projects are countless.

However, declaring the same project through multiple channels and repeated project establishment have become one of the outstanding problems in the field of scientific research project management. According to statistics, the repetition rate of scientific research projects in China reaches 40%, and the repetition rate of scientific research projects in China is about 30% or more, wherein the repetition rate of scientific research projects in China is about 60%. Repeated establishment not only wastes a great deal of scientific and technological resources, but also leads to disorderly development and extensive low-level repetition of scientific research activities, seriously damages the pioneering and innovative spirit of scientific research, greatly harms the development of scientific and technological innovation, and hinders the pace of national scientific and technological development. Therefore, how to establish an effective and feasible duplication checking mechanism for scientific and technological projects has become one of the important tasks of scientific and technological project management departments.

At present, the commonly used method is manual review, or screening repeatedly declared projects out of a large number of submitted projects by simply comparing the keywords of a scientific and technological project declaration with a project database. However, this method can hardly prevent an applicant from deliberately swapping in synonyms in the title or slightly changing the content of the project declaration, so the duplicate checking system is easily evaded; synonyms and near-meaning words are difficult to identify without dedicated manual analysis, and reliability is poor.

In addition, a scientific and technological novelty retrieval report issued by a novelty retrieval organization is often used as a reference in the evaluation and review of scientific and technological projects. The novelty and creativity of a project are judged by examining the novelty retrieval points analyzed in the report, which goes deeper into the content of the project. However, the qualification levels of novelty retrieval organizations differ greatly, and the reports are written manually and involve personal subjective judgment, so the quality of a report depends heavily on the professional level of the novelty retrieval staff, and objective and fair comparison results are difficult to guarantee.

Disclosure of Invention

The invention provides a scientific and technological project duplication checking method for automatically extracting near meaning words based on a deep learning algorithm.

The technical scheme of the invention is as follows: a scientific and technological project duplicate checking method for automatically extracting near meaning words based on a deep learning algorithm comprises the following steps:

step 1, collecting historical data, and establishing a near-meaning word database and a project database, wherein the near-meaning word database comprises a large number of near-meaning word groups, and each near-meaning word group stores words in the same language; establishing a training set for the search terms in the near-meaning word database, and training a search term network; training a scientific and technological project network by taking scientific and technological project information of the project database as a training set;

step 2, acquiring information of the scientific and technological project to be compared, extracting search terms in the information of the scientific and technological project to be compared, pre-comparing the search terms in the project database, and extracting the search terms which cannot be identified in the scientific and technological project to be compared;

step 3, searching the near-meaning word database for a near-meaning word of each unidentified search term, and extracting the near-meaning word to replace the corresponding unidentified search term;

step 4, cascading the trained search term network and the trained scientific and technological project network, inputting the near-meaning words into the trained search term network, screening out candidate scientific and technological projects whose similarity exceeds a similarity judgment threshold, and determining the candidate scientific and technological projects as similar texts of the scientific and technological project to be compared, thereby realizing duplication checking.
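As an illustration of steps 2 and 3 only, the following minimal Python sketch pre-compares search terms against a project database vocabulary and swaps each unrecognized term for near-meaning words that the database does know. The data, the function names and the choice to keep only replacements already present in the project database are assumptions made for this example; the patent does not specify how either database is stored or queried.

```python
# Hypothetical data: storage format and contents are assumptions for illustration.
NEAR_MEANING_GROUPS = [
    {"duplicate checking", "plagiarism detection", "duplication check"},
    {"deep learning", "deep neural network"},
]
PROJECT_DB_TERMS = {"plagiarism detection", "deep learning", "text mining"}


def find_unrecognized(search_terms, known_terms):
    """Step 2: pre-compare search terms with the project database vocabulary."""
    return [t for t in search_terms if t not in known_terms]


def replace_with_near_meaning(term, groups, known_terms):
    """Step 3: look up the term's group and return its members known to the database."""
    for group in groups:
        if term in group:
            return sorted(w for w in group if w in known_terms)
    return []  # no near-meaning word group contains this term


def expand_query(search_terms):
    """Keep recognized terms and substitute near-meaning words for the rest."""
    unrecognized = find_unrecognized(search_terms, PROJECT_DB_TERMS)
    expanded = [t for t in search_terms if t in PROJECT_DB_TERMS]
    for term in unrecognized:
        expanded.extend(
            replace_with_near_meaning(term, NEAR_MEANING_GROUPS, PROJECT_DB_TERMS)
        )
    return expanded


if __name__ == "__main__":
    print(expand_query(["duplicate checking", "deep learning"]))
```

The expanded term list would then be passed on to the similarity matching of step 4.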

The invention adopts a computer deep learning algorithm and carries out self-learning training based on big data, and therefore has a high degree of intelligence, high operation speed and high precision.

Because near-meaning words are automatically extracted and retrieved, the search terms are more comprehensive, omissions are avoided, and the recall ratio is ensured.

Preferably, the retrieval words are extracted from the title or keyword fields of the scientific and technological project, so that the retrieval accuracy can be further improved, and quick hit is facilitated.

Preferably, the extraction of the search term in step 1 comprises the following steps: a. performing word segmentation processing on each piece of scientific and technological project information, and segmenting the scientific and technological project information into a plurality of keywords; b. performing stop-word filtering processing, removing punctuation marks, special characters, repeated elements and stop words from the keyword array, and finally obtaining the search terms of the scientific and technological project. Word segmentation and denoising improve the accuracy of search term extraction and ensure the reliability of duplicate checking.
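The two sub-steps above can be sketched as follows. This is a simplified illustration for English text using a regular expression; the stop-word list and function name are hypothetical, and Chinese project text would need a dedicated word segmenter, which the patent does not name.

```python
import re

# Hypothetical stop-word list; a real system would load a much larger one.
STOP_WORDS = {"a", "an", "the", "of", "and", "for", "on", "in", "based"}


def extract_search_terms(text):
    """Return de-duplicated search terms from one piece of project information."""
    # a. Word segmentation: keep runs of letters and digits, which also
    #    discards punctuation marks and special characters.
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    # b. Stop-word filtering and removal of repeated elements, preserving order.
    seen = set()
    terms = []
    for token in tokens:
        if token in STOP_WORDS or token in seen:
            continue
        seen.add(token)
        terms.append(token)
    return terms


if __name__ == "__main__":
    title = "Duplicate checking of projects, based on deep learning: the projects!"
    print(extract_search_terms(title))
```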

Preferably, step 2 further includes adding each unrecognized search term to the corresponding near-meaning word group in the near-meaning word database, and continuously updating the project database and the near-meaning word database, so that the historical data become richer and more extensive, the data volume for machine learning is ensured, and the deep learning algorithm is easier to realize.

Preferably, near-meaning words are defined as follows: words with similar semantics are set as near-meaning words; different tenses and the singular and plural forms of the same English word are set as near-meaning words; upper-case and lower-case forms of the same English word are set as near-meaning words; and abbreviations, aliases and full names of the same term are set as near-meaning words. All possible near-meaning words and synonyms are added to the near-meaning word group, which further avoids omissions and improves the reliability of duplicate checking.
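One way to picture these definition rules is a lookup table in which every variant of a term points to the same group. The sketch below is an assumption-laden illustration: the group contents, the names `GROUPS`, `canonical_form` and `are_near_meaning`, and the choice to list inflected forms explicitly rather than derive them are all hypothetical and not taken from the patent.

```python
# Hypothetical near-meaning word groups. Each group explicitly lists semantic
# neighbours, inflected forms (tense, singular/plural) and abbreviations or aliases.
GROUPS = {
    "network": ["network", "networks", "net"],
    "convolutional neural network": ["convolutional neural network", "CNN", "ConvNet"],
    "detect": ["detect", "detects", "detected", "detecting"],
}


def build_index(groups):
    """Map every variant, case-folded, to the name of its group."""
    index = {}
    for canonical, variants in groups.items():
        for variant in variants:
            # Case folding makes upper- and lower-case forms equivalent.
            index[variant.casefold()] = canonical
    return index


INDEX = build_index(GROUPS)


def canonical_form(word):
    """Return the group name for a word, or None if it is unknown."""
    return INDEX.get(word.casefold())


def are_near_meaning(a, b):
    """True when both words are known and belong to the same group."""
    ca, cb = canonical_form(a), canonical_form(b)
    return ca is not None and ca == cb


if __name__ == "__main__":
    print(are_near_meaning("cnn", "ConvNet"))
    print(are_near_meaning("Detected", "detecting"))
```

Listing variants explicitly keeps the example free of any stemming heuristics; a production system might instead generate inflected forms automatically.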

The invention has the beneficial effects that:

1. The invention adopts a computer deep learning algorithm and carries out self-learning training based on big data, and therefore has a high degree of intelligence, high operation speed and high precision.

2. Because near-meaning words are automatically extracted and retrieved, the search terms are more comprehensive, omissions are avoided, and the recall ratio is ensured.

3. The project database and the near-meaning word database are continuously updated, so that the historical data become richer and more extensive, the data volume for machine learning is ensured, and the running speed of the deep learning algorithm is improved.

Drawings

Fig. 1 is a work flow chart of a scientific and technological project duplication checking method for automatically extracting near-meaning words based on a deep learning algorithm.

Detailed Description

In order to make the aforementioned and other features and advantages of the present invention comprehensible, embodiments accompanied with figures are described in detail below.

As shown in fig. 1, a scientific and technological project duplication checking method for automatically extracting near meaning words based on a deep learning algorithm includes the following steps:

step 1, collecting historical data, establishing a near-meaning word database and a project database, establishing a training set for search words in the near-meaning word database, wherein the near-meaning word database comprises a plurality of near-meaning phrases, so that a training set file comprises a plurality of near-meaning phrase training sets, the number of the near-meaning phrase training sets is consistent with that of the near-meaning phrases, and then training a search word network; taking scientific and technological project information of a project database as a training set, establishing a training folder for each scientific and technological project, and training a scientific and technological project network;

step 4, cascading the trained search word network and the trained scientific and technological project network, associating a near-meaning word with each scientific and technological project, and inputting the near-meaning word into the trained search word network. For example, a SimHash algorithm is used to reduce the dimensionality of the text: a SimHash value is generated as the fingerprint of each scientific and technological project, and the Hamming distance between the SimHash values of different texts is compared. Similar texts yield very similar hash strings, so the degree of similarity between the information of two scientific and technological projects can be judged. Candidate scientific and technological projects whose similarity exceeds the similarity judgment threshold, for example 80%, are screened out and determined to be similar texts of the scientific and technological project to be compared, thereby realizing duplicate checking. The duplicate checking result is then output for technical experts to consult in judging whether the scientific and technological project to be compared is a repeated project.
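The SimHash comparison mentioned in this step can be sketched in a few lines of Python. This is a generic, unweighted 64-bit SimHash rather than the patent's own implementation: the use of MD5 as the per-token hash, the equal weighting of tokens, and the conversion of Hamming distance into a percentage as 1 - distance/64 are all assumptions made for the example.

```python
import hashlib

BITS = 64  # fingerprint width assumed for this sketch


def simhash(tokens):
    """Compute an unweighted 64-bit SimHash fingerprint for a list of tokens."""
    weights = [0] * BITS
    for token in tokens:
        digest = hashlib.md5(token.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")  # first 64 bits of the hash
        for i in range(BITS):
            weights[i] += 1 if (h >> i) & 1 else -1
    fingerprint = 0
    for i, weight in enumerate(weights):
        if weight > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming_distance(a, b):
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


def similarity(a, b):
    """Assumed mapping from Hamming distance to a similarity in [0, 1]."""
    return 1.0 - hamming_distance(a, b) / BITS


def is_candidate(tokens_a, tokens_b, threshold=0.8):
    """True when the similarity exceeds the judgment threshold (80% here)."""
    return similarity(simhash(tokens_a), simhash(tokens_b)) > threshold


if __name__ == "__main__":
    a = ["duplicate", "checking", "deep", "learning", "project"]
    b = ["duplicate", "checking", "deep", "learning", "projects"]
    print(hamming_distance(simhash(a), simhash(b)), is_candidate(a, b))
```

In practice tokens are often weighted, for instance by term frequency, before the per-bit votes are summed; the unweighted form is used here only to keep the sketch short.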

As a preferred solution of this embodiment, the search term is extracted from a title or a keyword field of a scientific and technological project.

As a preferred scheme of this embodiment, the extraction of the search term in step 1 is calculated by using a Euclidean distance algorithm. The Euclidean distance reflects the differences between two words in the numerical values of each dimension, so similarity matching is essentially performed according to the spatial distance between two points, and the extracted near-meaning word is the one with the smallest semantic distance value. The processing of the scientific and technological project information to be compared further comprises the following steps: a. performing word segmentation processing on each piece of scientific and technological project information, and segmenting the scientific and technological project information into a plurality of keywords; b. performing stop-word filtering processing, removing punctuation marks, special characters, repeated elements and stop words from the keyword array, and finally obtaining the search terms of the scientific and technological project.
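For two word vectors $u$ and $v$ of dimension $n$, the Euclidean distance is $d(u, v) = \sqrt{\sum_{i=1}^{n} (u_i - v_i)^2}$. The sketch below picks the candidate with the smallest distance. The three-dimensional vectors and the names `EMBEDDINGS` and `nearest_near_meaning_word` are made up for illustration; in the described method the vectors would presumably come from the trained search word network, which the patent does not detail.

```python
import math

# Hypothetical word vectors; real ones would be learned, not hand-written.
EMBEDDINGS = {
    "duplicate": (1.0, 0.0, 0.0),
    "repeat": (0.9, 0.1, 0.0),
    "novel": (0.0, 1.0, 0.0),
    "network": (0.0, 0.0, 1.0),
}


def euclidean_distance(u, v):
    """Straight-line distance between two points of equal dimension."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def nearest_near_meaning_word(term, candidates, embeddings=EMBEDDINGS):
    """Return the candidate whose vector is closest to the term's vector."""
    target = embeddings[term]
    return min(candidates, key=lambda w: euclidean_distance(target, embeddings[w]))


if __name__ == "__main__":
    print(nearest_near_meaning_word("duplicate", ["repeat", "novel", "network"]))
```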

As a preferable solution of this embodiment, step 2 further includes adding each unrecognized search term to the corresponding near-meaning word group in the near-meaning word database.

As a preferable solution of this embodiment, near-meaning words are defined as follows: words with similar semantics are set as near-meaning words; different tenses and the singular and plural forms of the same English word are set as near-meaning words; upper-case and lower-case forms of the same English word are set as near-meaning words; and abbreviations, aliases and full names of the same term are set as near-meaning words.

The above-mentioned embodiments only show some embodiments of the present invention, and the description thereof is more specific and detailed, but should not be construed as limiting the scope of the present invention. It should be noted that, for a person skilled in the art, several variations and modifications can be made without departing from the inventive concept, which falls within the scope of the present invention. Therefore, the protection scope of the present invention should be subject to the claims.

Claims

1. A scientific and technological project duplicate checking method for automatically extracting near meaning words based on a deep learning algorithm is characterized in that: the method comprises the following steps:

step 1, collecting historical data, establishing a similar meaning word database and a project database, establishing a training set for search words in the similar meaning word database, and training a search word network; training a scientific and technological project network by taking scientific and technological project information of a project database as a training set;

2. The scientific and technological project duplication checking method for automatically extracting near meaning words based on the deep learning algorithm as claimed in claim 1, characterized in that: the search term is extracted from the title or keyword field of the scientific and technical project.

3. The scientific and technological project duplication checking method for automatically extracting near meaning words based on the deep learning algorithm as claimed in claim 1, characterized in that: the extraction of the search terms in the step 1 comprises the following steps: a. performing word segmentation processing on each piece of scientific and technological project information, and segmenting the scientific and technological project information into a plurality of keywords; b. performing stop-word filtering processing, removing punctuation marks, special characters, repeated elements and stop words from the keyword array, and finally obtaining the search terms of the scientific and technological project.

4. The scientific and technological project duplication checking method for automatically extracting near meaning words based on the deep learning algorithm as claimed in claim 1, characterized in that: the step 2 further comprises adding the unrecognized search term to the corresponding near-meaning word group in the near-meaning word database.

5. The scientific and technological project duplication checking method for automatically extracting near meaning words based on the deep learning algorithm as claimed in claim 2, characterized in that: the definition method of the near-meaning words in the step 2 comprises the following steps: setting words with similar semantics as near-meaning words; setting different tenses and the singular and plural forms of the same English word as near-meaning words; setting upper-case and lower-case forms of the same English word as near-meaning words; and setting abbreviations, aliases and full names of the same term as near-meaning words.