CN112199938A - Scientific and technological project similarity analysis method, computer equipment and storage medium - Google Patents

Info

Publication number
CN112199938A
Authority
CN
China
Prior art keywords
project
historical
evaluated
information
similarity
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202011258083.6A
Other languages
Chinese (zh)
Other versions
CN112199938B (en)
Inventor
汪桢子
章彬
何维
汪伟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Power Supply Bureau Co Ltd
Original Assignee
Shenzhen Power Supply Bureau Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Power Supply Bureau Co Ltd
Priority to CN202011258083.6A
Publication of CN112199938A
Application granted
Publication of CN112199938B
Legal status: Active
Anticipated expiration

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/10: Text processing
    • G06F40/194: Calculation of difference between files
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00: Pattern recognition
    • G06F18/20: Analysing
    • G06F18/22: Matching criteria, e.g. proximity measures
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/20: Natural language analysis
    • G06F40/258: Heading extraction; Automatic titling; Numbering
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06Q: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00: Administration; Management
    • G06Q10/10: Office automation; Time management
    • G06Q10/103: Workflow collaboration or project management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Business, Economics & Management (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • Strategic Management (AREA)
  • Human Resources & Organizations (AREA)
  • General Engineering & Computer Science (AREA)
  • Entrepreneurship & Innovation (AREA)
  • General Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Health & Medical Sciences (AREA)
  • Tourism & Hospitality (AREA)
  • Marketing (AREA)
  • Operations Research (AREA)
  • Quality & Reliability (AREA)
  • Economics (AREA)
  • General Business, Economics & Management (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention relates to a scientific and technological project similarity analysis method, computer equipment and a storage medium. The method comprises the following steps: acquiring an electronic document of the declaration materials of a project to be evaluated, and performing text extraction on it to obtain the title information to be evaluated; acquiring an electronic document of the declaration materials of a historical review project, and performing text extraction on it to obtain the historical title information; performing short text similarity analysis on the title information to be evaluated and the historical title information, and preliminarily judging from the result whether the two are similar; if so, performing text extraction on the electronic documents of the project to be evaluated and the historical review project to obtain the long text information to be evaluated and the historical long text information, then performing long text similarity analysis and a final similarity judgment; if not, moving on to the next historical review project or ending. The method is suited to text similarity analysis of science and technology project declaration materials in the electric power field, supports intelligent assisted project-approval review, and helps avoid duplicate project approval.

Description

Scientific and technological project similarity analysis method, computer equipment and storage medium
Technical Field
The invention relates to the technical field of software information, in particular to a scientific and technological project similarity analysis method, computer equipment and a storage medium.
Background
As electric power reform deepens and science and technology continue to develop, a growing number of scientific and technological research projects in the various professional fields of power grid companies are submitted for project-approval review. To avoid repeated declaration of similar projects, the declaration materials of these projects must undergo a similarity review. Science and technology project declaration materials are generally large texts. At present, similarity judgment relies on manual reading and comparison by professionals: each declaration must be compared manually against a large number of earlier declarations in a database, which consumes a great deal of labor and time and makes it difficult to guarantee efficient and accurate judgments. With growing environmental awareness, power grid companies have adopted paperless office work, and declaration materials are now submitted and reviewed as electronic documents. Electronic documents provide a basis for computerizing the review work: whether a project has been declared repeatedly can be judged by analyzing the text similarity between the project to be evaluated and the historical review projects. Current text similarity analysis mainly consists of segmenting the text into words, calculating distances between the resulting words, and combining these into an overall similarity result.
However, current text similarity analysis methods are not well suited to project-approval review of scientific and technological research projects in the professional fields of a power grid company, mainly for the following reasons:
(1) Titles contain many professional terms, which typically appear as long compound terms and cannot simply be segmented. For example, in "Research and application of a device visualization monitoring model based on big data accelerated analysis and three-dimensional digitization", naive segmentation splits "big data accelerated analysis" and "device visualization monitoring model" into "big data", "accelerated", "analysis", "device", "visualization", "monitoring" and "model", and the meaning is changed;
(2) Semantic understanding works poorly for professional terms. For example, semantic methods may assign a relatively high similarity to "Research on key technologies and development modes of the source-end-base integrated energy system" and "Research on multi-energy conversion simulation and comprehensive energy efficiency evaluation technology for integrated energy systems", although the two projects are in fact very different;
(3) Titles of scientific and technological projects are relatively short: about 30 characters at the long end and only about 10 at the short end. Because titles contain many professional terms, which are often combined into longer meaningful compounds, two projects whose titles share many such terms are very likely to be similar. If the edit distance is computed directly on the titles, however, the resulting similarity may be very low.
(4) The project title is a short text, whereas the project abstract, main research contents, technical route, expected targets and other parts of the declaration material are long texts made up of many sentences, and adjacent sentences are mostly related to one another. The declaration material therefore cannot be compared with a simple text comparison method, and existing text processing does not take this into account.
Disclosure of Invention
The invention aims to provide a scientific and technological project similarity analysis method, computer equipment and a computer-readable storage medium that are suited to text similarity analysis of science and technology project declaration materials in the various professional fields of electric power, support intelligent assisted project-approval review, avoid duplicate project approval, and help improve the quality and efficiency of project-approval management.
To achieve the above objective, according to a first aspect, an embodiment of the present invention provides a method for analyzing similarity of scientific and technological projects, including:
step S1, obtaining an electronic document of the declaration materials of the project to be evaluated, and performing text extraction on it to obtain the title information to be evaluated of the project to be evaluated;
step S2, obtaining the declaration material electronic document of the ith historical review project, and performing text extraction on it to obtain the historical title information of the ith historical review project;
step S3, performing short text similarity analysis on the title information to be evaluated and the historical title information of the ith historical review project, and preliminarily judging from the analysis result whether the two are similar; if yes, executing steps S4-S5 in sequence, otherwise executing step S6; wherein the initial value of i is 1;
step S4, performing text extraction on the declaration material electronic document of the project to be evaluated to obtain the long text information to be evaluated, and performing text extraction on the declaration material electronic document of the ith historical review project to obtain the historical long text information of the ith historical review project;
step S5, performing long text similarity analysis on the long text information to be evaluated and the historical long text information of the ith historical review project, and finally judging from the analysis result whether the two are similar;
step S6, judging whether i is less than N; if yes, setting i = i + 1 and returning to step S2; if not, outputting the similarity judgment results between the project to be evaluated and all the historical review projects to a display unit for display, and ending the analysis process; wherein M is a preset number and N is the total number of historical review projects.
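The control flow of steps S1 to S6 can be read as a two-stage filter: a cheap title comparison decides whether the expensive long-text comparison runs at all. The Python sketch below is illustrative only. The function name `screen_project`, the tuple layout of `history`, and the pluggable `short_sim` and `long_sim` callables are assumptions, the "greater than" test against the second threshold is assumed, and for brevity the long texts are passed in already extracted rather than extracted on demand in step S4.

```python
from typing import Callable, List, Tuple


def screen_project(
    new_title: str,
    new_body: str,
    history: List[Tuple[str, str]],          # (title, long text) per historical project
    short_sim: Callable[[str, str], float],  # hypothetical title similarity function
    long_sim: Callable[[str, str], float],   # hypothetical long-text similarity function
    first_threshold: float,
    second_threshold: float,
) -> List[Tuple[int, float]]:
    """Return (index, long-text similarity) for each historical project judged similar."""
    similar = []
    for i, (old_title, old_body) in enumerate(history):
        # Step S3: the title comparison acts as a pre-filter.
        if short_sim(new_title, old_title) <= first_threshold:
            continue  # Step S6: move on to the next historical project.
        # Steps S4-S5: compare long texts only for projects that passed the pre-filter.
        score = long_sim(new_body, old_body)
        if score > second_threshold:
            similar.append((i, score))
    return similar
```

Any pair of similarity functions returning a value in [0, 1] can be plugged in; the sketch does not prescribe which ones.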
Optionally, the step S3 includes:
step S31, obtaining the longest continuous common substring between the title information to be evaluated and the historical title information of the ith historical review project, and removing it from each of them to obtain a first character string and a second character string;
step S32, calculating the edit distance between the first character string and the second character string;
step S33, calculating the similarity between the title information to be evaluated and the historical title information of the ith historical review project according to the edit distance;
and step S34, judging whether the title information to be evaluated and the historical title information of the ith historical review project are similar by comparing their similarity with a first similarity threshold.
Optionally, the step S31 includes:
step S311, letting the title information to be evaluated be a character string $s_1$ and the historical title information of the ith historical review project be a character string $s_i$;
step S312, finding the longest continuous common substring $s_z$ of the character strings $s_1$ and $s_i$;
step S313, if the length of $s_z$ is greater than 2, removing $s_z$ from $s_1$ and from $s_i$ to obtain two new character strings $s_{10}$ and $s_{i0}$, setting $s_1 = s_{10}$ and $s_i = s_{i0}$, and returning to step S312; if the length of $s_z$ is less than or equal to 2, outputting $s_{10}$ as the first character string and $s_{i0}$ as the second character string.
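Steps S311 to S313 describe a loop that keeps stripping the longest shared run of characters until only short overlaps remain. A minimal Python sketch of that loop follows. The function names are hypothetical, removing only the first occurrence of the substring in each string is an assumption, and when the very first common substring is already short the sketch simply returns the strings unchanged.

```python
from typing import Tuple


def longest_common_substring(a: str, b: str) -> str:
    """Longest contiguous run of characters shared by a and b (dynamic programming)."""
    best_len, best_end = 0, 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                # Extend the common run that ended at a[i-2], b[j-2].
                curr[j] = prev[j - 1] + 1
                if curr[j] > best_len:
                    best_len, best_end = curr[j], i
        prev = curr
    return a[best_end - best_len:best_end]


def strip_common_substrings(s1: str, si: str, min_len: int = 2) -> Tuple[str, str]:
    """Repeat steps S312-S313 until the longest common substring has length <= min_len."""
    while True:
        sz = longest_common_substring(s1, si)
        if len(sz) <= min_len:
            return s1, si
        # Assumption: remove only the first occurrence of sz from each string.
        s1 = s1.replace(sz, "", 1)
        si = si.replace(sz, "", 1)
```

For example, `strip_common_substrings("XXhelloYYworldZZ", "helloQworld")` first removes `hello`, then `world`, and returns the residual strings `("XXYYZZ", "Q")`.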
Optionally, the calculating the similarity between the to-be-reviewed title information and the historical title information of the ith historical review project according to the edit distance includes:
Figure BDA0002773726760000041
wherein $s_{10}$ represents the first character string, $s_{i0}$ represents the second character string, $sim(s_{10}, s_{i0})$ represents the similarity between the title information to be evaluated and the historical title information of the ith historical review project calculated from the edit distance, $ED$ represents the edit distance between the first character string and the second character string, $len(s_{10})$ represents the length of the first character string, and $len(s_{i0})$ represents the length of the second character string.
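The formula itself is an image that is not reproduced in this text, so its exact form cannot be confirmed here. As a hedged illustration only, the sketch below uses one common normalisation built from the same three quantities ($ED$ and the two lengths): one minus the edit distance divided by the longer length. This choice and the function name are assumptions and may differ from the patent's formula.

```python
def similarity_from_edit_distance(ed: int, len_first: int, len_second: int) -> float:
    """Map an edit distance to a similarity in [0, 1].

    ASSUMED normalisation (1 - ED / max length); not taken from the patent's formula image.
    """
    longest = max(len_first, len_second)
    if longest == 0:
        # Both residual strings are empty, so nothing differs between them.
        return 1.0
    return 1.0 - ed / longest
```

Because the edit distance between two strings never exceeds the longer length, the value stays between 0 and 1, which makes it directly comparable with a threshold such as the first similarity threshold.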
Optionally, the title information to be evaluated includes the main project title of the project to be evaluated and the subtitles in its research content; the historical title information of the ith historical review project includes the main project title of the ith historical review project and the subtitles in its research content;
the step S31 specifically includes: obtaining the longest continuous common substring between each title in the title information to be evaluated and each title in the historical title information of the ith historical review project, and removing these longest continuous common substrings to obtain a first character string $s_{jk1}$ and a second character string $s_{jk2}$; wherein $s_{jk1}$ denotes the first character string obtained from the jth title in the title information to be evaluated after the longest continuous common substring it shares with the kth title in the historical title information is removed, and $s_{jk2}$ denotes the second character string obtained from the kth title in the historical title information after the longest continuous common substring it shares with the jth title in the title information to be evaluated is removed;
the step S32 specifically includes: calculating the edit distance between every first character string $s_{jk1}$ and its corresponding second character string $s_{jk2}$ to obtain an edit distance set; each title in the title information to be evaluated has k corresponding edit distances;
the step S33 specifically includes: calculating, from the edit distance set, the similarity between every first character string $s_{jk1}$ and its corresponding second character string $s_{jk2}$, and calculating the similarity between the title information to be evaluated and the historical title information of the ith historical review project from all of these similarity results; each title in the title information to be evaluated has k corresponding similarity results.
Optionally, the outputting the similarity judgment results between the project to be evaluated and all the historical review projects to a display unit for display includes:
if at least one historical review project is similar to the project to be evaluated, outputting the declaration material electronic document of the at least one historical review project to the display unit;
if at least one historical review project is similar to the project to be evaluated, sorting the historical review projects by their similarity to the project to be evaluated, and then outputting the declaration material electronic documents of the M most similar historical review projects to the display unit for display; M is a preset number.
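Selecting the M most similar historical projects amounts to a sort followed by a slice. A minimal sketch, assuming the similarity scores are already available in a dictionary keyed by a hypothetical project identifier (the function name `top_m_similar` is likewise an assumption):

```python
from typing import Dict, List, Tuple


def top_m_similar(scores: Dict[str, float], m: int) -> List[Tuple[str, float]]:
    """Return up to m (project id, similarity) pairs, most similar first."""
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return ranked[:m]
```

If fewer than M projects are available, the slice simply returns all of them.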
Optionally, the step S5 includes:
step S51, inputting the long text information to be evaluated and the historical long text information of the ith historical review project respectively into a pre-trained Doc2vec model, and outputting the corresponding paragraph vector to be evaluated and the historical paragraph vector of the ith historical review project;
step S52, calculating the similarity between the paragraph vector to be evaluated and the historical paragraph vector of the ith historical review project;
and step S53, judging whether the paragraph vector to be evaluated and the historical paragraph vector of the ith historical review project are similar by comparing their similarity with a second similarity threshold.
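Steps S52 and S53 do not name the vector similarity measure in this passage. The sketch below assumes cosine similarity, a common choice for paragraph vectors, and assumes the two vectors have already been produced by a Doc2vec model (for instance via gensim's `Doc2Vec.infer_vector`, which is not called here). The function names and the "greater than" threshold test are assumptions.

```python
import math
from typing import Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors (ASSUMED similarity measure)."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0.0 or norm_v == 0.0:
        # A zero vector has no direction; treat it as dissimilar to everything.
        return 0.0
    return dot / (norm_u * norm_v)


def paragraphs_similar(u: Sequence[float], v: Sequence[float], second_threshold: float) -> bool:
    """Step S53 sketch: compare the similarity with the second similarity threshold."""
    return cosine_similarity(u, v) > second_threshold
```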
Optionally, the step S1 further includes:
text extraction is carried out on the electronic document of the declaration material of the project to be evaluated to obtain project technical field information of the project to be evaluated;
the obtaining of the declaration material electronic document of the ith historical review project in step S2 specifically includes:
acquiring the declaration material electronic document of the ith historical review project from the database of the corresponding project technical field, according to the project technical field information of the project to be evaluated;
wherein all the historical review projects in step S6 are all the historical review projects in the database of the corresponding project technical field.
According to a third aspect, an embodiment of the present invention further provides a computer device, including: the scientific and technological project similarity analysis system; or a memory and a processor, the memory storing computer-readable instructions which, when executed by the processor, cause the processor to perform the steps of the scientific and technological project similarity analysis method.
According to a fourth aspect, an embodiment of the present invention further provides a computer-readable storage medium, on which a computer program is stored, where the computer program, when executed by a processor, implements the scientific and technical project similarity analysis method.
The embodiments of the invention provide a scientific and technological project similarity analysis method and system, computer equipment and a computer-readable storage medium. Title information is extracted from the declaration material electronic documents of the project to be evaluated and of the historical review project, and the similarity of the extracted title information is judged; depending on this preliminary judgment, the long text information of both projects is then extracted and analyzed for similarity, and the analysis result finally determines whether the projects are similar. Based on the text characteristics of science and technology project declaration materials, the method combines short text similarity analysis with long text similarity analysis to judge whether two projects are similar. It can thus help a reviewer quickly judge whether a project has been declared repeatedly, supports the efficiency and accuracy of similarity judgment, enables intelligent assisted project-approval review, avoids duplicate project approval, and improves the efficiency of project management work.
Additional features and advantages of the invention will be set forth in the detailed description which follows.
Drawings
In order to illustrate the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings used in the description of the embodiments or the prior art are briefly described below. The drawings described below show only some embodiments of the present invention; those skilled in the art can derive other drawings from them without creative effort.
Fig. 1 is a flowchart of a scientific and technological project similarity analysis method according to an embodiment of the present invention.
FIG. 2 is a block diagram of a Doc2vec PV-DM according to an embodiment of the invention.
Detailed Description
Various exemplary embodiments, features and aspects of the present disclosure will be described in detail below with reference to the accompanying drawings. In addition, in the following detailed description, numerous specific details are set forth in order to provide a better understanding of the present invention. It will be understood by those skilled in the art that the present invention may be practiced without some of these specific details. In some instances, well known means have not been described in detail so as not to obscure the present invention.
Referring to fig. 1, an embodiment of the present invention provides a method for analyzing similarity of scientific and technological projects, including:
Step S1, obtaining an electronic document of the declaration materials of the project to be evaluated, and performing text extraction on it to obtain the title information to be evaluated of the project to be evaluated;
for example, the title to be evaluated may be "Research on key technologies and development modes of the source-end-base integrated energy system".
Step S2, obtaining the declaration material electronic document of the ith historical review project, and performing text extraction on it to obtain the historical title information of the ith historical review project;
for example, the historical title may be "Research on multi-energy conversion simulation and comprehensive energy efficiency evaluation technology for integrated energy systems".
Step S3, performing short text similarity analysis on the title information to be evaluated and the historical title information of the ith historical review project, and preliminarily judging from the analysis result whether the two are similar; if yes, executing steps S4-S5 in sequence, otherwise executing step S6; wherein the initial value of i is 1;
step S4, performing text extraction on the declaration material electronic document of the project to be evaluated to obtain the long text information to be evaluated, and performing text extraction on the declaration material electronic document of the ith historical review project to obtain the historical long text information of the ith historical review project;
step S5, performing long text similarity analysis on the long text information to be evaluated and the historical long text information of the ith historical review project, and finally judging from the analysis result whether the two are similar;
step S6, judging whether i is less than N; if yes, setting i = i + 1 and returning to step S2; if not, outputting the similarity judgment results between the project to be evaluated and all the historical review projects to a display unit for display, and ending the analysis process; wherein M is a preset number and N is the total number of historical review projects. M and N are integers.
In this method, title information is extracted from the declaration material electronic documents of the project to be evaluated and of the current historical review project, and the similarity of the extracted title information is judged. Because title information is short text, the comparison requires little computation, few computing resources and very little time, which makes it practical to traverse all historical review projects, quickly make a preliminary similarity judgment between the project to be evaluated and each of them, and thereby pre-screen for similar projects. Depending on the preliminary judgment, the long text information of the project to be evaluated and of the historical review project is then extracted and analyzed for similarity, and the analysis result finally determines whether the project to be evaluated and the current historical review project are similar. Based on the text characteristics of science and technology project declaration materials, this embodiment thus combines short text similarity analysis with long text similarity analysis to determine whether two projects are similar.
Optionally, the step S3 includes:
step S31, obtaining the longest continuous common substring between the title information to be evaluated and the historical title information of the ith historical review project, and removing it from each of them to obtain a first character string and a second character string;
illustratively, the longest continuous common substring of "Research on key technologies and development modes of the source-end-base integrated energy system" and "Research on multi-energy conversion simulation and comprehensive energy efficiency evaluation technology for integrated energy systems" is "integrated energy system".
Specifically, this embodiment uses the longest continuous common substring rather than the longest common subsequence (LCS) because the longest common subsequence may split a meaningful noun into single characters, whereas a continuous substring that occurs in both character strings is likely to be a complete noun. The longest continuous common substring problem is to find the longest string that is a substring of two or more given strings; it differs from the longest common subsequence problem in that a subsequence need not be contiguous, whereas a substring must be.
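The difference between the two notions is easy to see on a toy pair of strings. The sketch below is illustrative only and uses hypothetical helper names: the substring search relies on `difflib.SequenceMatcher.find_longest_match` from the Python standard library, and the subsequence length uses the textbook dynamic programme.

```python
from difflib import SequenceMatcher


def longest_contiguous_match(a: str, b: str) -> str:
    """Longest common *substring*: the shared characters must be adjacent."""
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    match = matcher.find_longest_match(0, len(a), 0, len(b))
    return a[match.a:match.a + match.size]


def longest_common_subsequence_length(a: str, b: str) -> int:
    """Length of the longest common *subsequence*: gaps between characters are allowed."""
    prev = [0] * (len(b) + 1)
    for ch in a:
        curr = [0]
        for j, other in enumerate(b, start=1):
            if ch == other:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]
```

For `"abcXdef"` and `"abcYdef"`, the longest contiguous match is `"abc"` (3 characters), whereas the longest common subsequence has length 6 because it is allowed to skip over the differing middle character and stitch `abc` and `def` together.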
Step S32, calculating the edit distance between the first character string and the second character string;
specifically, the edit distance between two character strings is the minimum number of editing operations required to convert one into the other; the editing operations include deletion, insertion, replacement and the like.
The edit distance may be expressed as:
Figure BDA0002773726760000081
where $D(str1, str2, i, j)$ represents the edit distance between the first i characters of the string str1 and the first j characters of the string str2, and $str1_i$ represents the ith character of the string str1. The initial value is $D(str1, str2, 0, 0) = 0$.
The above equation is a recursive definition. For character strings s1 and s2 of lengths m and n respectively, an (m + 1) × (n + 1) matching relationship matrix is typically used to calculate the edit distance. The values of the elements in the matrix are:
Figure BDA0002773726760000091
wherein $d_{i,j}$ denotes the value in row i, column j of the matrix. An example of a matching relationship matrix is given below for the character strings 相似度计算 ("similarity calculation") and 计算相似度 ("calculation similarity"); the resulting edit distance is 4, as shown in Table 1:
TABLE 1 edit distance computation matrix
0    相   似   度   计   算
计   1    2    3    3    4
算   2    2    3    4    3
相   2    3    3    4    4
似   3    2    3    4    5
度   4    3    2    3    4
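The formula images are not reproduced in this text, so the sketch below falls back on the standard Levenshtein recurrence over an (m + 1) × (n + 1) matrix, which matches the operations named above (deletion, insertion, replacement). It is a minimal illustration under that assumption, and the function name is hypothetical.

```python
def edit_distance(s1: str, s2: str) -> int:
    """Levenshtein distance computed with an (m + 1) x (n + 1) matrix."""
    m, n = len(s1), len(s2)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i  # delete all i characters of the s1 prefix
    for j in range(n + 1):
        d[0][j] = j  # insert all j characters of the s2 prefix
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if s1[i - 1] == s2[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # replacement, or free match
            )
    return d[m][n]
```

Applied to the Chinese strings 相似度计算 ("similarity calculation") and 计算相似度 ("calculation similarity"), the function returns 4, the value given for that pair in the text.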
Step S33, calculating the similarity between the title information to be reviewed and the historical title information of the ith historical review project according to the editing distance;
specifically, in this embodiment a set of scientific and technological projects was randomly selected, and project title similarity was calculated with both the existing method and the method of this embodiment; the comparison results are shown in Table 2 below. It can be seen that the edit distance calculated by the method of this embodiment is relatively small, and the resulting similarity is closer to the actual similarity. In addition, when the titles share no common substring, the existing method and the method of this embodiment give the same result.
TABLE 2 name similarity comparison results under different algorithms
Figure BDA0002773726760000092
Figure BDA0002773726760000101
It should be noted that applying the method of this embodiment to project titles gives good results. For example, if a subtitle in the main content of project A is similar to the project title of project B, projects A and B are likely to be related to some degree, and this relation serves as a preliminary basis for judging whether a project has been declared repeatedly. Moreover, the comparison requires little computation. The electronic documents of science and technology project declaration materials are usually large texts; comparing the full text against every historical project in the conventional manner would inevitably consume a great deal of time and computing resources. In this embodiment the second similarity judgment, based on the long text, is performed only when the preliminary judgment finds similarity, which effectively solves this problem.
Step S34, judging whether the title information to be reviewed is similar to the historical title information of the ith historical review project by comparing their similarity with a first similarity threshold;
Specifically, when the similarity is greater than the first similarity threshold, the title information to be reviewed is determined to be similar to that of the ith historical review project, and steps S4 to S5 are then performed.
Optionally, the step S31 includes:
step S311, letting the title information to be reviewed be a character string $s_1$ and the historical title information of the ith historical review project be a character string $s_i$;
step S312, finding the longest continuous common substring $s_z$ of the character strings $s_1$ and $s_i$;
step S313, if the length of the longest continuous common substring $s_z$ is greater than 2, removing $s_z$ from $s_1$ and from $s_i$ to obtain two new character strings $s_{10}$ and $s_{i0}$, letting $s_1 = s_{10}$ and $s_i = s_{i0}$, and returning to step S312; if the length of the longest continuous common substring $s_z$ is less than or equal to 2, outputting $s_{10}$ as the first character string and $s_{i0}$ as the second character string.
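The loop of steps S311 to S313 can be sketched as follows. This is a minimal illustration rather than the patented implementation: the function names are hypothetical, only the first occurrence of the common substring is removed from each string (the text does not say which occurrence to remove), and when the very first common substring is already of length 2 or less the inputs are returned unchanged.

```python
def longest_common_substring(a: str, b: str) -> str:
    """Return the longest contiguous substring shared by a and b.

    Classic dynamic programming over suffix-match lengths; when several
    substrings tie for the maximum length, the first one found in `a` wins.
    """
    best_len, best_end = 0, 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                cur[j] = prev[j - 1] + 1
                if cur[j] > best_len:
                    best_len, best_end = cur[j], i
        prev = cur
    return a[best_end - best_len:best_end]


def strip_common_substrings(s1: str, si: str, max_kept_len: int = 2):
    """Repeat S312/S313: remove the longest common substring while its
    length exceeds `max_kept_len` (2 in the text), then return the pair
    (first string, second string)."""
    while True:
        sz = longest_common_substring(s1, si)
        if len(sz) <= max_kept_len:
            return s1, si
        # Assumption: remove only the first occurrence in each string.
        s1 = s1.replace(sz, "", 1)
        si = si.replace(sz, "", 1)
```

For two hypothetical titles such as "power grid fault diagnosis" and "power grid fault location", the shared prefix is removed in the first pass and the loop stops in the second pass, because the remainders share no substring longer than two characters.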
Optionally, the calculating the similarity between the to-be-reviewed title information and the historical title information of the ith historical review project according to the edit distance includes:
Figure BDA0002773726760000111
wherein $s_{10}$ represents the first character string, $s_{i0}$ represents the second character string, $sim(s_{10}, s_{i0})$ represents the similarity between the title information to be reviewed and the historical title information of the ith historical review project calculated according to the edit distance, ED represents the edit distance between the first character string and the second character string, $len(s_{10})$ represents the length of the first character string, and $len(s_{i0})$ represents the length of the second character string.
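Because the formula itself survives here only as an image reference, the sketch below should be read as an assumption rather than as the patented formula. It computes the edit distance ED (Levenshtein distance) by dynamic programming and normalizes it by the longer of the two string lengths, which is one common way to turn an edit distance into a similarity in [0, 1]. The function names and the choice of normalization are hypothetical.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum number of single-character
    insertions, deletions and substitutions turning a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution or match
        prev = cur
    return prev[-1]


def similarity(s10: str, si0: str) -> float:
    """Assumed normalization: 1 - ED / max(len(s10), len(si0)).

    Two empty strings (everything was removed as common substrings)
    are treated as identical.
    """
    longest = max(len(s10), len(si0))
    if longest == 0:
        return 1.0
    return 1.0 - edit_distance(s10, si0) / longest
```

Whatever normalization the original formula uses, the case where both residual strings are empty needs explicit handling, since both lengths are then zero.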
Optionally, the title information to be reviewed includes the project main title of the project to be reviewed and the subtitles in its research content; the historical title information of the ith historical review project includes the project main title of the ith historical review project and the subtitles in its research content;
Specifically, the declaration material (project declaration form) of a scientific and technological project generally requires a project main title, that is, the project name, and a description of the main research content. The research content is generally described in several aspects, each of which has a subtitle.
The step S31 specifically includes: obtaining the longest continuous common substring between each piece of title information in the title information to be reviewed and each piece of title information in the historical title information of the ith historical review project, and removing the longest continuous common substrings respectively to obtain a first character string $s_{jk1}$ and a second character string $s_{jk2}$; wherein $s_{jk1}$ denotes the first character string obtained from the jth piece of title information in the title information to be reviewed after removing its longest continuous common substring with the kth piece of title information in the historical title information, and $s_{jk2}$ denotes the second character string obtained from the kth piece of title information in the historical title information after removing its longest continuous common substring with the jth piece of title information in the title information to be reviewed;
It should be noted that the project main title and each subtitle in the research content are each regarded as one piece of title information.
The step S32 specifically includes: calculating the edit distance between every first character string $s_{jk1}$ and its corresponding second character string $s_{jk2}$ to obtain an edit distance set; each piece of title information in the title information to be reviewed has k corresponding edit distances;
Specifically, if the title information to be reviewed contains j pieces of title information, it has j × k edit distances in total.
The step S33 specifically includes: calculating, according to the edit distance set, the similarity between every first character string $s_{jk1}$ and its corresponding second character string $s_{jk2}$, and calculating the similarity between the title information to be reviewed and the historical title information of the ith historical review project from all the similarity calculation results; each piece of title information in the title information to be reviewed has k corresponding similarity calculation results.
Specifically, the title information to be reviewed correspondingly has j × k similarity values; the average of these j × k similarity values is output as the similarity between the title information to be reviewed and the historical title information of the ith historical review project.
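The averaging over all j × k title pairs can be sketched as below. The function name is hypothetical, and the pairwise similarity is passed in as a callable so that the sketch does not commit to a particular formula; returning 0.0 when either title list is empty is likewise an assumption.

```python
from statistics import mean
from typing import Callable, Sequence


def title_set_similarity(review_titles: Sequence[str],
                         history_titles: Sequence[str],
                         pair_similarity: Callable[[str, str], float]) -> float:
    """Average the pairwise similarity over every (review title,
    historical title) pair, i.e. over j x k values in total."""
    scores = [pair_similarity(a, b)
              for a in review_titles      # j titles (main title + subtitles)
              for b in history_titles]    # k titles (main title + subtitles)
    return mean(scores) if scores else 0.0
```

Here `review_titles` would hold the main title and research-content subtitles of the project to be reviewed, and `history_titles` those of the ith historical review project.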
Optionally, outputting the similarity judgment results between the project to be reviewed and all the historical review projects to a display unit for display includes:
if at least one historical review project is similar to the project to be reviewed, outputting the declaration material electronic document of the at least one historical review project to the display unit;
if at least one historical review project is similar to the project to be reviewed, sorting all the historical review projects by their similarity to the project to be reviewed, and then outputting the declaration material electronic documents of the M historical review projects with the highest similarity to the display unit for display, where M is a preset number.
Specifically, after the similarity judgment of the method of this embodiment, the M most similar historical review projects are output for the reviewers to confirm further.
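The selection of the M most similar historical review projects can be sketched as follows. The tuple layout `(project_id, similarity, is_similar)` and the function name are hypothetical; following the text, the ranking covers all historical projects once at least one of them has been judged similar.

```python
from typing import List, Tuple


def select_for_display(results: List[Tuple[str, float, bool]], m: int) -> List[str]:
    """Return the ids of the M historical projects with the highest
    similarity, or an empty list when no project was judged similar.

    Each result is (project_id, similarity, is_similar).
    """
    if not any(is_similar for _, _, is_similar in results):
        return []
    ranked = sorted(results, key=lambda r: r[1], reverse=True)
    return [project_id for project_id, _, _ in ranked[:m]]
```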
Optionally, the step S5 includes:
step S51, inputting the long text information to be reviewed and the historical long text information of the ith historical review project respectively into a pre-trained Doc2vec model, and outputting the corresponding paragraph vectors to be reviewed and the historical paragraph vectors of the ith historical review project;
step S52, calculating the similarity between the paragraph vector to be reviewed and the historical paragraph vector of the ith historical review project;
Illustratively, the similarity between two paragraph vectors may be determined from the distance between them: the smaller the distance, the greater the similarity.
It is understood that, in this embodiment, the long text information may cover multiple aspects, such as the project summary and the main research content, each comprising multiple paragraphs. The aspects may be separated and their similarities calculated individually, and a comprehensive calculation is then performed on the per-aspect similarities: for example, the average of the per-aspect similarities is taken as the long text similarity analysis result, or each per-aspect similarity is multiplied by a corresponding preset weight and the products are summed to give the long text similarity analysis result. For the similarity calculation of one aspect, suppose the project to be reviewed has n paragraphs in aspect E and the current historical review project has m paragraphs in aspect E. After the similarity is calculated between every paragraph of the project to be reviewed in aspect E and every paragraph of the current historical review project in aspect E, each of the n paragraphs has m similarity values, giving n × m similarity values in total; their average is taken as the similarity between the project to be reviewed and the current historical review project in aspect E.
And step S53, judging whether the paragraph vector to be reviewed is similar to the historical paragraph vector of the ith historical review project by comparing their similarity with a second similarity threshold.
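The per-aspect averaging and the combination across aspects can be sketched as below, taking the paragraph vectors as given (in the method they would come from the Doc2vec model). Cosine similarity is used here as an assumption, since the text only says that similarity is derived from the distance between vectors; the function names and the dictionary layout are hypothetical.

```python
import math
from typing import Dict, Optional, Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def aspect_similarity(review_vecs, history_vecs) -> float:
    """Average of the n x m paragraph-pair similarities for one aspect."""
    scores = [cosine_similarity(u, v) for u in review_vecs for v in history_vecs]
    return sum(scores) / len(scores) if scores else 0.0


def long_text_similarity(aspect_scores: Dict[str, float],
                         weights: Optional[Dict[str, float]] = None) -> float:
    """Combine per-aspect similarities: plain average by default, or a
    weighted sum when preset weights are supplied."""
    if weights is None:
        return sum(aspect_scores.values()) / len(aspect_scores)
    return sum(weights[name] * score for name, score in aspect_scores.items())
```

The result of `long_text_similarity` is what step S53 would compare against the second similarity threshold.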
Specifically, in this embodiment, the Doc2vec model is trained with the PV-DM (Distributed Memory Model of Paragraph Vectors) training method. Fig. 2 shows the framework of Doc2vec PV-DM in this embodiment. As can be seen from Fig. 2, in addition to the word-level vectors, a vector representation of each paragraph/sentence is added. For example, for the sentence 'the cat sat on', if the word 'on' is to be predicted, the prediction can be based not only on the features generated from the other words, but also on the features generated from the other words together with the sentence vector. Each paragraph/sentence is mapped into a vector space and may be represented by a column of a matrix. Each word is likewise mapped into a vector space and may be represented by a column of a matrix. The paragraph vector and the word vectors are then concatenated or averaged to obtain the features used to predict the next word in the sentence. The paragraph/sentence vector can also be regarded as a word that acts as a memory unit for the context or as the topic of the paragraph. During training, the context length is fixed and the training set is generated with a sliding window; the paragraph/sentence vector is shared across the contexts of that paragraph. The training process of the Doc2vec model in this embodiment mainly includes the following two stages, (i) and (ii):
(i) Training stage: the model is trained on known training data to obtain the word vectors, the softmax parameters, and the paragraph/sentence vectors.
(ii) Inference stage: the vector representation of a new paragraph is obtained. Specifically, more columns are added to the matrix D (the paragraph vector matrix); with the context length fixed, training proceeds as described above using gradient descent to obtain the new D, and thereby the vector representation of the new paragraph.
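For reference, the two stages can be written down in the notation commonly used for PV-DM. Only D (the paragraph vector matrix) is named in this document; the symbols W (word vector matrix), U and b (softmax parameters), C_t (the context words in the sliding window at position t), T (number of positions) and h (the concatenation or average of the selected columns) are introduced here for illustration only.

```latex
% Training stage: maximize the average log-probability of each word
% given its context words C_t and the vector d of its paragraph.
\max_{W,\,D,\,U,\,b}\; \frac{1}{T}\sum_{t=1}^{T} \log p\left(w_t \mid C_t,\, d\right),
\qquad
p\left(w_t \mid C_t,\, d\right) = \frac{\exp\left(y_{w_t}\right)}{\sum_{j}\exp\left(y_j\right)},
\qquad
y = b + U\, h\left(C_t,\, d;\; W,\, D\right)

% Inference stage: a new column d_new is appended to D and fitted by
% gradient descent on the same objective.
d_{\mathrm{new}} = \arg\max_{d}\; \frac{1}{T'}\sum_{t=1}^{T'} \log p\left(w_t \mid C_t,\, d\right)
```

In the standard formulation, the word vectors and softmax parameters learned in stage (i) are held fixed while the new column is fitted, so inference for a new paragraph does not alter the trained model.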
Optionally, the step S1 further includes:
text extraction is carried out on the electronic document of the declaration material of the project to be evaluated to obtain project technical field information of the project to be evaluated;
obtaining the ith historical review project declaration material electronic document in step S2 specifically includes:
acquiring an ith historical review project declaration material electronic document in a database corresponding to the project technical field according to the project technical field information of the project to be reviewed;
wherein all the historical review items in the step S6 are all the historical review items in the database of the corresponding project technology field.
Specifically, since there are many reviewed historical scientific and technological projects, this embodiment further proposes a preliminary classification: the declaration material electronic documents of different types of historical scientific and technological projects are stored in different databases, and during similarity analysis the project to be reviewed is compared only with the historical scientific and technological projects in its own technical field, which effectively reduces the computational workload.
To sum up, this embodiment addresses the large data volume of scientific and technological projects with three targeted measures: first, screening by database classification; second, preliminary similarity screening on short text; and third, secondary similarity screening on long text. Screening layer by layer in this way not only allows the similarity analysis to be carried out accurately, but also keeps the workload small and the processing speed high.
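The three layers of screening can be sketched as a single loop over the historical projects of the matching technical field. The function name, the dictionary keys (`field`, `titles`, `long_text`, `id`) and the two similarity callables are hypothetical; the strict "greater than" comparison follows the description of the first threshold and is assumed for the second.

```python
from typing import Callable, Dict, List, Tuple


def screen_project(review: dict,
                   history_by_field: Dict[str, List[dict]],
                   title_sim: Callable[[List[str], List[str]], float],
                   long_sim: Callable[[str, str], float],
                   first_threshold: float,
                   second_threshold: float) -> List[Tuple[str, float]]:
    """Return (historical project id, long text similarity) for every
    historical project that passes all three screening layers."""
    similar = []
    # Layer 1: only look at the database of the same technical field.
    for hist in history_by_field.get(review["field"], []):
        # Layer 2: cheap short text (title) comparison.
        if title_sim(review["titles"], hist["titles"]) <= first_threshold:
            continue
        # Layer 3: long text comparison, reached only by title-similar projects.
        score = long_sim(review["long_text"], hist["long_text"])
        if score > second_threshold:
            similar.append((hist["id"], score))
    return similar
```

The point of the ordering is that `long_sim`, the expensive call, runs only for the historical projects that survive the first two layers.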
Another embodiment of the present invention further provides a computer device, including: a memory and a processor, wherein the memory stores computer readable instructions, and the computer readable instructions, when executed by the processor, cause the processor to execute the steps of the scientific and technological project similarity analysis method according to the above-mentioned embodiment.
Of course, the computer device may also have components such as a wired or wireless network interface, a keyboard, and an input/output interface, so as to perform input/output, and the computer device may also include other components for implementing the functions of the device, which are not described herein again.
Illustratively, the computer program may be divided into one or more units, which are stored in the memory and executed by the processor to accomplish the present invention. The one or more units may be a series of computer program instruction segments capable of performing certain functions, which are used to describe the execution of the computer program in the computer device.
The processor may be a Central Processing Unit (CPU), another general-purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, etc. The general-purpose processor may be a microprocessor or any conventional processor; the processor is the control center of the computer device and connects the various parts of the overall computer device using various interfaces and lines.
The memory may be used for storing the computer program and/or units, and the processor implements the various functions of the computer device by running or executing the computer program and/or units stored in the memory and calling data stored in the memory. In addition, the memory may include high-speed random access memory, and may also include non-volatile memory, such as a hard disk, an internal memory, a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, a flash memory card (Flash Card), at least one magnetic disk storage device, a flash memory device, or other volatile solid state storage device.
Another embodiment of the present invention further provides a computer-readable storage medium, on which a computer program is stored, where the computer program, when executed by a processor, implements the steps of the scientific and technical project similarity analysis method according to the above-mentioned embodiment.
Specifically, the computer-readable storage medium may include: any entity or device capable of carrying the computer program code, recording medium, usb disk, removable hard disk, magnetic disk, optical disk, computer Memory, Read-Only Memory (ROM), Random Access Memory (RAM), electrical carrier wave signals, telecommunications signals, software distribution medium, and the like.
To sum up, the embodiments of the invention provide a scientific and technological project similarity analysis method and system, computer equipment, and a computer-readable storage medium, in which the title information of the declaration material electronic documents of the project to be reviewed and of the historical review projects is extracted, and the similarity of the extracted title information is judged.
Having described embodiments of the present invention, the foregoing description is intended to be exemplary, not exhaustive, and not limited to the embodiments disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The terminology used herein is chosen in order to best explain the principles of the embodiments, the practical application, or improvements made to the technology in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.

Claims (10)

1. A scientific and technological project similarity analysis method is characterized by comprising the following steps:
step S1, obtaining an electronic document of the declaration material of the project to be evaluated, and performing text extraction on the electronic document to obtain the title information to be evaluated of the project to be evaluated;
step S2, obtaining an ith historical review project declaration material electronic document, and performing text extraction on the ith historical review project declaration material electronic document to obtain historical title information of the ith historical review project;
step S3, carrying out short text similarity analysis according to the information of the subject to be evaluated and the historical title information of the ith historical evaluation project, and preliminarily judging whether the information of the subject to be evaluated and the historical title information of the ith historical evaluation project are similar according to the analysis result; if yes, sequentially executing steps S4-S5, otherwise executing step S6; wherein the initial value of i is 1;
step S4, performing text extraction on the electronic document of the declaration material of the project to be evaluated to obtain long text information to be evaluated of the project to be evaluated, and performing text extraction on the electronic document of the declaration material of the ith historical project to obtain the long text information of the historical project;
step S5, according to the long text information to be evaluated and the historical long text information of the ith historical evaluation project, carrying out long text similarity analysis, and finally judging whether the two are similar according to the analysis result;
step S6, judging whether i is less than N; if yes, making i equal to i +1, and returning to the step S2; if not, outputting the similar judgment results between the project to be evaluated and all the historical evaluation projects to a display unit for displaying, and ending the analysis process; wherein M is a preset number; where N is the total number of historical review items.
2. The scientific and technological project similarity analysis method according to claim 1, wherein the step S3 includes:
step S31, obtaining the longest continuous common substring between the to-be-evaluated subject information and the historical title information of the ith historical evaluation project, and removing the longest continuous common substring from the to-be-evaluated subject information and the historical title information of the ith historical evaluation project respectively to obtain a first character string and a second character string;
step S32, calculating the edit distance between the first character string and the second character string;
step S33, calculating the similarity between the title information to be reviewed and the historical title information of the ith historical review project according to the editing distance;
and step S34, judging whether the information to be evaluated and the historical title information of the ith historical evaluation project are similar or not according to the comparison result of the similarity of the information to be evaluated and the historical title information of the ith historical evaluation project and a first similarity threshold value.
3. The scientific and technological project similarity analysis method according to claim 2, wherein the step S31 includes:
step S311, letting the title information to be evaluated be a character string $s_1$ and the historical title information of the ith historical review project be a character string $s_i$;
step S312, finding the longest continuous common substring $s_z$ of the character strings $s_1$ and $s_i$;
step S313, if the length of the longest continuous common substring $s_z$ is greater than 2, removing $s_z$ from $s_1$ and from $s_i$ to obtain two new character strings $s_{10}$ and $s_{i0}$, letting $s_1 = s_{10}$ and $s_i = s_{i0}$, and returning to step S312; if the length of the longest continuous common substring $s_z$ is less than or equal to 2, outputting $s_{10}$ as the first character string and $s_{i0}$ as the second character string.
4. The method for analyzing similarity of technical projects according to claim 2, wherein the calculating the similarity between the information about the title to be reviewed and the information about the title of the ith historical review project according to the edit distance comprises:
Figure FDA0002773726750000021
wherein $s_{10}$ represents the first character string, $s_{i0}$ represents the second character string, $sim(s_{10}, s_{i0})$ represents the similarity between the title information to be reviewed and the historical title information of the ith historical review project calculated according to the edit distance, ED represents the edit distance between the first character string and the second character string, $len(s_{10})$ represents the length of the first character string, and $len(s_{i0})$ represents the length of the second character string.
5. A scientific and technological project similarity analysis method according to claim 2, wherein the information of the titles to be evaluated comprises project main titles and sub-titles in research contents of the projects to be evaluated; the historical title information of the ith historical review project comprises a project main title of the ith historical review project and a subtitle in research content;
the step S31 specifically includes: obtaining the longest continuous common substring between each piece of title information in the to-be-evaluated title information and each piece of title information in the historical title information of the ith historical evaluation project, and removing the longest continuous common substrings respectively to obtain a first character string $s_{jk1}$ and a second character string $s_{jk2}$; wherein $s_{jk1}$ denotes the first character string obtained from the jth piece of title information in the to-be-evaluated title information after removing its longest continuous common substring with the kth piece of title information in the historical title information, and $s_{jk2}$ denotes the second character string obtained from the kth piece of title information in the historical title information after removing its longest continuous common substring with the jth piece of title information in the to-be-evaluated title information;
the step S32 specifically includes: calculating the edit distance between every first character string $s_{jk1}$ and its corresponding second character string $s_{jk2}$ to obtain an edit distance set; each piece of title information in the to-be-evaluated title information has k corresponding edit distances;
the step S33 specifically includes: calculating, according to the edit distance set, the similarity between every first character string $s_{jk1}$ and its corresponding second character string $s_{jk2}$, and calculating the similarity between the to-be-evaluated title information and the historical title information of the ith historical evaluation project from all the similarity calculation results; each piece of title information in the to-be-evaluated title information has k corresponding similarity calculation results.
6. The scientific and technological project similarity analysis method according to claim 1, wherein the outputting of the similarity determination results between the project to be reviewed and all the historical review projects to a display unit for display comprises:
if at least one historical review project is similar to the project to be reviewed, outputting the declaration material electronic document of the at least one historical review project to a display unit;
if at least one historical evaluation project is similar to the to-be-evaluated project, sorting the similarity of the to-be-evaluated project and all the historical evaluation projects, and then selecting the declaration material electronic documents of the M historical evaluation projects with the highest similarity to output to a display unit for displaying; m is a preset number.
7. The scientific and technological project similarity analysis method according to claim 1, wherein the step S5 includes:
step S51, inputting pre-trained Doc2vec models respectively according to the long text information to be evaluated and the historical long text information of the ith historical evaluation project, and outputting corresponding paragraph vectors to be evaluated and the historical paragraph vectors of the ith historical evaluation project;
step S52, calculating the similarity between the paragraph vector to be reviewed and the historical paragraph vector of the ith historical review project;
and step S53, judging whether the segment vector to be evaluated and the historical segment vector of the ith historical evaluation project are similar or not according to the comparison result of the similarity of the segment vector to be evaluated and the historical segment vector of the ith historical evaluation project and a second similarity threshold.
8. A scientific and technological project similarity analysis method according to claim 1,
the step S1 further includes:
text extraction is carried out on the electronic document of the declaration material of the project to be evaluated to obtain project technical field information of the project to be evaluated;
the obtaining of the ith electronic document of history review project declaration material in step S2 specifically includes:
and acquiring an ith historical review project declaration material electronic document in a database corresponding to the project technical field according to the project technical field information of the project to be reviewed.
9. A computer device, comprising: a memory and a processor, the memory having stored therein computer readable instructions which, when executed by the processor, cause the processor to perform the steps of the scientific and technological project similarity analysis method according to any one of claims 1 to 8.
10. A computer-readable storage medium, on which a computer program is stored, wherein the computer program, when executed by a processor, implements the scientific project similarity analysis method according to any one of claims 1 to 8.
CN202011258083.6A 2020-11-12 2020-11-12 Science and technology project similarity analysis method, computer equipment and storage medium Active CN112199938B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011258083.6A CN112199938B (en) 2020-11-12 2020-11-12 Science and technology project similarity analysis method, computer equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011258083.6A CN112199938B (en) 2020-11-12 2020-11-12 Science and technology project similarity analysis method, computer equipment and storage medium

Publications (2)

Publication Number Publication Date
CN112199938A true CN112199938A (en) 2021-01-08
CN112199938B CN112199938B (en) 2023-11-14

Family

ID=74033475

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011258083.6A Active CN112199938B (en) 2020-11-12 2020-11-12 Science and technology project similarity analysis method, computer equipment and storage medium

Country Status (1)

Country Link
CN (1) CN112199938B (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112784569A (en) * 2021-02-04 2021-05-11 北京秒针人工智能科技有限公司 Method, system, equipment and storage medium for realizing similar text aggregation
CN112926299A (en) * 2021-03-29 2021-06-08 杭州天谷信息科技有限公司 Text comparison method, contract review method and audit system
CN113064979A (en) * 2021-03-10 2021-07-02 国网河北省电力有限公司 Keyword retrieval-based method for judging construction period and price reasonability
CN113704427A (en) * 2021-08-30 2021-11-26 平安科技(深圳)有限公司 Text provenance determination method, device, equipment and storage medium
CN113762719A (en) * 2021-08-03 2021-12-07 远光软件股份有限公司 Text similarity calculation method, computer equipment and storage device
CN115801483A (en) * 2023-02-10 2023-03-14 北京京能高安屯燃气热电有限责任公司 Information sharing processing method and system

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105446954A (en) * 2015-11-18 2016-03-30 广东省科技基础条件平台中心 Project duplicate checking method for science and technology big data
CN106095865A (en) * 2016-06-03 2016-11-09 中细软移动互联科技有限公司 A kind of trade mark text similarity reviewing method
CN107122340A (en) * 2017-03-30 2017-09-01 浙江省科技信息研究院 A kind of similarity detection method for the science and technology item return analyzed based on synonym
CN110163476A (en) * 2019-04-15 2019-08-23 重庆金融资产交易所有限责任公司 Project intelligent recommendation method, electronic device and storage medium
CN111782797A (en) * 2020-07-13 2020-10-16 贵州省科技信息中心 Automatic matching method for scientific and technological project review experts and storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
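张自锋; 周育忠; 陶秀杰: "Analysis of text similarity indicators and research on text similarity analysis methods" (文本相似度指标分析及文本相似性分析方法研究), 信息系统工程 (Information Systems Engineering), no. 04 *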

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112784569A (en) * 2021-02-04 2021-05-11 北京秒针人工智能科技有限公司 Method, system, equipment and storage medium for realizing similar text aggregation
CN112784569B (en) * 2021-02-04 2024-04-19 北京秒针人工智能科技有限公司 Method, system, equipment and storage medium for realizing similar text aggregation
CN113064979A (en) * 2021-03-10 2021-07-02 国网河北省电力有限公司 Keyword retrieval-based method for judging construction period and price reasonability
CN112926299A (en) * 2021-03-29 2021-06-08 杭州天谷信息科技有限公司 Text comparison method, contract review method and audit system
CN112926299B (en) * 2021-03-29 2024-04-09 杭州天谷信息科技有限公司 Text comparison method, contract review method and auditing system
CN113762719A (en) * 2021-08-03 2021-12-07 远光软件股份有限公司 Text similarity calculation method, computer equipment and storage device
CN113704427A (en) * 2021-08-30 2021-11-26 平安科技(深圳)有限公司 Text provenance determination method, device, equipment and storage medium
CN115801483A (en) * 2023-02-10 2023-03-14 北京京能高安屯燃气热电有限责任公司 Information sharing processing method and system

Also Published As

Publication number Publication date
CN112199938B (en) 2023-11-14

Similar Documents

Publication Publication Date Title
CN112199938B (en) Science and technology project similarity analysis method, computer equipment and storage medium
WO2019174132A1 (en) Data processing method, server and computer storage medium
CN108875059B (en) Method and device for generating document tag, electronic equipment and storage medium
US10831993B2 (en) Method and apparatus for constructing binary feature dictionary
CN112199937B (en) Short text similarity analysis method and system, computer equipment and medium thereof
CN110827131B (en) Tax payer credit evaluation method based on distributed automatic feature combination
CN112199940B (en) Project review method and storage medium
CN114547315A (en) Case classification prediction method and device, computer equipment and storage medium
CN112883730A (en) Similar text matching method and device, electronic equipment and storage medium
CN110310012B (en) Data analysis method, device, equipment and computer readable storage medium
CN111429184A (en) User portrait extraction method based on text information
CN112199939B (en) Intelligent recommendation method and storage medium for review experts
CN112381381B (en) Intelligent expert recommendation device
CN113703773A (en) NLP-based binary code similarity comparison method
CN110888983B (en) Positive and negative emotion analysis method, terminal equipment and storage medium
CN114842982B (en) Knowledge expression method, device and system for medical information system
CN114462383B (en) Method, system, storage medium and equipment for obtaining design specification of building drawing
CN114580398A (en) Text information extraction model generation method, text information extraction method and device
CN112199941A (en) Scientific research project evaluation platform
CN112632951A (en) Method, computer equipment and storage medium for intelligently recommending experts
CN113420127A (en) Threat information processing method, device, computing equipment and storage medium
CN114443803A (en) Text information mining method and device, electronic equipment and storage medium
CN112329425B (en) Scientific research project intelligent review method and storage medium
CN112837148B (en) Risk logic relationship quantitative analysis method integrating domain knowledge
CN117494806B (en) Relation extraction method, system and medium based on knowledge graph and large language model

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant