CN107967290A - Knowledge graph network construction method, system, and medium based on massive scientific research data - Google Patents

Knowledge graph network construction method, system, and medium based on massive scientific research data

Info

Publication number
CN107967290A
Authority
CN
China
Prior art keywords
key
key technology
point
technology
participle
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201710928133.9A
Other languages
Chinese (zh)
Inventor
刘玮
马欢
崔佳
王益静
张永铮
常鹏
杨芳
李锐光
林绅文
徐小琳
纪玉春
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
National Computer Network and Information Security Management Center
Original Assignee
National Computer Network and Information Security Management Center
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by National Computer Network and Information Security Management Center
Priority to CN201710928133.9A
Publication of CN107967290A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36 Creation of semantic tools, e.g. ontology or thesauri
    • G06F16/367 Ontology
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking

Abstract

The invention discloses a method, system, and medium for constructing a knowledge graph network based on massive scientific research data. The method is as follows: a project library is parsed, the basic information of each research topic is extracted, and the document information and topic direction of each topic are parsed; subject terms are extracted from the title information of designated documents of each topic and used as the key technologies of that topic; topic directions belonging to the same field are clustered; for the topics in the same cluster, key technical indicators are parsed from each topic's general charter and requirements analysis specification, and the key technologies are then associated with the key technical indicators according to the degree of correlation between them, so that each key technology corresponds to a number of key technical indicators; finally, an association table linking field, topic direction, topic, key technology, and key technical indicator is generated, which is the knowledge graph of the project library. The invention can analyze and extract future technical development trends.

Description

Knowledge graph network construction method, system, and medium based on massive scientific research data
Technical Field
The present invention relates to the field of information extraction and classification of scientific research data, and specifically to a knowledge graph based on massive scientific research data.
Background
Information extraction for a scientific research knowledge graph is characterized by the use of clustering to classify and associate topic directions and key technologies. The knowledge graph also supports switching freely between versions and organizes content from different angles, such as field, direction, key technical indicator, and key technology; operations such as moving and merging nodes can be performed across these angles and across directions.
The application requirement addressed by the invention is to extract information from scientific research documents automatically. Historical topic documents and newly imported topic documents must therefore be parsed automatically so that the necessary attribute information of each topic is extracted. A retrieval and analysis subsystem is also provided, through which users can search for related topic documents and view the topic information they care about, which saves considerable labor and improves work efficiency.
Massive collections of topic documents have two problems. First, the associations between topics are missing, so each year's research priorities are not sufficiently highlighted. Second, in-depth content analysis is lacking, so historical topics are of little help in tracking the research progress and technical level of key technologies. To address these deficiencies, the invention develops an improved TextRank algorithm combined with clustering, and achieves the intended effect.
Summary of the Invention
In view of the problems in the prior art, the object of the invention is to provide a knowledge graph network construction method, system, and medium based on massive scientific research data. By mining historical topic data, the invention captures the research progress and achieved technical level of key technologies, and analyzes and extracts future technical development trends.
The technical solution of the invention is as follows:
A knowledge graph network construction method based on massive scientific research data, comprising the steps of:
1) parsing the project library, extracting the basic information of each topic, and parsing the document information and topic direction of each topic, wherein the basic information includes the field to which the topic belongs;
2) extracting subject terms from the title information of designated documents of each topic and using them as the key technologies of that topic;
3) clustering the topic directions that belong to the same field; for the topics in the same cluster, parsing key technical indicators from each topic's general charter and requirements analysis specification, and then associating the key technologies with the key technical indicators according to the degree of correlation between them, so that each key technology corresponds to a number of key technical indicators; and finally generating the association table of field, topic direction, topic, key technology, and key technical indicator, which is the knowledge graph of the project library.
Further, the topic direction is parsed as follows:
1) segment the target document content of the topic into words and build a directed graph from the segmentation result;
2) for each word node V_i in the directed graph, calculate its final weight S(V_i) by a formula (the formula image is not reproduced in this text), where In(V_i) is the set of word nodes that point to V_i, Out(V_j) is the set of word nodes that V_j points to, d is the damping factor, w_ji is the weight of the edge from word node V_j to word node V_i, and w_i is the comprehensive weight of word node V_i;
3) select a number of words as the topic directions of the topic according to the final weights of the word nodes.
Further, the comprehensive weight of word node V_i is w_i = w1*A_i + w2*B_i + w3*C_i + w4*D_i, where A_i is the TF-IDF of V_i with weight w1, B_i represents the position of V_i with weight w2, C_i represents the part of speech of V_i with weight w3, and D_i represents the length of V_i with weight w4.
Further, the key technologies of a topic are obtained as follows: subject terms are extracted from the titles of the topic's key technology document and development summary report and used as key technologies of the topic; based on semantic analysis, the body text of the technical sections of the key technology document and the development summary report is analyzed to find words followed by the word "technology", and if the segmented word before "technology" is a noun, a nominal verb, or a gerund, that word together with "technology" is taken as a key technology of the topic; nominal-verb or gerund combinations found in the body text of the technical sections of the key technology document and the development summary report that are also keywords of the text are likewise taken as key technologies.
Further, topic directions and key technologies are each merged: topic directions of the same kind or similar to each other are merged together, and key technologies of the same kind or similar to each other are merged together.
Further, for a new topic direction, its similarity to each topic direction in the same field is calculated and the maximum value max is taken; if max exceeds a set threshold K, the new topic direction is recorded as an alternate item of the most similar topic direction; otherwise the new topic direction is added under the corresponding field. For a new key technology, its similarity to each key technology under the same topic direction is calculated and the maximum value max is taken; if max exceeds a set threshold G, the new key technology is recorded as an alternate item of the most similar key technology; otherwise the new key technology is added under the corresponding topic direction.
Further, the basic information also includes the project number, topic title, contract number, undertaking unit, topic security level, requisition number, sponsoring department, sponsoring department contact person, science and technology department director, and participants; the document information includes the project number, topic title, form of results, main research content, research personnel, key technologies, and key technical indicators.
A knowledge graph network construction system based on massive scientific research data, characterized by comprising a topic parsing module, a topic key technology extraction module, and a knowledge graph generation module, wherein:
the topic parsing module is configured to parse the project library, extract the basic information of each topic, and parse the document information and topic direction of each topic, wherein the basic information includes the field to which the topic belongs;
the topic key technology extraction module is configured to extract subject terms from the title information of designated documents of each topic and use them as the key technologies of that topic;
the knowledge graph generation module is configured to cluster the topic directions that belong to the same field; for the topics in the same cluster, to parse key technical indicators from each topic's general charter and requirements analysis specification, and then to associate the key technologies with the key technical indicators according to the degree of correlation between them, so that each key technology corresponds to a number of key technical indicators; and finally to generate the association table of field, topic direction, topic, key technology, and key technical indicator, which is the knowledge graph of the project library.
A computer-readable storage medium storing a computer program, characterized in that the computer program includes instructions for performing each step of any of the methods above.
In the invention, information extraction for the scientific research knowledge graph takes fixed attributes directly from the project library file and parses attributes that are absent from the project library out of the topic documents. The data are preprocessed to remove noise, data that meet the project specification are parsed out, and the data are stored in the database. Topic directions are then parsed: each topic should have one or more topic directions, so a computer program parses the title and main research content of the topic, and the extracted subject terms serve as the directions of that topic. A near-synonym dictionary resolves cases where different words have the same meaning, so that synonymous or same-direction items are clustered together. Key technologies are parsed and clustered: they are extracted on the basis of word frequency and word meaning; then, using a thesaurus and similarity calculation with a fixed threshold, the similarity between key technologies is calculated and highly related key technologies are clustered together to reduce the number of distinct key technologies. The engine first reads the configuration file, monitors the files in the fixed directory named by that configuration, parses the project library, and stores the basic topic information it contains. Existing historical topic documents and newly received topic documents are handled separately. Information is extracted from the topic documents in two parts, basic information and topic direction and key technology information, and a manual candidate system is used to refine the extracted directions and key technologies. Finally, similarity is calculated for topic directions and key technologies to complete the association table.
The information extraction module of the scientific research knowledge graph has three main parts: the web front end, the web back end, and the engine. The web front end interacts with the user: the user operates on the page and requests are sent to the back end. The back end mainly handles business logic and data and provides the abstract extraction function. The engine is responsible for extracting information from scientific research documents, that is, extracting basic information and merging topic directions and key technologies; it works mainly against the database and processes the data. The architecture of the scientific research knowledge graph is shown in Fig. 1. The engine first reads the configuration file, monitors the files in the fixed directory named by that configuration, parses the project library, and stores the basic topic information it contains. Existing historical topic documents and newly received topic documents are handled separately. Information is extracted from the topic documents in two parts, basic information and topic direction and key technology information, and a manual candidate system is used to refine the extracted directions and key technologies. The key technologies used are: data analysis and data mining; the lightweight web framework SSI; and Apache POI and PDFBox.
The user operates the front-end system from a laptop. The front end communicates with the back end, which performs business processing by interacting with the database; high-speed retrieval is provided by the engine on the server. The engine is responsible for parsing data, extracting information, and high-speed retrieval. The back end communicates with the engine over sockets to speed up data transfer. The knowledge graph architecture is shown in Fig. 2. The whole system receives data and documents through a single server; the server parses the documents, extracts information from them, and stores the necessary topic information for use by the other modules. The user accesses the system through a browser on a laptop; the overall deployment of the knowledge graph is shown in Fig. 3.
A series of operations can be performed on the knowledge graph, such as locking, thumbnail view, adding, deleting, and searching, as well as adding or deleting a version. The knowledge graph is organized into successive regions: version, field, direction, key technology, and key technical indicator. Each region has multiple nodes, and each node can be added, deleted, modified, or moved. For example, when one version is selected, that version contains multiple fields, each field contains multiple directions, each direction node contains multiple key technologies, and each key technology contains multiple key technical indicators.
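The version, field, direction, key technology, and key technical indicator hierarchy described above can be pictured as a simple tree. The following is a minimal sketch under stated assumptions: the `Node` class, its method names, and the sample node names are hypothetical and are not taken from the patent.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Levels named in the text, from the root downwards.
LEVELS = ["version", "field", "direction", "key technology", "key technical indicator"]


@dataclass(eq=False)  # identity comparison avoids recursing through parent links
class Node:
    name: str
    level: int  # index into LEVELS
    parent: Optional["Node"] = field(default=None, repr=False)
    children: List["Node"] = field(default_factory=list)

    def add(self, name: str) -> "Node":
        """Create a child one level below this node."""
        if self.level + 1 >= len(LEVELS):
            raise ValueError("key technical indicators have no children")
        child = Node(name, self.level + 1, parent=self)
        self.children.append(child)
        return child

    def move_to(self, new_parent: "Node") -> None:
        """Move this node under another parent that sits one level above it."""
        if self.parent is None or new_parent.level != self.level - 1:
            raise ValueError("target must sit exactly one level above the node")
        self.parent.children.remove(self)
        new_parent.children.append(self)
        self.parent = new_parent


root = Node("v1", 0)                      # a version
f1 = root.add("network security")         # two fields (made-up names)
f2 = root.add("data mining")
d = f1.add("text clustering")             # a direction under the first field
d.move_to(f2)                             # node move across fields
print([c.name for c in f2.children])
```

Restricting moves to a parent exactly one level up keeps the five regions aligned, which is one plausible reading of the node operations described in the text.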
1) Parsing the project library
The project library is parsed to extract basic information about each topic (project number, topic title, contract number, undertaking unit, topic security level, requisition number, field, sponsoring department, sponsoring department contact person, science and technology department director, participants, and so on), which is stored in the database with the state set to 0. The logic flow for parsing project results is shown in Fig. 4.
2) Parsing the document information of each topic
All documents of each topic are read. To cope with documents whose content is incomplete, content information (project number, topic title, form of results, main research content, research personnel, key technologies, key technical indicators, and so on) is extracted from multiple documents to ensure data completeness. The document classification and parsing flow is shown in Fig. 5. First, the FilterMap collection is initialized, and File.listFiles() is called to obtain the files and folders under the topic, which are then traversed. Files are stored in the FilterMap collection by file type under fixed numeric keys (the general charter has key 1 and the requirements specification has key 2); an order is needed because the file containing the most attributes must be extracted first. If an entry is a folder, it is stored in FileList; if an acceptance document exists, only the acceptance document is stored, and the acceptance document is taken as authoritative. Files are stored in fileMap by file type. The documents are parsed to produce the parsing result basic, and the project number is looked up in projectMap. If it exists, the database is updated directly and the record for that project number is removed from projectMap. If it does not exist but the document is an acceptance document, the database is also updated and the record for that project number is removed from projectMap. The topic documents are stored in the retrieval subsystem.
3) Parsing the topic direction of each topic
Each topic has one or more topic directions, and the directions of a topic are extracted from multiple documents. The documents are first segmented by a word segmenter and stop words are removed. Investigation showed that topic directions are combinations of nouns and verbs only, so only words whose part of speech is noun, verb, or adjective are retained. The overall design of topic direction extraction is shown in Fig. 6. A computer program parses the topic title and main research content, the keywords of the text are extracted with the improved TextRank algorithm, and the topic direction is determined in combination with semantic analysis.
4) Parsing the key technologies of each topic
Subject terms are first extracted from the title information in the topic's key technology research report and development summary report; text analysis based on word frequency, part of speech, and semantics is then applied, and the resulting subject terms are taken as key technologies. The flow for parsing key technologies is shown in Fig. 7.
5) Parsing key technical indicators
The indicator content in the topic's general charter and requirements analysis specification is analyzed to parse out key technical indicators. Using a full-text retrieval model, each key technology is treated as a query term and each technical indicator as a document; the relevance between key technologies and key technical indicators is compared, and the indicators relevant to a key technology are associated with it, so that each key technology corresponds to zero or more key technical indicators. The flow for extracting key technical indicators is shown in Fig. 8.
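The query-against-document association described above might be sketched as follows. This is an illustration only: the term-overlap score stands in for whatever full-text retrieval model the system actually uses, and the function names, threshold, and sample strings are all made up for the example.

```python
def relevance(query: str, document: str) -> float:
    """Fraction of query terms that occur in the document (a stand-in score)."""
    q_terms = set(query.lower().split())
    d_terms = set(document.lower().split())
    return len(q_terms & d_terms) / len(q_terms) if q_terms else 0.0


def associate(key_technologies, indicators, threshold=0.6):
    """Map each key technology to the indicators whose relevance reaches the threshold."""
    return {
        tech: [ind for ind in indicators if relevance(tech, ind) >= threshold]
        for tech in key_technologies
    }


# Hypothetical key technologies (queries) and indicators (documents).
techs = ["text clustering", "image compression"]
indicators = [
    "clustering accuracy of text documents above 90%",
    "text parsing speed of 100 documents per minute",
]
links = associate(techs, indicators)
print(links)
```

In this toy run one key technology is linked to a single indicator and the other to none, which mirrors the zero-or-more relationship stated in the text.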
6) Merging topic directions and key technologies
Field, topic direction, key technology, and key technical indicator are treated together as one knowledge graph, and similar topic directions and key technologies are merged, which makes it easier to observe the development trends of topic directions and key technologies within a field. Accordingly, when topic information is stored, the field, topic direction, key technology, and key technical indicator are associated with one another. Topic directions and key technologies with high similarity are merged (see Fig. 9). A field has multiple directions, a direction has multiple key technologies, and a key technology has multiple key technical indicators, forming multiple many-to-many relationships.
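The threshold-based merge can be sketched as below. The sketch assumes a character-level similarity from Python's `difflib` as a stand-in for the thesaurus-based similarity the text describes; `merge_direction`, the threshold value, and the sample directions are hypothetical.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    # Stand-in similarity; the source relies on a thesaurus-based measure.
    return SequenceMatcher(None, a, b).ratio()


def merge_direction(existing: dict, new: str, threshold: float = 0.8) -> str:
    """existing maps each canonical direction to its list of alternate items.
    Returns the canonical direction under which `new` ends up."""
    best, best_score = None, 0.0
    for name in existing:
        score = similarity(name, new)
        if score > best_score:
            best, best_score = name, score
    if best is not None and best_score > threshold:
        existing[best].append(new)   # keep as an alternate of the closest match
        return best
    existing[new] = []               # sufficiently different: add as a new entry
    return new


directions = {"network traffic analysis": []}
print(merge_direction(directions, "network traffic analyses"))
print(merge_direction(directions, "knowledge graph construction"))
```

The same routine would apply to key technologies under a topic direction, with its own threshold.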
Compared with the prior art, the positive effects of the invention are:
In functional testing of information extraction, the file monitoring module correctly monitors files and generates result files (.ok); the topic document parsing module parses documents successfully and issues a prompt for topics that are not in the database; and the merging of topic directions and key technologies also matches the expected results, merging similar topic directions and key technologies. In performance testing of information extraction, topic documents of different sizes were parsed and the parsing speed did not change noticeably, which shows that parsing speed is not directly related to document size; in the parsing completeness test, more than 90% of the attributes were parsed, which meets the requirement. The server IP address must be configured at first login. When uploading text, a staff member selects the data source and clicks Import to upload, and can view the history of data imports. Links and content can be imported in multiple languages. On the scientific research knowledge graph, the user can perform a series of operations: locking, thumbnail view, adding, deleting, and searching, as well as adding or deleting a version. The knowledge tree is organized into the regions version, field, direction, key technology, and key technical indicator, and the user can operate on every node: each node can be added, deleted, modified, or moved. The navigation bar shows all version information, and clicking a version switches the knowledge tree to that version. A node being moved can first be placed on the navigation bar, which serves as a buffer for node moves. The user can use the navigator to change the displayed position of the knowledge tree. After the user clicks a node, the area on the right shows all topics associated with that node, and the user can choose the corresponding time in the time selection box of that area to view the topics of interest.
Brief Description of the Drawings
Fig. 1 is the architecture diagram of the scientific research knowledge graph;
Fig. 2 is the knowledge graph architecture diagram;
Fig. 3 is the overall deployment diagram of the knowledge graph;
Fig. 4 is the logic diagram for parsing the project library;
Fig. 5 is the logic diagram for document classification and parsing;
Fig. 6 is the overall design diagram of topic direction extraction;
Fig. 7 is the logic flow chart for parsing key technologies;
Fig. 8 is the flow chart of technical indicator extraction;
Fig. 9 is the merging flow chart;
Fig. 10 is the information extraction flow chart.
Detailed Description
To make the above features and advantages of the invention clearer, embodiments are described in detail below with reference to the accompanying drawings.
Document information extraction for the scientific research knowledge graph reads the project library file, parses it, and stores the basic topic information in the database. It then parses the topic documents to extract the other important information of each topic, which is processed through the candidate system; finally the complete data are inserted into the database. Similarity is calculated for topic directions and key technologies, those with high similarity are merged so that items of the same kind are grouped together, and the association among field, direction, topic, and key technology is finally formed.
The overall design of topic document extraction comprises the following 9 points:
1) Monitor the directory that stores project libraries. If there is an unprocessed or new project library file, parse it and store the basic topic information in the database. The database is checked before insertion: if it holds no topic information for the project number, the information is stored and the state (status) is set to 0; otherwise it is not stored.
2) Query the database for all data whose status is 0 (only basic information has been extracted from the project library and key information has not yet been parsed), and put (project number, id) into the system global variable Map.
3) Find unprocessed documents (the file exists but has no corresponding .ok file) and put them into a queue.
4) If the unprocessed document is a project library file, parse it, store the topic information in the database in batches, and put the project number and id into the system global variable Map. On success, generate the .ok file.
5) If the unprocessed document is a topic document, put it into the pending queue to wait for an idle thread.
6) Monitor the folder with FileMonitor, and put newly added folders (scientific research documents) into the pending queue.
7) A thread monitors the queue. If the queue is not empty, the head of the queue is taken off and parsed. If the topic's number already exists in Map, the result is updated in the database and the status of that topic in the database is set to 2, and the topic information is sent to the retrieval system.
8) Through manual processing and verification, the basic data of the topic (topic direction, key technology) are revised and updated, and the state is set to 1, which is the final state.
9) Process the topic table: the basic topic information, once sorted out, is inserted into the field table, topic direction table, key technology table, key technical indicator table, field-direction association table, direction-topic association table, and key technology-topic association table, with the state of each set to 1.
The overall information extraction flow is shown in Fig. 10.
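Steps 3) and 4) above rely on a marker file to tell processed files from unprocessed ones. A minimal sketch of that check follows; the marker naming (`<file name>.ok` beside the file) and the sample file names are assumptions, since the text only says that a corresponding .ok file is generated on success.

```python
from pathlib import Path
import tempfile


def find_unprocessed(directory: Path):
    """Return files that have no companion '<name>.ok' marker, sorted by name."""
    files = [p for p in directory.iterdir() if p.is_file() and p.suffix != ".ok"]
    return sorted(p for p in files if not p.with_name(p.name + ".ok").exists())


def mark_processed(path: Path) -> None:
    """Create the '.ok' marker after a file has been parsed successfully."""
    path.with_name(path.name + ".ok").touch()


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "projects_2016.xls").touch()       # hypothetical project library files
    (root / "projects_2017.xls").touch()
    mark_processed(root / "projects_2016.xls")  # pretend this one was parsed
    pending = [p.name for p in find_unprocessed(root)]
print(pending)
```

A marker file makes the monitor restartable: after a crash, anything without a marker is simply queued again.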
Topic direction extraction uses the TextRank algorithm, improved to address its shortcomings: an improved TextRank based on comprehensive weights is used, in which the G1 weighting method is applied to calculate the "comprehensive weight" of each word, and TextRank is modified to compute the keywords of the text on the basis of these comprehensive weights. Extracting topic directions requires processing the main research content of multiple documents. The documents are first segmented by a word segmenter and stop words are removed; because investigation showed that topic directions are combinations of nouns and verbs only, only words whose part of speech is noun, verb, or adjective are retained. The system uses the HanLP word segmentation system, which both segments the text and returns the part of speech of each segmented word. The improved TextRank algorithm is then used to extract the keywords of the topic. In addition, the main research content is segmented and, in combination with the keywords, semantically segmented to determine the final topic directions.
TextRank is a graph-based algorithm for extracting keywords and key phrases: it extracts keywords and subject terms from a text to summarize its research content. The algorithm computes the weight of every node in the graph, and each node's weight is influenced by the other nodes, so the larger a node's weight, the larger the weights of the nodes connected to it. The comprehensive weight is calculated by formula (1):
w_i = w1*A_i + w2*B_i + w3*C_i + w4*D_i (1)
A represents TF-IDF, where TF is term frequency and IDF is inverse document frequency. B represents the position of the word: at the start, at the end, or in the middle of a sentence. C represents the part of speech. D represents the word length. w1 to w4 are the respective weights.
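Formula (1) is a plain weighted sum, which the following minimal sketch makes concrete. The feature scores and the weights (0.4, 0.2, 0.2, 0.2) are placeholder values chosen for illustration; the patent derives the weights with the G1 weighting method and does not list them.

```python
from dataclasses import dataclass


@dataclass
class WordFeatures:
    tfidf: float      # A_i: TF-IDF score of the word
    position: float   # B_i: score for position (sentence start, end, or middle)
    pos_tag: float    # C_i: score for part of speech
    length: float     # D_i: score for word length


def comprehensive_weight(f: WordFeatures, w=(0.4, 0.2, 0.2, 0.2)) -> float:
    """Formula (1): w_i = w1*A_i + w2*B_i + w3*C_i + w4*D_i."""
    w1, w2, w3, w4 = w
    return w1 * f.tfidf + w2 * f.position + w3 * f.pos_tag + w4 * f.length


# Placeholder feature scores for a single word.
example = WordFeatures(tfidf=0.5, position=1.0, pos_tag=0.8, length=0.6)
print(round(comprehensive_weight(example), 3))
```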
The TextRank algorithm builds a directed graph from the word segmentation result of the text. Let G(V, E) be a directed graph with node set V and edge set E, where E is a subset of V × V. For a word node V_i in the graph, the set of all nodes pointing to it is denoted In(V_i), and the set of nodes that a word node V_j points to is denoted Out(V_j). The weight of word node V_i can then be calculated by formula (2) (the formula image is not reproduced in this text):
d is the damping factor, generally taken as 0.85.
However, the TextRank algorithm assumes that nodes do not influence one another's weights differently, that is, that every node is equally important, which is not the case in real text. The invention therefore calculates weights for the individual nodes and gives important nodes larger weights. The calculation formula thus becomes formula (3) (the formula image is not reproduced in this text):
w_ji is the weight of the edge from node V_j to node V_i.
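The formula images are missing from this text, so the standard TextRank updates from the literature are given here for orientation only. They use the symbols defined above and are assumed, not confirmed, to correspond to formulas (2) and (3). In particular, how the comprehensive weight w_i enters formula (3) cannot be recovered from this text, so it is left out.

```latex
% Standard (unweighted) TextRank score, assumed to match formula (2)
S(V_i) = (1 - d) + d \sum_{V_j \in In(V_i)} \frac{1}{|Out(V_j)|} \, S(V_j)

% Standard edge-weighted TextRank score, assumed to underlie formula (3)
S(V_i) = (1 - d) + d \sum_{V_j \in In(V_i)}
         \frac{w_{ji}}{\sum_{V_k \in Out(V_j)} w_{jk}} \, S(V_j)
```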
The improved TextRank keyword extraction algorithm:
1) Segment the main research content with the word segmenter. The set of all words is the node set V; words are linked to one another according to the segmentation result to establish the edge relations between words, giving the edge set E.
2) Use formula (2) to calculate the weights of all nodes in V, iterating until the result converges.
3) After the weight of every node has been calculated, sort the nodes by weight in descending order and select the words within a certain range as the keywords of the text.
The specific steps of the improved TextRank are as follows:
A) Remove stop words from the text to obtain the resulting set of words.
For example: [software, personnel, programmer, advanced, programmer, system, analyst, project, manager]
B) For each segmented word, take the 5 words before and after it and record them as the words it can be associated with.
{
software = [personnel, programmer, advanced, system],
personnel = [software, programmer, advanced, system, analyst],
programmer = [software, personnel, advanced, programmer, system, analyst, project, manager],
advanced = [software, personnel, programmer, system, analyst, project, manager],
system = [software, personnel, advanced, programmer, analyst, project, manager],
analyst = [personnel, advanced, programmer, system, project, manager],
project = [advanced, programmer, system, analyst, manager],
manager = [advanced, programmer, system, analyst, project]
}
C) weight of each word is calculated according to the distance between word and word, i.e., according to calculate word i and front and rear 5 words away from From calculating the weight of word i.Distance calculates weight equation wji=(5-k+1)/5, k is the word number away from this word, dittograph language It is averaged.
{
Software=[personnel (1), programmer (0.8), advanced (0.6), system (0.2)],
Personnel=[software (1), programmer (0.8), advanced (0.8), system (0.4), analyst (0.2)],
Programmer=[software (0.8), personnel (1), advanced (1), and programmer (0.8), system (0.8), analyst (0.6), Project (0.4), is handled (0.2)],
Advanced=[software (0.6), personnel (0.8), programmer (1), system (0.8), analyst (0.6), project (0.4), Handle (0.2)],
System=[software (0.2), personnel (0.4), advanced (0.8), and programmer (0.8), analyst (1), project (0.8), Handle (0.6)],
Analyst=[personnel (0.2), advanced (0.6), programmer (0.6), system (1), project (0.8), manager (0.6)],
Project=[advanced (0.4), programmer (0.4), system (0.8), analyst (1), handles (1)],
Manager=[advanced (0.2), programmer (0.4), system (0.6), analyst (0.8), project (1)]
}。
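As a rough illustration of steps B) and C), the following Python sketch builds the distance-weighted association lists for a token sequence. The function name `distance_weights` and the decision to average only over occurrences that fall inside the window are assumptions made for this illustration, not details stated by the method, so the values it gives for repeated words need not match every figure in the listing above.

```python
from collections import defaultdict

WINDOW = 5  # number of words considered before and after each word, as in step B)

def distance_weights(tokens, window=WINDOW):
    """Return {word: {neighbor: weight}} with weight = (window - k + 1) / window.

    k is the distance in words between two occurrences (k = 1 for adjacent
    words). When a pair occurs several times inside the window, the weights
    are averaged. ASSUMPTION: occurrences outside the window are ignored.
    """
    samples = defaultdict(lambda: defaultdict(list))
    for i, word in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j == i:
                continue
            k = abs(i - j)
            samples[word][tokens[j]].append((window - k + 1) / window)
    # Average the collected weights for every (word, neighbor) pair.
    return {
        word: {nb: sum(ws) / len(ws) for nb, ws in neighbors.items()}
        for word, neighbors in samples.items()
    }

tokens = ["software", "personnel", "programmer", "advanced", "programmer",
          "system", "analyst", "project", "manager"]
weights = distance_weights(tokens)
print(weights["personnel"])
```

The nested dictionary keeps each word's neighbors together, which makes it straightforward to feed the result into a graph-ranking step as edge weights.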
D) Calculate the comprehensive weight w_i of each word according to formula (1).
E) Substitute the result of formula (1) into formula (3) and recalculate the weight of each word, iterating formula (3) until convergence.
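Steps D) and E) can be sketched in Python as follows. The comprehensive weight follows the weighted sum of TF-IDF, position, part-of-speech and length scores given in claim 3. The iteration is a standard weighted TextRank update; because the exact form of formula (3) is not reproduced in this text, the way the node weight w_i enters the update (scaling the (1 - d) term) is an assumption of this sketch, as are the function names, the coefficient values and the default d = 0.85.

```python
def comprehensive_weight(tfidf, position, pos_tag, length, coeffs):
    """w_i = w1*A_i + w2*B_i + w3*C_i + w4*D_i, with coeffs = (w1, w2, w3, w4)."""
    w1, w2, w3, w4 = coeffs
    return w1 * tfidf + w2 * position + w3 * pos_tag + w4 * length

def weighted_textrank(edges, node_weight, d=0.85, tol=1e-6, max_iter=200):
    """Iterate a weighted TextRank update until the scores stop changing.

    edges[j][i] is the edge weight w_ji from point j to point i.
    ASSUMPTION: the node weight w_i multiplies the (1 - d) term; the
    source's formula (3) may combine w_i differently.
    """
    nodes = set(edges)
    for targets in edges.values():
        nodes.update(targets)
    nodes = sorted(nodes)
    # Total outgoing edge weight of every point, used to normalize w_ji.
    out_sum = {j: sum(edges.get(j, {}).values()) for j in nodes}
    score = {v: 1.0 for v in nodes}
    for _ in range(max_iter):
        new_score = {}
        for i in nodes:
            incoming = sum(
                edges[j][i] / out_sum[j] * score[j]
                for j in edges
                if i in edges[j] and out_sum[j] > 0
            )
            new_score[i] = (1 - d) * node_weight.get(i, 1.0) + d * incoming
        delta = max(abs(new_score[v] - score[v]) for v in nodes)
        score = new_score
        if delta < tol:  # converged
            break
    return score

# Hypothetical three-word graph with uniform node weights.
edges = {"a": {"b": 1.0}, "b": {"a": 1.0, "c": 0.5}, "c": {"b": 0.5}}
scores = weighted_textrank(edges, {"a": 1.0, "b": 1.0, "c": 1.0})
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked)
```

Sorting the converged scores in descending order corresponds to step 3) of the keyword extraction algorithm, after which the top-ranked words are taken as keywords.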
Extracting problem document content with semantic fusion technology means extracting the key technologies of a problem from its different documents. The key-technology parsing flow shown in Fig. 7 extracts the key technologies of a problem from the following parts.
1) Extract the titles of key technologies from the key technology document and the development final report, and filter the titles: each filter word is replaced with an empty string, and the title is taken as a key technology if the remaining string is longer than 5 characters;
2) Semantic analysis: analyze the text under the technology sections of the key technology document and the development final report, find the words that are followed by "technology", and judge whether the segmented word preceding "technology" is a noun, a nominal verb or a gerund;
3) As in the method for analyzing the problem direction, search the text under the technology sections of the key technology document and the development final report for nominal-verb combinations or gerund combinations; a combination that is also a text keyword is taken as a key technology.
When the titles of key technologies are extracted from the key technology document and the development final report, statistics show that certain words may appear that interfere with the key technologies, and some titles are not technology titles at all. Both affect the extraction of key technologies, so if such words appear they must each be filtered out.
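The title filtering in step 1) can be sketched in Python as below. The filter words listed here are hypothetical English placeholders (the method derives the real list from statistics over actual titles), and the function name `filter_titles` is likewise an assumption. The length threshold of 5 comes from the text, where it counts characters of the original-language titles.

```python
# Hypothetical filter words, for illustration only.
FILTER_WORDS = ["overview", "summary", "research on"]
MIN_LENGTH = 5  # a title is kept only if more than 5 characters remain

def filter_titles(titles, filter_words=FILTER_WORDS, min_length=MIN_LENGTH):
    """Replace filter words with the empty string and keep long-enough remainders."""
    kept = []
    for title in titles:
        cleaned = title
        for word in filter_words:
            cleaned = cleaned.replace(word, "")
        cleaned = " ".join(cleaned.split())  # collapse leftover whitespace
        if len(cleaned) > min_length:
            kept.append(cleaned)
    return kept

titles = ["overview", "research on distributed storage scheduling", "summary"]
key_technologies = filter_titles(titles)
print(key_technologies)
```

Titles that consist only of filter words are reduced to an empty string and dropped, which matches the observation that some titles are not technology titles.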
The database stores the basic directions, key technologies and key technical indexes; a manual recommendation system is used to improve the accuracy of the directions, key technologies and indexes. The basic directions of the problems are read from the database, the directions within the same field are clustered, and the key technologies under similar directions of the same field are clustered, finally generating the association table of field-direction-problem-key technology-key technical index.
Because problem directions and key technologies are to be merged, with items of the same type or similar items merged together, similarity must be calculated. Several similarity algorithms are therefore applied to the short texts, the mean of the resulting text similarities is taken, and texts with high similarity are merged together.
1) Simple shared words
The simple shared-word similarity is calculated from the total number of characters in the words that the documents share and the number of characters in the longest document. Specifically, the total number of characters of the shared words is divided by the number of characters in the longest document, and the result is used to assess similarity.
2) Edit distance
This method counts the number of operations needed to convert one character string into another. If the number of operations is large, the strings differ greatly and their similarity is low; conversely, a small number of operations indicates high similarity.
3) Cosine similarity
This method reflects the degree of similarity through the cosine of the angle between two vectors. If the angle is large (the cosine value is small), the degree of similarity is low; conversely, a small angle (a large cosine value) indicates that the texts are close.
4) Jaccard similarity coefficient
The Jaccard similarity is calculated on sets: the two sentences are each split into a set, and the size of the intersection of the sets is divided by the size of their union to measure the relation between them.
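The four measures and their mean can be sketched in Python using only the standard library. This is an illustration under stated assumptions: texts are split on whitespace instead of using a word segmenter, the edit distance is turned into a similarity as 1 - distance / (longer length), which the text does not specify, and all function names are hypothetical.

```python
import math
from collections import Counter

def shared_word_similarity(a, b):
    """Characters in shared words divided by the character count of the longer text."""
    shared = set(a) & set(b)
    longest = max(sum(map(len, a)), sum(map(len, b)))
    return sum(len(w) for w in shared) / longest if longest else 0.0

def edit_distance(s, t):
    """Levenshtein distance: insertions, deletions and substitutions."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        cur = [i]
        for j, ct in enumerate(t, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (cs != ct)))
        prev = cur
    return prev[-1]

def edit_similarity(s, t):
    """ASSUMPTION: normalize the distance by the longer string's length."""
    longest = max(len(s), len(t))
    return 1.0 - edit_distance(s, t) / longest if longest else 1.0

def cosine_similarity(a, b):
    """Cosine of the angle between the word-count vectors of two token lists."""
    ca, cb = Counter(a), Counter(b)
    dot = sum(ca[w] * cb[w] for w in ca)
    norm = (math.sqrt(sum(v * v for v in ca.values()))
            * math.sqrt(sum(v * v for v in cb.values())))
    return dot / norm if norm else 0.0

def jaccard_similarity(a, b):
    """Size of the set intersection divided by the size of the set union."""
    sa, sb = set(a), set(b)
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0

def mean_similarity(text_a, text_b):
    """Average of the four measures for two short texts."""
    a, b = text_a.split(), text_b.split()
    scores = [
        shared_word_similarity(a, b),
        edit_similarity(text_a, text_b),
        cosine_similarity(a, b),
        jaccard_similarity(a, b),
    ]
    return sum(scores) / len(scores)

print(mean_similarity("network traffic analysis", "network traffic anomaly analysis"))
```

Averaging several measures dampens the weaknesses of any single one: for example, edit distance is sensitive to word order while the set-based measures are not.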
The basic directions, key technologies and key technical indexes of the problems are read from the database, the directions of the problems within the same field are merged, and the key technologies under the same direction are merged. For a new problem direction, its similarity to the problem directions in the same field is calculated; for a new key technology, its similarity to the key technologies under the same direction is calculated. The maximum value max is taken. If max exceeds a certain threshold, the new item is considered to belong under the matching problem direction or key technology and is recorded as an alternate item of that problem direction or key technology. If max does not exceed the threshold, the new problem direction is added under the field, or the new key technology is added under the direction.
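The threshold rule above can be sketched in Python as follows. The data layout (a dictionary mapping each known item to its list of alternates), the function names, the threshold value and the use of a plain Jaccard measure as the similarity function are all assumptions made for this illustration; the method itself uses the mean of several similarity measures.

```python
def jaccard(a, b):
    """Word-level Jaccard similarity of two whitespace-separated texts."""
    sa, sb = set(a.split()), set(b.split())
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0

def place_new_item(new_item, existing, threshold, similarity=jaccard):
    """Attach new_item to its most similar existing item, or add it as a new one.

    existing maps each known direction (or key technology) to its alternates.
    Returns the item under which new_item ended up.
    """
    best, max_sim = None, 0.0
    for item in existing:
        sim = similarity(new_item, item)
        if sim > max_sim:
            best, max_sim = item, sim
    if best is not None and max_sim > threshold:
        existing[best].append(new_item)  # alternate of the most similar item
        return best
    existing[new_item] = []  # no close match: add as a new entry
    return new_item

directions = {"network traffic analysis": []}
first = place_new_item("network traffic anomaly analysis", directions, threshold=0.6)
second = place_new_item("image recognition", directions, threshold=0.6)
print(directions)
```

The same function serves both cases: called with the directions of one field it places a new problem direction, and called with the key technologies of one direction it places a new key technology.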
The above embodiments are intended only to illustrate the technical solution of the present invention, not to limit it. A person of ordinary skill in the art may modify the technical solution of the present invention or replace it with equivalents without departing from the spirit and scope of the present invention. The scope of protection of the present invention shall be as set out in the claims.

Claims (10)

1. A method for constructing a knowledge graph network based on massive scientific research data, comprising the steps of:
1) parsing a project library, extracting the basic information of each problem and parsing the document information and the problem direction of each problem; wherein the basic information includes the field to which the problem belongs;
2) extracting topic words from the title information of the set documents of each problem as the problem key technologies of the corresponding problem;
3) clustering the problem directions that belong to the same field; for the problems in the same cluster result, parsing key technical indexes from the general charter and the demand analysis specification of each problem, and then, according to the degree of correlation between the problem key technologies and the key technical indexes of each problem in the same cluster result, associating the problem key technologies with the key technical indexes so that each key technology corresponds to several key technical indexes; and finally generating the association table of field-problem direction-problem-key technology-key technical index, i.e. the knowledge graph of the project library.
2. The method according to claim 1, characterized in that the method for parsing the problem direction is:
1) segmenting the content of the target document of the problem into words, and forming a directed graph from the segmentation result;
2) for each segmented-word point V_i in the directed graph, calculating the final weight S(v_i) of the segmented-word point V_i using the formula; wherein In(V_i) is the set of segmented-word points pointing to V_i, Out(V_j) is the set of segmented-word points to which V_j points, d is an adjustment factor, w_ji is the weight of the edge from segmented-word point v_j to segmented-word point v_i, and w_i is the comprehensive weight of segmented-word point V_i;
3) selecting several segmented words as the problem direction of the problem according to the final weights of the segmented-word points.
3. The method according to claim 2, characterized in that the comprehensive weight of segmented-word point V_i is w_i = w1*A_i + w2*B_i + w3*C_i + w4*D_i; wherein A_i is the TF-IDF of V_i, with weight w1; B_i represents the position of V_i, with weight w2; C_i represents the part of speech of V_i, with weight w3; and D_i represents the length of V_i, with weight w4.
4. The method according to claim 1, characterized in that the method for obtaining the problem key technologies is: extracting topic words from the titles of the key technology document and the development final report of the problem as key technologies of the problem; performing semantic analysis on the text of the technology sections of the key technology document and the development final report, finding the words in the text that contain "technology", and judging whether the segmented word preceding "technology" is a noun, a nominal verb or a gerund, and if so, taking "technology" together with the preceding segmented word as a key technology of the problem; and searching the text of the technology sections of the key technology document and the development final report for nominal-verb combinations or gerund combinations, a combination that is also a text keyword being taken as a key technology.
5. The method according to claim 1, characterized in that the problem directions and the key technologies are merged separately: problem directions of the same type or similar problem directions are merged together, and key technologies of the same type or similar key technologies are merged together.
6. The method according to claim 5, characterized in that, for a new problem direction, its similarity to the problem directions in the same field is calculated and the maximum value max is taken; if max exceeds a set threshold K, the new problem direction is taken as an alternate item of the problem direction most similar to it, and otherwise the new problem direction is added under the corresponding field; for a new key technology, its similarity to the key technologies under the same problem direction is calculated and the maximum value max is taken; if max exceeds a set threshold G, the new key technology is taken as an alternate item of the key technology most similar to it, and otherwise the new key technology is added under the corresponding problem direction.
7. The method according to any one of claims 1 to 6, characterized in that the basic information further includes the project number, problem title, contract number, carrier, problem confidentiality level, requisition number, sponsoring department, contact person of the sponsoring department, person in charge of the science and technology department, and participants; and the document information includes the project number, problem title, form of achievement, main research content, scientific research personnel, key technologies and key technical indexes.
8. A system for constructing a knowledge graph network based on massive scientific research data, characterized by comprising a problem parsing module, a problem key technology extraction module and a knowledge graph generation module; wherein
the problem parsing module is configured to parse a project library, extract the basic information of each problem and parse the document information and the problem direction of each problem; wherein the basic information includes the field to which the problem belongs;
the problem key technology extraction module is configured to extract topic words from the title information of the set documents of each problem as the problem key technologies of the corresponding problem;
the knowledge graph generation module is configured to cluster the problem directions that belong to the same field; for the problems in the same cluster result, to parse key technical indexes from the general charter and the demand analysis specification of each problem, and then, according to the degree of correlation between the problem key technologies and the key technical indexes of each problem in the same cluster result, to associate the problem key technologies with the key technical indexes so that each key technology corresponds to several key technical indexes; and finally to generate the association table of field-problem direction-problem-key technology-key technical index, i.e. the knowledge graph of the project library.
9. The system according to claim 8, characterized in that the problem parsing module segments the content of the target document of the problem into words and forms a directed graph from the segmentation result; then, for each segmented-word point V_i in the directed graph, calculates the final weight S(v_i) of the segmented-word point V_i using the formula; wherein In(V_i) is the set of segmented-word points pointing to V_i, Out(V_j) is the set of segmented-word points to which V_j points, d is an adjustment factor, w_ji is the weight of the edge from segmented-word point v_j to segmented-word point v_i, and w_i is the comprehensive weight of segmented-word point V_i; and then selects several segmented words as the problem direction of the problem according to the final weights of the segmented-word points.
10. A computer-readable storage medium storing a computer program, characterized in that the computer program comprises instructions for performing each step of the method according to any one of claims 1 to 7.
CN201710928133.9A 2017-10-09 2017-10-09 A kind of knowledge mapping network establishing method and system, medium based on magnanimity scientific research data Pending CN107967290A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710928133.9A CN107967290A (en) 2017-10-09 2017-10-09 A kind of knowledge mapping network establishing method and system, medium based on magnanimity scientific research data

Publications (1)

Publication Number Publication Date
CN107967290A true CN107967290A (en) 2018-04-27

Family

ID=61997426

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710928133.9A Pending CN107967290A (en) 2017-10-09 2017-10-09 A kind of knowledge mapping network establishing method and system, medium based on magnanimity scientific research data

Country Status (1)

Country Link
CN (1) CN107967290A (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102760058A (en) * 2012-04-05 2012-10-31 中国人民解放军国防科学技术大学 Massive software project sharing method oriented to large-scale collaborative development
CN103927360A (en) * 2014-04-18 2014-07-16 北京大学 Software project semantic information presentation and retrieval method based on graph model
CN106205248A (en) * 2016-08-31 2016-12-07 北京师范大学 A kind of representative learning person generates system and method at the on-line study cognitive map of domain-specific knowledge learning and mastering state
CN106815293A (en) * 2016-12-08 2017-06-09 中国电子科技集团公司第三十二研究所 System and method for constructing knowledge graph for information analysis
CN106844658A (en) * 2017-01-23 2017-06-13 中山大学 A kind of Chinese text knowledge mapping method for auto constructing and system
CN107193870A (en) * 2017-04-12 2017-09-22 广东万丈金数信息技术股份有限公司 The extracting method and system of web page contents

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
王炎: "基于多数据源的专家学术网络构建及其应用", 《中国优秀硕士学位论文全文数据库 信息科技辑》 *
葛斌等: "基于无向图构建策略的主题句抽取", 《计算机科学》 *
陈兴元等: "科研活动与知识图谱关系的探讨", 《无线互联科技》 *

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109087053A (en) * 2018-06-01 2018-12-25 平安科技(深圳)有限公司 Synergetic office work processing method, device, equipment and medium based on associated topologies figure
CN109087053B (en) * 2018-06-01 2023-05-09 平安科技(深圳)有限公司 Collaborative office processing method, device, equipment and medium based on association topological graph
CN109241278A (en) * 2018-07-18 2019-01-18 绍兴诺雷智信息科技有限公司 Scientific research knowledge management method and system
CN109241278B (en) * 2018-07-18 2022-04-26 绍兴诺雷智信息科技有限公司 Scientific research knowledge management method and system
CN110119473A (en) * 2019-05-23 2019-08-13 北京金山数字娱乐科技有限公司 A kind of construction method and device of file destination knowledge mapping
CN111126034A (en) * 2019-12-17 2020-05-08 南京医基云医疗数据研究院有限公司 Medical variable relation processing method and device, computer medium and electronic equipment
CN111126034B (en) * 2019-12-17 2023-09-19 南京医基云医疗数据研究院有限公司 Medical variable relation processing method and device, computer medium and electronic equipment
CN112800243A (en) * 2021-02-04 2021-05-14 天津德尔塔科技有限公司 Project budget analysis method and system based on knowledge graph
CN113569060A (en) * 2021-09-24 2021-10-29 中国电子技术标准化研究院 Standard text based knowledge graph disambiguation method, system, device and medium
CN113642031A (en) * 2021-10-15 2021-11-12 中国铁道科学研究院集团有限公司科学技术信息研究所 Subject acceptance method and system
CN115186111A (en) * 2022-09-13 2022-10-14 中国医学科学院医学信息研究所 Index data semantic association and fusion method, system and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
AD01 Patent right deemed abandoned (effective date of abandoning: 20220624)