CN107967290A - Method, system, and medium for constructing a knowledge graph network from massive scientific research data - Google Patents
- Publication number
- CN107967290A (application CN201710928133.9A)
- Authority
- CN
- China
- Prior art keywords
- key
- key technology
- point
- technology
- participle
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/36—Creation of semantic tools, e.g. ontology or thesauri
- G06F16/367—Ontology
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
Abstract
The invention discloses a method, a system, and a medium for constructing a knowledge graph network from massive scientific research data. The method is as follows: the project library is parsed, the basic information of each research topic is extracted, and each topic's document information and topic direction are parsed; subject terms are extracted from the title information of the designated documents of each topic and taken as the key technologies of that topic; the topic directions belonging to the same field are clustered; for the topics in the same cluster, key technical indicators are parsed from each topic's general charter and requirements analysis specification, and key technologies are then associated with key technical indicators according to their degree of relevance, so that each key technology corresponds to a number of key technical indicators; finally, the association table field - topic direction - topic - key technology - key technical indicator is generated, which is the knowledge graph of the project library. The invention makes it possible to analyze and extract future technology development trends.
Description
Technical field
The present invention relates to the field of information extraction and classification for scientific research data, and in particular to a knowledge graph based on massive scientific research data.
Background
Information extraction for a scientific research knowledge graph is characterized by the use of clustering to classify and associate topic directions and key technologies. The knowledge graph also supports arbitrary switching between versions and is organized from different angles, such as field, direction, key technical indicator, and key technology; operations such as moving and merging nodes can be carried out between these angles and between different directions.
The application addressed by the invention is the automated extraction of information from scientific research documents. This requires automated parsing of both historical topic documents and newly imported topic documents, automatic extraction of the necessary attribute information of each topic, and a retrieval and analysis subsystem through which users can retrieve related topic documents and view the topic information they care about. This greatly saves human resources and improves work efficiency.
The mass of topic document information has two shortcomings. First, the association relationships between topics are missing, so the key targets of each year's scientific research are not sufficiently highlighted. Second, in-depth analysis of content is lacking, so historical topics contribute little to grasping the research progress and technical level of key technologies. To address these deficiencies, the present invention develops an improved TextRank algorithm and a clustering technique, and achieves the desired effect.
Summary of the invention
In view of the problems in the prior art, the object of the invention is to provide a method, a system, and a medium for constructing a knowledge graph network from massive scientific research data. By mining historical topic data, the invention captures the research progress of key technologies and the technical level achieved, and analyzes and extracts future technology development trends.
The technical solution of the invention is as follows.
A method for constructing a knowledge graph network from massive scientific research data, comprising the following steps:
1) the project library is parsed, the basic information of each topic is extracted, and each topic's document information and topic direction are parsed, wherein the basic information includes the field to which the topic belongs;
2) subject terms are extracted from the title information of the designated documents of each topic and taken as the key technologies of that topic;
3) the topic directions belonging to the same field are clustered; for the topics in the same cluster, key technical indicators are parsed from each topic's general charter and requirements analysis specification; the key technologies of each topic in the cluster are then associated with the key technical indicators according to their degree of relevance, so that each key technology corresponds to a number of key technical indicators; finally, the association table field - topic direction - topic - key technology - key technical indicator is generated, which is the knowledge graph of the project library.
Further, the topic direction is parsed as follows:
1) the content of the topic's target documents is segmented into words, and a directed graph is built from the segmentation result;
2) for each word node V_i in the directed graph, the final weight S(V_i) of the node is calculated with the formula [formula not reproduced in the source text], where In(V_i) is the set of word nodes that point to V_i, Out(V_j) is the set of word nodes that V_j points to, d is the adjustment coefficient, w_ji is the weight of the edge from word node V_j to word node V_i, and w_i is the comprehensive weight of word node V_i;
3) a number of words are selected as the topic direction of the topic according to the final weights of the word nodes.
Further, the comprehensive weight of word node V_i is w_i = w_1*A_i + w_2*B_i + w_3*C_i + w_4*D_i, where A_i is the TF-IDF of V_i, with weight w_1; B_i represents the position of V_i, with weight w_2; C_i represents the part of speech of V_i, with weight w_3; and D_i represents the length of V_i, with weight w_4.
Further, the key technologies of a topic are obtained as follows: subject terms are extracted from the titles of the topic's key technology document and development summary report and taken as key technologies of the topic; by semantic analysis, the text of the technical sections of the key technology document and development summary report is analyzed to find occurrences of the word "technology", and it is judged whether the segmented word preceding "technology" is a noun, a nominal verb, or a gerund; if so, "technology" together with the preceding segmented word is taken as a key technology of the topic; in addition, nominal-verb combinations or gerund combinations are sought in the technical text of the key technology document and development summary report, and a combination that is also a keyword of the text is taken as a key technology.
Further, topic directions and key technologies are each merged: topic directions of the same or similar kind are merged together, and key technologies of the same or similar kind are merged together.
Further, for a new topic direction, its similarity to each topic direction under the same field is calculated and the maximum value max is taken; if max exceeds a set threshold K, the new topic direction is recorded as an alternative item of the topic direction most similar to it; otherwise the new topic direction is added under the corresponding field. Likewise, for a new key technology, its similarity to each key technology under the topic direction is calculated and the maximum value max is taken; if max exceeds a set threshold G, the new key technology is recorded as an alternative item of the key technology most similar to it; otherwise the new key technology is added under the corresponding topic direction.
Further, the basic information also includes the project number, topic title, contract number, undertaking unit, topic security classification, requisition number, sponsoring department, contact person at the sponsoring department, science and technology department director, and participants; the document information includes the project number, topic title, form of results, main research content, research personnel, key technologies, and key technical indicators.
A system for constructing a knowledge graph network from massive scientific research data, characterized in that it comprises a topic parsing module, a topic key technology extraction module, and a knowledge graph generation module, wherein:
the topic parsing module is used to parse the project library, extract the basic information of each topic, and parse each topic's document information and topic direction, wherein the basic information includes the field to which the topic belongs;
the topic key technology extraction module is used to extract subject terms from the title information of the designated documents of each topic and take them as the key technologies of that topic;
the knowledge graph generation module is used to cluster the topic directions belonging to the same field; for the topics in the same cluster, to parse key technical indicators from each topic's general charter and requirements analysis specification; to associate the key technologies of each topic in the cluster with the key technical indicators according to their degree of relevance, so that each key technology corresponds to a number of key technical indicators; and finally to generate the association table field - topic direction - topic - key technology - key technical indicator, which is the knowledge graph of the project library.
A computer-readable storage medium storing a computer program, characterized in that the computer program comprises instructions for carrying out the steps of any of the methods described above.
In the information extraction for the scientific research knowledge graph of the present invention, fixed attributes are extracted directly from the project library file, and attributes that are not present in the project library are parsed from the topic documents. The data are preprocessed to remove noise, data that conform to the project specification are parsed out, and the data are stored in the database. Topic directions are then parsed: each topic should have one or more topic directions, and a computer program parses the topic's title and main research content and takes the extracted subject terms as the directions of the topic. The problem of different words with the same meaning is handled with a near-synonym dictionary, so that synonymous or like directions are clustered together. Key technologies are parsed and clustered: key technologies are extracted on the basis of word frequency and word meaning; using a thesaurus and similarity calculation with a fixed threshold, the similarity between key technologies is calculated and highly related key technologies are clustered together to reduce the number of distinct key technologies. The engine first reads the configuration file, monitors the folders under the fixed directory given in the configuration file, parses the project library, and stores the basic topic information from the project library. Processing is split into two parts: existing historical topic documents and newly received topic documents. Information is extracted from the topic documents, divided into basic information and topic direction and key technology information, and the extracted directions and key technologies are refined through a manual candidate system. Finally, similarity is calculated for topic directions and key technologies to complete the association table.
The scientific research knowledge graph information extraction module is divided into three main parts: the web front end, the web back end, and the engine. The web front end interacts with the user: the user operates the pages, and requests are sent to the back end. The back end mainly handles business logic and data, and provides the summary extraction function. The engine is responsible for extracting information from scientific research documents, that is, extracting basic information and merging topic directions and key technologies; it works mainly against the database and processes the data. The architecture of the scientific research knowledge graph system is shown in Fig. 1. The engine first reads the configuration file, monitors the folders under the fixed directory given in the configuration file, parses the project library, and stores the basic topic information from the project library. Existing historical topic documents and newly received topic documents are handled as two separate parts. Information is extracted from the topic documents, divided into basic information and topic direction and key technology information, and the extracted directions and key technologies are refined through the manual candidate system. The key technologies used are: data analysis and data mining; the lightweight web framework SSI; and Apache POI and PDFBox.
The user operates the front-end system from a laptop. The front end communicates with the back end, and the back end carries out the business processing by interacting with the database; high-speed retrieval is provided by the engine on the server. The engine is responsible for parsing data, extracting information, and high-speed retrieval. The back end and the engine communicate over a socket to increase the speed of data transfer. The knowledge graph architecture is shown in Fig. 2. The whole system receives data and documents through a single server; the server parses the documents, extracts information from them, and stores the necessary topic information for use by the other modules. The user accesses the system through a browser on the laptop. The overall deployment of the knowledge graph is shown in Fig. 3.
A series of operations can be performed on the knowledge graph, such as locking, thumbnail view, adding, deleting, and searching, as well as adding or deleting a version. The knowledge graph is divided, in order, into regions for version, field, direction, key technology, and key technical indicator. Each region contains multiple nodes, and each node can be added, deleted, modified, or moved. For example, when one version is selected, that version contains multiple fields, each field contains multiple directions, each direction node contains multiple key technologies, and each key technology contains multiple key technical indicators.
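The five-level hierarchy and the node operations described above map naturally onto a small tree structure. The following is only a sketch under assumptions: the class `Node`, the helper `move`, and the rule that a node may only move between parents of the same level are illustrative choices, not details given in the text.

```python
from dataclasses import dataclass, field

# Levels in the order the text lists them.
LEVELS = ("version", "field", "direction", "key technology", "key technical indicator")

@dataclass(eq=False)  # identity-based equality, so moving removes exactly this node
class Node:
    name: str
    level: int
    children: list = field(default_factory=list)

    def add(self, name):
        """Create a child node one level below this node."""
        if self.level + 1 >= len(LEVELS):
            raise ValueError("key technical indicators have no children")
        child = Node(name, self.level + 1)
        self.children.append(child)
        return child

def move(node, old_parent, new_parent):
    """Move a node between two parents of the same level (assumed rule)."""
    if old_parent.level != new_parent.level:
        raise ValueError("parents must be on the same level")
    old_parent.children.remove(node)
    new_parent.children.append(node)
```

A version node is created with `Node("v1", 0)`, and fields, directions, key technologies, and indicators are then attached with successive `add` calls.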
1) Parsing the project library
The project library is parsed to extract the basic information of each topic (project number, topic title, contract number, undertaking unit, topic security classification, requisition number, field, sponsoring department, contact person at the sponsoring department, science and technology department director, participants, and so on), which is stored in the database with the status set to 0. The logic flow for parsing the project library is shown in Fig. 4.
2) Parsing the document information of each topic
All documents of each topic are read. Because the content of any single document may be incomplete, content information (project number, topic title, form of results, main research content, research personnel, key technologies, key technical indicators, and so on) is extracted from multiple documents to ensure that the data are complete. The document classification and parsing flow is shown in Fig. 5. First the FilterMap collection is initialized, and File.listFiles() is called to obtain the files and folders under the topic, which are then traversed. Files are stored in the FilterMap collection by file type, keyed by a fixed number (the key of the general charter is 1 and that of the statement of requirements is 2); an order is needed because the file containing the most attributes must be extracted first. If the entry is a folder, it is stored in FileList; if an acceptance document exists, only the acceptance document is stored, because the acceptance document is taken as the standard. The files are stored in fileMap by file type. The documents are parsed to generate the parsing result basic, and the project number is looked up in projectMap. If it exists, the database is updated directly and the record for that project number is removed from projectMap. If it does not exist but the document is an acceptance document, the database is also updated and the record for that project number is removed from projectMap. The topic document is then stored in the retrieval subsystem.
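The ordering rule in this step (fixed keys per document type, acceptance documents taking precedence) can be illustrated with a short sketch. It assumes, purely for illustration, that a document's type can be recognized from its file name; the function name `order_documents` and the name matching are hypothetical and are not the FilterMap logic of the original system.

```python
# Keys taken from the text: general charter = 1, statement of requirements = 2.
PRIORITY = {"general charter": 1, "statement of requirements": 2}

def order_documents(filenames):
    """Return the documents to parse, most attribute-rich type first.

    ASSUMPTION: the document type is detected from the file name.
    If any acceptance document exists, only acceptance documents are kept.
    """
    acceptance = [f for f in filenames if "acceptance" in f.lower()]
    if acceptance:
        return sorted(acceptance)

    def rank(name):
        lowered = name.lower()
        for label, key in PRIORITY.items():
            if label in lowered:
                return key
        return len(PRIORITY) + 1  # every other document type comes last

    return sorted(filenames, key=lambda name: (rank(name), name))
```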
3) Parsing the topic direction of each topic
Each topic has one or more topic directions, and the directions of a topic are extracted from multiple documents. The documents are first segmented with a word segmenter and stop words are removed. Investigation showed that topic directions are combinations of nouns and verbs only, so only words tagged as noun, verb, or adjective are retained. The overall design of topic direction extraction is shown in Fig. 6. A computer program parses the topic title and main research content, the improved TextRank algorithm is used to extract the keywords of the text, and the topic direction is determined in combination with semantic analysis.
4) Parsing the key technologies of each topic
Subject terms are first extracted from the title information of the topic's key technology research report and development summary report; text analysis based on word frequency, part of speech, and semantics is then applied, and the resulting subject terms are taken as key technologies. The flow for parsing key technologies is shown in Fig. 7.
5) Parsing key technical indicators
The indicator content in the topic's general charter and requirements analysis specification is analyzed to parse out the key technical indicators. Using a full-text retrieval model, each key technology is treated as a query term and each technical indicator as a document; the relevance between key technologies and key technical indicators is compared, and the indicators relevant to a key technology are associated with it, so that each key technology corresponds to zero or more key technical indicators. The flow for extracting key technical indicators is shown in Fig. 8.
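To make the "key technology as query, indicator as document" idea concrete, the sketch below scores each indicator by the fraction of query words it contains. This is a deliberately simple stand-in for a full-text retrieval model: the whitespace tokenization, the overlap score, the function names, and the threshold of 0.5 are all assumptions made for illustration.

```python
def relevance(key_technology, indicator):
    """Fraction of the query's words that occur in the indicator text."""
    query = set(key_technology.lower().split())
    document = set(indicator.lower().split())
    return len(query & document) / len(query) if query else 0.0

def associate(key_technologies, indicators, threshold=0.5):
    """Map each key technology to zero or more sufficiently relevant indicators."""
    return {
        tech: [ind for ind in indicators if relevance(tech, ind) >= threshold]
        for tech in key_technologies
    }
```

A key technology with no matching indicator maps to an empty list, which corresponds to the "zero or more" case in the text.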
6) Merging topic directions and key technologies
Field - topic direction - key technology - key technical indicator is treated as one overall knowledge graph, and similar topic directions and similar key technologies are merged, which makes it easier to observe the development trends of topic directions and key technologies within a field. Therefore, when topic information is stored, the field, topic direction, key technology, and key technical indicator are associated with one another, and topic directions and key technologies with high similarity are merged (see Fig. 9). A field has multiple directions, a direction has multiple key technologies, and a key technology has multiple key technical indicators, forming several many-to-many relationships.
Compared with the prior art, the positive effects of the present invention are as follows.
In functional testing of information extraction, the file monitoring module monitored files correctly and generated result files (.ok); the topic document parsing module parsed documents successfully and issued a prompt for topics not present in the database; and the merging of topic directions and key technologies also matched the expected results, merging similar topic directions and key technologies. In performance testing of information extraction, topic documents of different sizes were parsed and the parsing speed did not change noticeably, which indicates that parsing speed is not directly related to document size. In the parsing completeness test, the completeness rate exceeded 90%, which meets the requirement for the attributes to be parsed. The server IP address must be configured at first login; when uploading text, business personnel select a data source and click Import to upload; the history view shows the record of past data imports. Import links and content support multiple languages. Users can perform a series of operations on the scientific research knowledge graph: locking, thumbnail view, adding, deleting, and searching, as well as adding or deleting a version. The knowledge tree is divided, in order, into regions for version, field, direction, key technology, and key technical indicator, and the user can operate on every node: each node can be added, deleted, modified, or moved. The navigation bar shows all version information, and clicking a version switches the knowledge tree to that version. A node being moved can first be placed on the navigation bar, which serves as a buffer for node moves, and the user can use the navigator to change the position at which the knowledge tree is displayed. After the user clicks a node, the area on the right shows all topics associated with that node, and the user can use the time selection box in that area to choose the period of the topics to view.
Brief description of the drawings
Fig. 1 is the architecture diagram of the scientific research knowledge graph system;
Fig. 2 is the knowledge graph architecture diagram;
Fig. 3 is the overall deployment diagram of the knowledge graph;
Fig. 4 is the logic diagram for parsing the project library;
Fig. 5 is the logic diagram for document classification and parsing;
Fig. 6 is the overall design diagram of topic direction extraction;
Fig. 7 is the logic flow chart for parsing key technologies;
Fig. 8 is the flow chart for technical indicator extraction;
Fig. 9 is the merging flow chart;
Fig. 10 is the information extraction flow chart.
Detailed description of the embodiments
To make the above features and advantages of the present invention easier to understand, embodiments are described in detail below with reference to the accompanying drawings.
Document information extraction for the scientific research knowledge graph reads and parses the project library file, stores the basic topic information in the database, parses the topic documents to extract the other important information about each topic, and passes the results through the candidate system for processing. The complete data are then inserted into the database, similarity is calculated for topic directions and key technologies, and topic directions and key technologies with high similarity are merged so that items of the same kind are grouped together, finally forming the association field - direction - topic - key technology.
The overall design of topic document extraction comprises the following nine points:
1) The directory in which project libraries are stored is monitored. If there is an unprocessed or new project library file, it is parsed and the basic topic information is stored in the database. The database is checked before storing: if it holds no topic information for the project number, the information is stored with the status set to 0; otherwise it is not stored.
2) All records with status 0 (basic information extracted from the project library only, key information not yet parsed) are queried from the database, and the (project number, id) pairs are stored in the system-wide global Map.
3) Unprocessed documents are found (the file exists but has no corresponding .ok file) and placed in a queue.
4) If an unprocessed document is a project library file, it is parsed, the topic information is stored in the database in batches, and the project number and id are stored in the global Map. On success, a .ok file is generated.
5) If an unprocessed document is a topic document, it is placed in the pending queue to wait for an idle thread.
6) FileMonitor monitors the folders, and newly added folders (scientific research documents) are placed in the pending queue.
7) A thread monitors the queue. If the queue is not empty, the element at the head of the queue is taken and parsed. If the project number of the topic exists in the Map, the database is updated, the status of the topic in the database is set to 2, and the topic information is sent to the retrieval system.
8) Through manual processing and verification, the basic data of the topic (topic direction, key technologies) are modified and updated, and the status is set to 1, which is the final state.
9) The topic table is processed: the cleaned-up basic topic information is inserted into the field table, topic direction table, key technology table, key technical indicator table, field-direction association table, direction-topic association table, and key technology-topic association table, with the status of each set to 1.
The overall information extraction flow is shown in Fig. 10.
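Two details of the flow above lend themselves to a short sketch: the status codes (0, 2, 1) and the rule that a file is unprocessed when it has no corresponding .ok file. The constant and function names below are hypothetical, and the convention that the marker is the file name with ".ok" appended is an assumption; the text only says that a corresponding .ok file exists.

```python
from pathlib import Path

# Status codes as described in the nine steps above.
STATUS_BASIC_ONLY = 0  # basic information from the project library only
STATUS_PARSED = 2      # topic documents parsed, awaiting manual review
STATUS_FINAL = 1       # manually verified, final state

def ok_marker(path):
    """ASSUMPTION: the marker is the file name with '.ok' appended."""
    return path.with_name(path.name + ".ok")

def find_unprocessed(directory):
    """Files that exist but have no corresponding .ok marker."""
    return sorted(
        p for p in Path(directory).iterdir()
        if p.is_file() and p.suffix != ".ok" and not ok_marker(p).exists()
    )

def mark_processed(path):
    """Create the .ok marker after a file has been parsed successfully."""
    ok_marker(Path(path)).touch()
```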
Topic directions are extracted with the TextRank algorithm, which is improved to address its shortcomings: an improved TextRank based on comprehensive weights is used, in which the "comprehensive weight" of each word is calculated using the G1 weighting method, and TextRank is modified to calculate the keywords of the text on the basis of these comprehensive weights. Topic direction extraction processes the main research content of multiple documents. The documents are first segmented with a word segmenter and stop words are removed; since investigation showed that topic directions are combinations of nouns and verbs only, only words tagged as noun, verb, or adjective are retained. The system uses the HanLP word segmentation system, which both segments the text and returns the part of speech of each segmented word. The improved TextRank algorithm is then used to extract the keywords of the topic. In addition, the main research content is segmented, and semantic segmentation is applied to it in combination with the keywords to determine the final topic direction.
TextRank is a graph-based algorithm for extracting keywords and key phrases: it extracts the keywords and subject terms of a text and thereby summarizes its research content. Working on a graph, the algorithm calculates the weight of each node, and each node's weight is in turn influenced by the other nodes, so that the larger the weight of a word, the larger the weights of the nodes connected to it. The comprehensive weight is calculated by formula (1):
w_i = w_1*A_i + w_2*B_i + w_3*C_i + w_4*D_i (1)
A represents TF-IDF, where TF is the term frequency and IDF is the inverse document frequency. B represents the position of the word: at the start, in the middle, or at the end of a sentence. C represents the part of speech. D represents the word length. w_1 to w_4 are the respective weights.
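Formula (1) is a plain weighted sum, which the following sketch spells out. The coefficient values and the numeric encodings of position and part of speech are hypothetical; the text derives the coefficients with the G1 weighting method, which is not reproduced here.

```python
def comprehensive_weight(tfidf, position, pos_tag, length, coeffs):
    """Formula (1): w_i = w_1*A_i + w_2*B_i + w_3*C_i + w_4*D_i."""
    w1, w2, w3, w4 = coeffs
    return w1 * tfidf + w2 * position + w3 * pos_tag + w4 * length

# Hypothetical coefficients and feature scores, for illustration only.
coeffs = (0.4, 0.2, 0.3, 0.1)
score = comprehensive_weight(tfidf=0.5, position=1.0, pos_tag=0.8, length=0.4, coeffs=coeffs)
```

In practice each feature would first be scaled to a comparable range, for example by mapping sentence-initial, sentence-medial, and sentence-final positions to fixed scores.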
The TextRank algorithm builds a directed graph from the word segmentation result. Let G(V, E) be a directed graph with node set V and edge set E, where E is a subset of V × V. For a word node V_i in the graph, the set of all nodes that point to it is denoted In(V_i), and the set of nodes that a node V_j points to is denoted Out(V_j). The weight of word node V_i can then be calculated by formula (2), which in the standard TextRank formulation reads:
S(V_i) = (1 - d) + d * Σ_{V_j ∈ In(V_i)} S(V_j) / |Out(V_j)| (2)
d is the adjustment coefficient, generally taken as 0.85.
However, the TextRank algorithm assumes that weights play no role between nodes, that is, that every node is equally important, which is not the case in real text. The present invention therefore calculates weights for the different nodes and gives important nodes a larger weight. The calculation formula becomes formula (3) [formula not reproduced in the source text], where w_ji is the weight of the edge from node V_j to node V_i.
The improved TextRank keyword extraction algorithm is:
1) the main research content is segmented with a word segmenter; the set of all words forms the node set V, and words are associated with one another according to the segmentation result to establish the corresponding edge set E;
2) the weights of all nodes in V are calculated with formula (2), iterating until the result converges;
3) once the weight of every node has been calculated, the nodes are sorted by weight in descending order and a certain range of top-ranked words is chosen as the keywords of the text.
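The iterate-until-convergence step can be sketched with the standard TextRank update, in which each node receives a share of the score of every node pointing to it. The function name `textrank`, the adjacency-dict input, and the tolerance are assumptions made for this illustration; d = 0.85 follows the text.

```python
def textrank(edges, d=0.85, tol=1e-6, max_iter=200):
    """Iterate S(V_i) = (1 - d) + d * sum(S(V_j) / |Out(V_j)|) until convergence.

    edges maps each node to the list of nodes it points to.
    """
    nodes = set(edges) | {v for targets in edges.values() for v in targets}
    in_links = {v: [] for v in nodes}
    for source, targets in edges.items():
        for target in targets:
            in_links[target].append(source)
    score = dict.fromkeys(nodes, 1.0)
    for _ in range(max_iter):
        new = {
            v: (1 - d) + d * sum(score[u] / len(edges[u]) for u in in_links[v])
            for v in nodes
        }
        converged = max(abs(new[v] - score[v]) for v in nodes) < tol
        score = new
        if converged:
            break
    return score

def top_keywords(score, n):
    """Step 3: sort by weight in descending order and keep the first n words."""
    return sorted(score, key=score.get, reverse=True)[:n]
```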
The improved TextRank algorithm proceeds as follows:
A) Remove stop words from the text to obtain the resulting set of words.
For example: [software, personnel, programmer, advanced, programmer, system, analyst, project, manager]
B) For each word, take the 5 words before and after it; these are recorded as the words it can be associated with.
{
software=[personnel, programmer, advanced, system],
personnel=[software, programmer, advanced, system, analyst],
programmer=[software, personnel, advanced, programmer, system, analyst, project, manager],
advanced=[software, personnel, programmer, system, analyst, project, manager],
system=[software, personnel, advanced, programmer, analyst, project, manager],
analyst=[personnel, advanced, programmer, system, project, manager],
project=[advanced, programmer, system, analyst, manager],
manager=[advanced, programmer, system, analyst, project]
}
C) The weight between each word and its associated words is calculated from the distance between them; that is, for word i, the weights are calculated from its distance to the 5 words before and after it. The distance weight formula is w_ji = (5 - k + 1)/5, where k is the distance in words between the two words (k = 1 for adjacent words); when a word occurs more than once, the values are averaged.
{
software=[personnel (1), programmer (0.8), advanced (0.6), system (0.2)],
personnel=[software (1), programmer (0.8), advanced (0.8), system (0.4), analyst (0.2)],
programmer=[software (0.8), personnel (1), advanced (1), programmer (0.8), system (0.8), analyst (0.6), project (0.4), manager (0.2)],
advanced=[software (0.6), personnel (0.8), programmer (1), system (0.8), analyst (0.6), project (0.4), manager (0.2)],
system=[software (0.2), personnel (0.4), advanced (0.8), programmer (0.8), analyst (1), project (0.8), manager (0.6)],
analyst=[personnel (0.2), advanced (0.6), programmer (0.6), system (1), project (0.8), manager (0.6)],
project=[advanced (0.4), programmer (0.4), system (0.8), analyst (1), manager (1)],
manager=[advanced (0.2), programmer (0.4), system (0.6), analyst (0.8), project (1)]
}
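Step C) can be sketched as follows, using the distance weight (5 - k + 1)/5 and averaging over repeated words. The function name `window_weights` is hypothetical. For the example word list, this sketch reproduces, for instance, the "advanced" and "project" rows of the listing above; a few entries of the listing do not follow from strict averaging, so the sketch should not be expected to match every value there.

```python
from collections import defaultdict

def window_weights(words, window=5):
    """Return {word: {neighbor: weight}} with weight = (window - k + 1) / window.

    k is the distance in words; values for repeated words are averaged.
    """
    collected = defaultdict(lambda: defaultdict(list))
    for i, word in enumerate(words):
        start, stop = max(0, i - window), min(len(words), i + window + 1)
        for j in range(start, stop):
            if j != i:
                k = abs(i - j)
                collected[word][words[j]].append((window - k + 1) / window)
    return {
        word: {nbr: sum(vals) / len(vals) for nbr, vals in nbrs.items()}
        for word, nbrs in collected.items()
    }

words = ["software", "personnel", "programmer", "advanced", "programmer",
         "system", "analyst", "project", "manager"]
weights = window_weights(words)
```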
D) The comprehensive weight w_i of each word is calculated according to formula (1).
E) The results of formula (1) are substituted into formula (3), and the weight of each word is recalculated by iterating formula (3) until convergence.
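Because formula (3) is not reproduced in this text, the exact way the comprehensive weight w_i enters the update cannot be recovered here. The sketch below shows one plausible arrangement and should be read as an assumption, not as the patented formula: edge weights w_ji are normalized over each node's outgoing edges, as in the usual weighted TextRank, and w_i scales the constant (1 - d) term so that important words start from a higher baseline.

```python
def weighted_textrank(edge_w, node_w, d=0.85, tol=1e-6, max_iter=500):
    """Weighted TextRank sketch.

    edge_w[j][i] is the weight w_ji of the edge j -> i.
    node_w[i] is the comprehensive weight w_i of node i.
    ASSUMPTION: w_i multiplies the (1 - d) term; formula (3) may differ.
    """
    nodes = list(node_w)
    out_sum = {j: sum(edge_w.get(j, {}).values()) for j in nodes}
    score = dict.fromkeys(nodes, 1.0)
    for _ in range(max_iter):
        new = {}
        for i in nodes:
            incoming = sum(
                edge_w[j][i] / out_sum[j] * score[j]
                for j in nodes
                if i in edge_w.get(j, {})
            )
            new[i] = (1 - d) * node_w[i] + d * incoming
        converged = max(abs(new[v] - score[v]) for v in nodes) < tol
        score = new
        if converged:
            break
    return score
```

The edge weights from step C) and the comprehensive weights from step D) can be passed in directly as `edge_w` and `node_w`.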
The problem document content extraction and semantic fusion technique extracts the key technologies of a problem from different documents. The key technology parsing flow shown in Fig. 7 extracts the key technologies of a problem from each of the following parts.
1) Titles are extracted from the key technology document and the development final report and then filtered: each filter word is replaced with an empty string, and if the remaining string is longer than 5 characters it is taken as a key technology;
2) Semantic analysis: the text under the technology sections of the key technology document and the development final report is analyzed to find words followed by "technology", and the segmented word before "technology" is checked to see whether it is a noun, a nominal verb, or a gerund;
3) As in the method for analyzing the problem direction, the text under the technology sections of the key technology document and the development final report is searched for nominal-verb combinations or gerund combinations; a combination that is also a text keyword is taken as a key technology.
When titles are extracted from the key technology document and the development final report, statistics show that the titles may contain words that interfere with the key technology, and some titles are not technology titles at all, which affects the extraction of key technologies. If such words appear, each of them must be filtered out.
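The title filtering of part 1) and the "technology" pattern of part 2) can be sketched as follows. The filter word list, the sample titles, and the part-of-speech tag names are assumptions made for illustration; the source does not list the actual filter words or name a tagger. Stripping surrounding whitespace after filtering is also an addition of this sketch.

```python
FILTER_WORDS = ["Key technology:", "Overview", "Summary"]  # hypothetical list
MIN_LENGTH = 5  # a title is kept only if more than 5 characters remain


def filter_title(title, filter_words=FILTER_WORDS, min_length=MIN_LENGTH):
    """Replace each filter word with an empty string; keep long remainders."""
    for word in filter_words:
        title = title.replace(word, "")
    title = title.strip()  # whitespace cleanup, not stated in the source
    return title if len(title) > min_length else None


def technology_phrases(tagged, suffix="technology", allowed=("n", "vn")):
    """Collect '<word> technology' where <word> carries an allowed POS tag.

    tagged is a list of (word, pos) pairs produced by any word segmenter.
    The tags 'n' (noun) and 'vn' (nominal verb) are assumed names; a gerund
    tag would be added according to the tagger actually used.
    """
    phrases = []
    for (prev_word, prev_pos), (word, _) in zip(tagged, tagged[1:]):
        if word == suffix and prev_pos in allowed:
            phrases.append(f"{prev_word} {word}")
    return phrases


titles = ["Key technology: adaptive beamforming", "Overview"]
kept = [t for t in map(filter_title, titles) if t is not None]
```

Here the second title is discarded because nothing remains after filtering, while the first survives with its filter word removed.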
The database stores the basic directions, key technologies, and key technical indices. A manual recommendation system is used to improve the accuracy of the directions, key technologies, and indices. The basic directions of the problems are read from the database, and the directions within the same field are clustered; key technologies are then clustered under similar directions within the same field, finally generating the association table of field-direction-problem-key technology-key technical index.
Because problem directions and key technologies are to be merged, with items of the same type or similar items combined, similarity must be calculated. Several similarity algorithms are therefore used to compute short-text similarity; the average of the resulting similarities is taken, and texts with high similarity are merged together.
1) Simple shared words
The simple shared-word similarity is calculated from the total number of characters in the words shared by the documents and the number of characters in the longest document. Specifically, the total number of characters in the shared words is divided by the number of characters in the longest document, and the result is used to assess similarity.
2) Edit distance
This method counts the operations needed to transform one character string into another. If the number of operations is large, the degree of transformation is large and the similarity is low; conversely, the similarity is high.
3) Cosine similarity
This method calculates the cosine of the angle between two vectors and uses the angle to reflect the degree of similarity. If the angle between the vectors is large (the cosine value is small), the similarity is low; conversely, a small angle (a cosine value close to 1) indicates high similarity.
4) Jaccard similarity coefficient
The Jaccard similarity is calculated on sets: the two sentences are each converted into a set, and the size of the intersection of the sets is divided by the size of their union to measure the relation between them.
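The four measures and their average can be sketched as below. Inputs are assumed to be lists of already segmented words. Converting the edit distance into a similarity as 1 minus distance over the longer length is an assumption, since the text only states that more operations mean lower similarity. All function names are hypothetical.

```python
import math
from collections import Counter


def shared_word_similarity(a, b):
    """Characters in shared words divided by the longer text's character count."""
    shared = set(a) & set(b)
    longest = max(sum(map(len, a)), sum(map(len, b)))
    return sum(len(w) for w in shared) / longest if longest else 0.0


def edit_distance(s, t):
    """Levenshtein distance: insertions, deletions and substitutions."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        cur = [i]
        for j, ct in enumerate(t, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (cs != ct)))
        prev = cur
    return prev[-1]


def edit_similarity(a, b):
    """1 - distance / longer length (normalization assumed by this sketch)."""
    s, t = "".join(a), "".join(b)
    longest = max(len(s), len(t))
    return 1.0 - edit_distance(s, t) / longest if longest else 1.0


def cosine_similarity(a, b):
    """Cosine of the angle between the two word-count vectors."""
    ca, cb = Counter(a), Counter(b)
    dot = sum(ca[w] * cb[w] for w in ca)
    norm = math.sqrt(sum(v * v for v in ca.values())) * math.sqrt(
        sum(v * v for v in cb.values())
    )
    return dot / norm if norm else 0.0


def jaccard_similarity(a, b):
    """Size of the intersection divided by the size of the union."""
    sa, sb = set(a), set(b)
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0


def average_similarity(a, b):
    """Mean of the four measures, used to decide whether two texts merge."""
    scores = (
        shared_word_similarity(a, b),
        edit_similarity(a, b),
        cosine_similarity(a, b),
        jaccard_similarity(a, b),
    )
    return sum(scores) / len(scores)


x = ["radar", "signal", "processing"]
y = ["radar", "signal", "detection"]
score = average_similarity(x, y)
```

Averaging treats the four measures equally; a weighted mean would be an equally plausible reading if some measures are trusted more than others.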
The basic directions, key technologies, and key technical indices of the problems are read from the database; problem directions within the same field are merged, and key technologies under the same direction are merged. For a new problem direction, its similarity to the problem directions in the same field is calculated; for a new key technology, its similarity to the key technologies under the same direction is calculated. The maximum value max is taken. If max exceeds a certain threshold, the new item is considered to belong under the matching problem direction or key technology and is recorded as an alternate item of that problem direction or key technology. If max does not exceed the threshold, the new problem direction is added under the field, or the new key technology is added under the direction.
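The merge rule above can be sketched as follows. The function `place_item`, the threshold value, and the word-level Jaccard similarity used for the demonstration are all assumptions; the text leaves the threshold open and combines several similarity measures.

```python
def jaccard(a, b):
    """Word-level Jaccard similarity, used here only as a stand-in measure."""
    sa, sb = set(a.split()), set(b.split())
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0


def place_item(new_item, existing, similarity, threshold):
    """Attach new_item as an alternate of its best match, or add it as new.

    existing maps each known direction (or key technology) to its list of
    alternate items. Returns ('alternate', best_match) or ('new', None).
    """
    best, best_score = None, 0.0
    for candidate in existing:
        score = similarity(new_item, candidate)
        if score > best_score:
            best, best_score = candidate, score
    if best is not None and best_score > threshold:
        existing[best].append(new_item)  # max exceeds the threshold
        return "alternate", best
    existing[new_item] = []  # no sufficiently similar entry: add a new one
    return "new", None


directions = {"radar signal processing": []}
first = place_item("radar signal detection", directions, jaccard, 0.4)
second = place_item("image compression", directions, jaccard, 0.4)
```

The same function serves both levels: called with the directions of one field, or with the key technologies of one direction.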
The above embodiments are intended only to illustrate the technical solution of the present invention and not to limit it. A person of ordinary skill in the art may modify the technical solution of the present invention or replace it with equivalents without departing from the spirit and scope of the present invention. The scope of protection of the present invention shall be as defined in the claims.
Claims (10)
1. A method for constructing a knowledge graph network based on massive scientific research data, the steps of which comprise:
1) parsing a project library, extracting the basic information of each problem, and parsing the document information and the problem direction of each problem; wherein the basic information includes the problem field;
2) extracting topic words from the title information of the set documents of each problem as the problem key technologies of the corresponding problem;
3) clustering the problem directions that belong to the same field; for the problems in the same cluster result, parsing key technical indices from the general charter and the requirements analysis specification of each problem, and then, according to the degree of correlation between the problem key technologies and the key technical indices of the problems in the same cluster result, associating the problem key technologies with the key technical indices, so that each key technology corresponds to several key technical indices; and finally generating the association table of field-problem direction-problem-key technology-key technical index, i.e., the knowledge graph of the project library.
2. The method of claim 1, characterized in that the problem direction is parsed as follows:
1) the target document content of the problem is segmented into words, and a directed graph is formed from the segmentation result;
2) for each word node $V_i$ in the directed graph, the final weight $S(V_i)$ of word node $V_i$ is calculated using the formula [formula not reproduced]; wherein $In(V_i)$ is the set of word nodes that point to word node $V_i$, $Out(V_j)$ is the set of word nodes that word node $V_j$ points to, $d$ is an adjustment factor, $w_{ji}$ is the weight of the edge from word node $V_j$ to word node $V_i$, and $w_i$ is the comprehensive weight of word node $V_i$;
3) several words are selected as the problem direction of the problem according to the final weights of the word nodes.
3. The method of claim 2, characterized in that the comprehensive weight of word node $V_i$ is $w_i = w_1 A_i + w_2 B_i + w_3 C_i + w_4 D_i$; wherein $A_i$ is the TF-IDF of word node $V_i$, with weight $w_1$; $B_i$ represents the position of word node $V_i$, with weight $w_2$; $C_i$ represents the part of speech of word node $V_i$, with weight $w_3$; and $D_i$ represents the length of word node $V_i$, with weight $w_4$.
4. The method of claim 1, characterized in that the problem key technologies are obtained as follows: topic words are extracted from the titles of the key technology document and the development final report of the problem as key technologies of the problem; semantic analysis is performed on the technology text of the key technology document and the development final report, words containing "technology" are found in the text, and it is determined whether the segmented word before "technology" is a noun, a nominal verb, or a gerund, and if so, "technology" together with the preceding segmented word is taken as a key technology of the problem; and nominal-verb combinations or gerund combinations that are also text keywords are found in the technology text of the key technology document and the development final report and taken as key technologies.
5. The method of claim 1, characterized in that the problem directions and the key technologies are merged separately: problem directions of the same type or similar problem directions are merged together, and key technologies of the same type or similar key technologies are merged together.
6. The method of claim 5, characterized in that, for a new problem direction, its similarity to the problem directions in the same field is calculated and the maximum value max is taken; if max exceeds a set threshold K, the new problem direction is taken as an alternate item of the problem direction most similar to it, and otherwise the new problem direction is added under the corresponding field; for a new key technology, its similarity to the key technologies under the same problem direction is calculated and the maximum value max is taken; if max exceeds a set threshold G, the new key technology is taken as an alternate item of the key technology most similar to it, and otherwise the new key technology is added under the corresponding problem direction.
7. The method of any one of claims 1 to 6, characterized in that the basic information further includes the project number, problem title, contract number, carrier, problem security classification, requisition number, sponsoring department, sponsoring department contact person, science and technology department supervisor, and participants; and the document information includes the project number, problem title, achievement form, main research content, scientific research personnel, key technologies, and key technical indices.
8. A system for constructing a knowledge graph network based on massive scientific research data, characterized by comprising a problem parsing module, a problem key technology extraction module, and a knowledge graph generation module; wherein
the problem parsing module is configured to parse a project library, extract the basic information of each problem, and parse the document information and the problem direction of each problem; wherein the basic information includes the problem field;
the problem key technology extraction module is configured to extract topic words from the title information of the set documents of each problem as the problem key technologies of the corresponding problem;
the knowledge graph generation module is configured to cluster the problem directions that belong to the same field; for the problems in the same cluster result, to parse key technical indices from the general charter and the requirements analysis specification of each problem, and then, according to the degree of correlation between the problem key technologies and the key technical indices of the problems in the same cluster result, to associate the problem key technologies with the key technical indices, so that each key technology corresponds to several key technical indices; and finally to generate the association table of field-problem direction-problem-key technology-key technical index, i.e., the knowledge graph of the project library.
9. The system of claim 8, characterized in that the problem parsing module segments the target document content of the problem into words and forms a directed graph from the segmentation result; then, for each word node $V_i$ in the directed graph, calculates the final weight $S(V_i)$ of word node $V_i$ using the formula [formula not reproduced]; wherein $In(V_i)$ is the set of word nodes that point to word node $V_i$, $Out(V_j)$ is the set of word nodes that word node $V_j$ points to, $d$ is an adjustment factor, $w_{ji}$ is the weight of the edge from word node $V_j$ to word node $V_i$, and $w_i$ is the comprehensive weight of word node $V_i$; and then selects several words as the problem direction of the problem according to the final weights of the word nodes.
10. A computer-readable storage medium storing a computer program, characterized in that the computer program includes instructions for performing each step of the method of any one of claims 1 to 7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710928133.9A CN107967290A (en) | 2017-10-09 | 2017-10-09 | A kind of knowledge mapping network establishing method and system, medium based on magnanimity scientific research data |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710928133.9A CN107967290A (en) | 2017-10-09 | 2017-10-09 | A kind of knowledge mapping network establishing method and system, medium based on magnanimity scientific research data |
Publications (1)
Publication Number | Publication Date |
---|---|
CN107967290A (en) | 2018-04-27 |
Family
ID=61997426
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201710928133.9A Pending CN107967290A (en) | 2017-10-09 | 2017-10-09 | A kind of knowledge mapping network establishing method and system, medium based on magnanimity scientific research data |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN107967290A (en) |
Cited By (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109087053A (en) * | 2018-06-01 | 2018-12-25 | 平安科技(深圳)有限公司 | Synergetic office work processing method, device, equipment and medium based on associated topologies figure |
CN109241278A (en) * | 2018-07-18 | 2019-01-18 | 绍兴诺雷智信息科技有限公司 | Scientific research knowledge management method and system |
CN110119473A (en) * | 2019-05-23 | 2019-08-13 | 北京金山数字娱乐科技有限公司 | A kind of construction method and device of file destination knowledge mapping |
CN111126034A (en) * | 2019-12-17 | 2020-05-08 | 南京医基云医疗数据研究院有限公司 | Medical variable relation processing method and device, computer medium and electronic equipment |
CN112800243A (en) * | 2021-02-04 | 2021-05-14 | 天津德尔塔科技有限公司 | Project budget analysis method and system based on knowledge graph |
CN113569060A (en) * | 2021-09-24 | 2021-10-29 | 中国电子技术标准化研究院 | Standard text based knowledge graph disambiguation method, system, device and medium |
CN113642031A (en) * | 2021-10-15 | 2021-11-12 | 中国铁道科学研究院集团有限公司科学技术信息研究所 | Subject acceptance method and system |
CN115186111A (en) * | 2022-09-13 | 2022-10-14 | 中国医学科学院医学信息研究所 | Index data semantic association and fusion method, system and storage medium |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102760058A (en) * | 2012-04-05 | 2012-10-31 | 中国人民解放军国防科学技术大学 | Massive software project sharing method oriented to large-scale collaborative development |
CN103927360A (en) * | 2014-04-18 | 2014-07-16 | 北京大学 | Software project semantic information presentation and retrieval method based on graph model |
CN106205248A (en) * | 2016-08-31 | 2016-12-07 | 北京师范大学 | A kind of representative learning person generates system and method at the on-line study cognitive map of domain-specific knowledge learning and mastering state |
CN106815293A (en) * | 2016-12-08 | 2017-06-09 | 中国电子科技集团公司第三十二研究所 | System and method for constructing knowledge graph for information analysis |
CN106844658A (en) * | 2017-01-23 | 2017-06-13 | 中山大学 | A kind of Chinese text knowledge mapping method for auto constructing and system |
CN107193870A (en) * | 2017-04-12 | 2017-09-22 | 广东万丈金数信息技术股份有限公司 | The extracting method and system of web page contents |
Non-Patent Citations (3)
Title |
---|
王炎: "基于多数据源的专家学术网络构建及其应用", 《中国优秀硕士学位论文全文数据库 信息科技辑》 * |
葛斌等: "基于无向图构建策略的主题句抽取", 《计算机科学》 * |
陈兴元等: "科研活动与知识图谱关系的探讨", 《无线互联科技》 * |
Cited By (11)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109087053A (en) * | 2018-06-01 | 2018-12-25 | 平安科技(深圳)有限公司 | Synergetic office work processing method, device, equipment and medium based on associated topologies figure |
CN109087053B (en) * | 2018-06-01 | 2023-05-09 | 平安科技(深圳)有限公司 | Collaborative office processing method, device, equipment and medium based on association topological graph |
CN109241278A (en) * | 2018-07-18 | 2019-01-18 | 绍兴诺雷智信息科技有限公司 | Scientific research knowledge management method and system |
CN109241278B (en) * | 2018-07-18 | 2022-04-26 | 绍兴诺雷智信息科技有限公司 | Scientific research knowledge management method and system |
CN110119473A (en) * | 2019-05-23 | 2019-08-13 | 北京金山数字娱乐科技有限公司 | A kind of construction method and device of file destination knowledge mapping |
CN111126034A (en) * | 2019-12-17 | 2020-05-08 | 南京医基云医疗数据研究院有限公司 | Medical variable relation processing method and device, computer medium and electronic equipment |
CN111126034B (en) * | 2019-12-17 | 2023-09-19 | 南京医基云医疗数据研究院有限公司 | Medical variable relation processing method and device, computer medium and electronic equipment |
CN112800243A (en) * | 2021-02-04 | 2021-05-14 | 天津德尔塔科技有限公司 | Project budget analysis method and system based on knowledge graph |
CN113569060A (en) * | 2021-09-24 | 2021-10-29 | 中国电子技术标准化研究院 | Standard text based knowledge graph disambiguation method, system, device and medium |
CN113642031A (en) * | 2021-10-15 | 2021-11-12 | 中国铁道科学研究院集团有限公司科学技术信息研究所 | Subject acceptance method and system |
CN115186111A (en) * | 2022-09-13 | 2022-10-14 | 中国医学科学院医学信息研究所 | Index data semantic association and fusion method, system and storage medium |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN107967290A (en) | A kind of knowledge mapping network establishing method and system, medium based on magnanimity scientific research data | |
US11573996B2 (en) | System and method for hierarchically organizing documents based on document portions | |
US8676815B2 (en) | Suffix tree similarity measure for document clustering | |
US9715493B2 (en) | Method and system for monitoring social media and analyzing text to automate classification of user posts using a facet based relevance assessment model | |
CN107180045B (en) | Method for extracting geographic entity relation contained in internet text | |
Inzalkar et al. | A survey on text mining-techniques and application | |
Lin et al. | An integrated approach to extracting ontological structures from folksonomies | |
CN110019689A (en) | Position matching process and position matching system | |
CN113190687B (en) | Knowledge graph determining method and device, computer equipment and storage medium | |
Das et al. | A CV parser model using entity extraction process and big data tools | |
Nikhil et al. | A survey on text mining and sentiment analysis for unstructured web data | |
Wang et al. | Neural related work summarization with a joint context-driven attention mechanism | |
Gong et al. | Phrase-based hashtag recommendation for microblog posts. | |
KR101476225B1 (en) | Method for Indexing Natural Language And Mathematical Formula, Apparatus And Computer-Readable Recording Medium with Program Therefor | |
CN109902230A (en) | A kind of processing method and processing device of news data | |
Çelebi et al. | Automatic question answering for Turkish with pattern parsing | |
Tran et al. | A named entity recognition approach for tweet streams using active learning | |
Aljević et al. | Extractive text summarization based on selectivity ranking | |
Ahmed et al. | Building multiview analyst profile from multidimensional query logs: from consensual to conflicting preferences | |
CN112711695A (en) | Content-based search suggestion generation method and device | |
Zeng et al. | Construction of scenic spot knowledge graph based on ontology | |
Lim et al. | Generalized and lightweight algorithms for automated web forum content extraction | |
Singh et al. | EfficientPMM: Finite Automata Based Efficient Pattern Matching Machine | |
Magnini et al. | Entailment graphs for text analytics in the excitement project | |
Bernardes et al. | Exploring NPL: Generating Automatic Control Keywords |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
AD01 | Patent right deemed abandoned | Effective date of abandoning: 20220624 |