CN108874990A - Method and system for extracting unstructured data from power technology journal papers - Google Patents
- Publication number
- CN108874990A (application CN201810600133.0A)
- Authority
- CN
- China
- Prior art keywords
- paper
- data
- module
- character
- forward position
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/205—Parsing
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/004—Artificial life, i.e. computing arrangements simulating life
- G06N3/006—Artificial life, i.e. computing arrangements simulating life based on simulated virtual individual or collective life forms, e.g. social simulations or particle swarm optimisation [PSO]
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Health & Medical Sciences (AREA)
- Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Computational Linguistics (AREA)
- General Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Biomedical Technology (AREA)
- Molecular Biology (AREA)
- Computing Systems (AREA)
- Evolutionary Computation (AREA)
- Data Mining & Analysis (AREA)
- Mathematical Physics (AREA)
- Software Systems (AREA)
- Biophysics (AREA)
- Life Sciences & Earth Sciences (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention belongs to the technical field of data processing and discloses a method and system for extracting unstructured data from power technology journal papers. The data extraction system includes an input module, a search module, a big data analysis module, an extraction module and a data storage module. Through the search module, the invention generates and provides new search results according to the entity identifier corresponding to an attribute parameter. Because an attribute parameter can be treated as an entity, converting the entity into an entity identifier and relying on the uniqueness of that identifier yields the corresponding search results. This resolves problems such as entities with the same name and partial matching of long search terms, and improves both the accuracy of search results and the efficiency of data extraction. In addition, the big data analysis module effectively reduces query response time and allows paper data elements to be analysed quickly, which improves the data extraction speed.
Description
Technical field
The invention belongs to the technical field of data processing, and more particularly relates to a method and system for extracting unstructured data from power technology journal papers.
Background art
At present, the prior art commonly used in the industry is as follows:
With the rapid development of computer technology and the Internet, online querying, retrieval and downloading of specialist data have become important means of scientific and technical information retrieval. For the various online full-text and abstract databases, the index of paper abstracts is an important tool for readers searching the literature, and it facilitates the construction and maintenance of scientific and technical literature retrieval systems. An abstract is a comprehensive introduction to a paper that lets the reader understand its main content. After a paper is published, abstract journals and databases can use the abstract directly, with little or no modification, so that readers grasp the main content of the paper as early as possible and the shortcomings of the title are compensated; this avoids the misreadings, omissions and even errors that can arise when an abstract is written by someone else. The quality of a paper's abstract therefore directly affects how often the paper is retrieved and cited. However, traditional search relies on keyword matching and cannot solve the partial-matching problem, so search results may be inaccurate, which reduces data extraction efficiency. It also cannot provide more information about the development context between papers, so researchers cannot obtain research resources efficiently and the analysis of paper data is slow.
In summary, the problems of the prior art are:
Traditional search relies on keyword matching and cannot solve the partial-matching problem, so search results may be inaccurate, which reduces data extraction efficiency.
Meanwhile, it cannot provide more information about the development context between papers, so researchers cannot obtain research resources efficiently and the analysis of paper data is slow.
In a complex data environment, the dimension of the initial feature set after feature extraction keeps growing, and the dimension of the fused features may be very high; information redundancy then necessarily exists between features, and classification performance deteriorates. Existing feature extraction techniques all build on single-objective optimization and add minimizing the size of the feature subset as another optimization objective. The size of the feature subset, however, is a discrete objective, and the solution set obtained usually contains only one solution for each subset size, so other feature subsets of the same size but with different specific features cannot be found, even though these subsets are also useful for signal feature extraction. In addition, a multi-objective feature selection algorithm ultimately yields a series of compromise solutions, from which a well-performing solution must be chosen, yet few unsupervised methods are currently available for this. The main difficulties are: the feature-set evaluation functions and search strategies designed so far fail to consider the redundancy and relevance of the feature subset; the evaluation criteria do not consider the effect of the choice of feature-subset dimension on classification effectiveness; and extracting the feature dimension and ranking the importance of feature subsets from the Pareto solution set of a multi-objective optimization algorithm in an unsupervised manner remains unresolved.
Summary of the invention
In view of the problems of the prior art, the present invention provides a method for extracting unstructured data from power technology journal papers.
The invention is realized as follows. A method for extracting unstructured data from power technology journal papers includes:
inputting the title and path of a paper through an input module;
searching for key data information of the paper content through a search module;
analysing and processing the local paper data set through a big data analysis module, constructing a paper citation network in a database, and analysing the paper content and related papers;
initializing the data storage module through an extraction module: if the number of Pareto front points is less than a preset value R, all points are stored directly in the data storage module; if the number of Pareto front points is greater than the preset value, the crowding distances of all Pareto front points are calculated according to formula (5) and points are deleted one by one, starting from the point with the smallest crowding distance, until the number of Pareto front points to be stored in the data storage module equals the preset value; these front points are then stored in the data storage module; the author information, abstract, keywords and other core data of the paper are then extracted;

$$d_i = \sum_{m=1}^{n} \frac{f_m^{i+1} - f_m^{i-1}}{f_m^{\max} - f_m^{\min}} \tag{5}$$

where $n$ is the number of objective functions, $d_i$ is the crowding distance of the $i$-th character object in the population, $f_m^{\max}$ and $f_m^{\min}$ are the maximum and minimum values of the $m$-th objective function in the population, and $f_m^{i+1}$ and $f_m^{i-1}$ are the $m$-th objective function values of the points nearest to the $i$-th character object on either side in dimension $m$;
storing the extracted author information, abstract, keywords and other core data through a data storage module.
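The archive truncation described above can be sketched in Python as follows. This is a minimal illustration, not the patented implementation: the function names `crowding_distances` and `truncate_archive` are hypothetical, and giving boundary points an infinite distance is an assumption borrowed from common practice in NSGA-II style algorithms, since the source text does not say how points with only one neighbour are handled.

```python
from typing import List, Sequence


def crowding_distances(points: Sequence[Sequence[float]]) -> List[float]:
    """Crowding distance of every point: sum over objectives of the
    normalised gap between its two nearest neighbours (formula (5))."""
    count = len(points)
    if count == 0:
        return []
    n_objectives = len(points[0])
    dist = [0.0] * count
    for m in range(n_objectives):
        # Sort point indices by their value on objective m.
        order = sorted(range(count), key=lambda i: points[i][m])
        f_min = points[order[0]][m]
        f_max = points[order[-1]][m]
        # Assumption: boundary points are always kept (infinite distance).
        dist[order[0]] = dist[order[-1]] = float("inf")
        if f_max == f_min:
            continue  # objective m cannot separate the points
        for k in range(1, count - 1):
            gap = points[order[k + 1]][m] - points[order[k - 1]][m]
            dist[order[k]] += gap / (f_max - f_min)
    return dist


def truncate_archive(points: Sequence[Sequence[float]], capacity: int) -> list:
    """Delete the most crowded point one at a time until at most
    `capacity` Pareto front points remain."""
    archive = list(points)
    while len(archive) > capacity:
        d = crowding_distances(archive)
        archive.pop(d.index(min(d)))
    return archive


front = [(0.0, 1.0), (0.1, 0.9), (0.5, 0.5), (1.0, 0.0)]
print(truncate_archive(front, 3))
```

The distances are recomputed after every deletion, which matches the "delete one by one" wording; computing them once and deleting several points in a single pass would be cheaper but can remove neighbouring points that were only crowded by each other.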
Further, the extraction method of the extraction module specifically includes:
Step 1: calculate the Pareto front points. The fitness of each radar signal feature individual is calculated according to the relevance and redundancy objective functions, and the Pareto front points among the current character individuals are found; the time complexity is O(N^2).
Step 2: initialize the data storage module. If the number of Pareto front points is less than a preset value R, all points are stored directly in the data storage module. If the number of Pareto front points is greater than the preset value, the crowding distances of all Pareto front points are calculated according to formula (5) and points are deleted one by one, starting from the point with the smallest crowding distance, until the number of Pareto front points to be stored in the data storage module equals the preset value; these front points are then stored in the data storage module.

$$d_i = \sum_{m=1}^{n} \frac{f_m^{i+1} - f_m^{i-1}}{f_m^{\max} - f_m^{\min}} \tag{5}$$

where $n$ is the number of objective functions, $d_i$ is the crowding distance of the $i$-th character object in the population, $f_m^{\max}$ and $f_m^{\min}$ are the maximum and minimum values of the $m$-th objective function in the population, and $f_m^{i+1}$ and $f_m^{i-1}$ are the $m$-th objective function values of the points nearest to the $i$-th character object on either side in dimension $m$.
Step 3: call the division rule to create elementary membranes. After the preparation is complete, division starts in the skin membrane and generates M elementary membranes; the number M of elementary membranes equals the number of Pareto front points in the data storage module. Each archived Pareto front point then serves as the best individual of the population in its elementary membrane. Finally, each remaining individual is placed in the elementary membrane containing the Pareto front point nearest to it; the time complexity is O(N × R).
Step 4: execute the particle swarm algorithm independently in each elementary membrane. In each elementary membrane, the Pareto front point first stored in the data storage module is taken as the best individual of the population, and the new velocity and position of each individual are calculated according to formula (3), $X_{t+1} = X_t + V_{t+1}$, the associated velocity formula, and formula (4), $\Pi = (V, T, C, \mu, \omega_1, \ldots, \omega_m, (R_1, \rho_1), \ldots, (R_m, \rho_m))$; the fitness is then recalculated at the new position. In formula (3), $V_t$ and $V_{t+1}$ are the velocities of the $t$-th and $(t+1)$-th flights, and $X_t$ and $X_{t+1}$ are the positions at which the particle lands after the $t$-th and $(t+1)$-th flights. In formula (4), $V$ is the alphabet, whose elements are character objects; it is an abstraction of the chemical elements and substances inside a cell. $T$ is the output alphabet. $C$ is the set of catalysts: these elements do not change during the evolution of the cell and do not generate new characters, but certain evolution rules can execute only with their participation, and without them those rules cannot be executed. $\mu$ is the membrane structure containing $m$ membranes; each membrane and the region it encloses are denoted by a label from the set $H = \{1, 2, \ldots, m\}$, where $m$ is called the degree of the membrane system. $\omega_i \in V^*$ $(1 \le i \le m)$ denotes the multiset of objects contained in region $i$ of the membrane structure $\mu$, where $V^*$ is the set of all character objects formed from the characters in $V$. $R_i$ $(1 \le i \le m)$ is a finite set of evolution rules, each $R_i$ being associated with region $i$ of $\mu$; $\rho_i$ is a partial order on $R_i$, called the priority relation, which specifies the priority with which the rules in $R_i$ are executed. An evolution rule in $R_i$ is a pair $(u, v)$, usually written $u \to v$. The characters in $v$ may or may not belong to $V$; when executing a rule produces a character object that does not belong to $V$, the membrane is dissolved after the rule executes. The length of $u$, that is, the number of character objects in $u$, is called the radius of the rule $u \to v$.
Step 5: dissolution. After each elementary membrane completes its particle swarm algorithm, it ruptures and releases the newly generated character individuals back into the skin membrane.
Step 6: calculate the front points and store them in the data storage module. The Pareto front points of all characters released into the skin membrane in step 5 are calculated, and these points are stored in the data storage module.
Step 7: calculate the non-dominated sorting and update the data storage module. It is judged whether the number of characters in the data storage module exceeds the limit; if so, the crowding distances of all characters in the archive are recalculated and points are deleted one by one, starting from the point with the smallest crowding distance, until the number of characters in the data storage module equals the preset value; the time complexity is O(D × 2R × log(2R)).
Step 8: iterate. It is judged whether the current state satisfies the termination condition; if not, execution continues from step 3; if so, all characters in the external archive are output.
Initialization and fitness evaluation are required before the Pareto front points are calculated. N characters are generated in the skin membrane, representing the number of extracted radar emitter signal feature sets; each character contains a D-dimensional variable, and the N characters are initialized in turn, subject to the constraints of the multi-objective optimization problem, using binary coding. Each component of an individual x = {x_1, x_2, ..., x_D} takes a value in {0, 1}; a value of 1 means that the corresponding feature is selected. At initialization, the variance of all sample values on each feature is calculated, and the selection probability P is then calculated from these variances, where v_j denotes the variance of all sample values on the j-th feature; when P is greater than 0.5, the feature is more likely to be selected.
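The pairwise search for Pareto front points in step 1 can be illustrated with a short Python sketch. It assumes that every objective is to be minimised and that each individual's fitness is already available as a tuple of objective values; the names `dominates` and `pareto_front` are hypothetical and are not taken from the source.

```python
from typing import List, Sequence


def dominates(a: Sequence[float], b: Sequence[float]) -> bool:
    """True if `a` is no worse than `b` on every objective and strictly
    better on at least one (all objectives minimised, by assumption)."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def pareto_front(fitness: Sequence[Sequence[float]]) -> List[int]:
    """Indices of the non-dominated individuals.

    Every individual is compared with every other one, which gives the
    O(N^2) cost quoted for step 1."""
    return [
        i
        for i, fi in enumerate(fitness)
        if not any(dominates(fj, fi) for j, fj in enumerate(fitness) if j != i)
    ]


# Two objectives per individual, e.g. (redundancy, negated relevance).
population_fitness = [(1, 5), (2, 2), (3, 3), (5, 1), (4, 4)]
print(pareto_front(population_fitness))
```

Returning indices rather than the fitness tuples keeps the link back to the binary-coded individuals, so the selected feature subsets can be looked up directly.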
Further, the search method of the search module is as follows:
First, a search term is received.
Then, multiple search results are generated and provided according to the search term, where each search result includes multiple attribute parameters and at least some of these attribute parameters have corresponding entity identifiers; the attribute parameters with entity identifiers include the author name and/or the publication venue.
Finally, when an attribute parameter in a search result is triggered, new search results are generated and provided according to the entity identifier corresponding to that attribute parameter.
The analysis method of the big data analysis module is as follows:
Step 1: analyse and process the local paper data set, and construct a paper citation network in the database.
Step 2: design an analysis algorithm based on the citation relationships in the paper citation network; use it to obtain the importance of, and the relationships between, the nodes of the paper citation network, and to obtain the importance of each paper relative to the centre paper. The centre paper is the paper that the user queries by input.
Step 3: convert the one-to-one citation relationships between papers into a mapping set in the citing direction and a mapping set in the cited direction, obtain the development path between specified papers in the paper citation network, and calculate the importance of the path from the paper importance obtained in step 2.
Further, before the search term is received, the method also includes:
obtaining multiple papers;
extracting from each of the papers the corresponding author name and the institution to which the author belongs;
if the author name corresponding to a paper is unique, generating the entity identifier from the author name; and
if the author name corresponding to a paper is not unique, generating the entity identifier from the author name and the institution to which the author belongs.
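The identifier rule above can be sketched in a few lines of Python. The sketch assumes that a name counts as unique when it occurs with only one institution in the collection, and the `name@institution` format is purely illustrative; the source does not specify how the identifier is encoded. The function name `build_entity_ids` and the sample records are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Tuple


def build_entity_ids(authors: List[Tuple[str, str]]) -> Dict[Tuple[str, str], str]:
    """Map each (author name, institution) pair to an entity identifier."""
    distinct = set(authors)
    # How many different institutions each name appears with.
    name_counts = Counter(name for name, _ in distinct)
    ids = {}
    for name, institution in distinct:
        if name_counts[name] == 1:
            ids[(name, institution)] = name  # name alone is unambiguous
        else:
            ids[(name, institution)] = f"{name}@{institution}"  # disambiguate
    return ids


records = [
    ("Li Wei", "Institute A"),
    ("Li Wei", "Institute B"),
    ("Zhang Min", "Institute C"),
    ("Zhang Min", "Institute C"),  # same author on a second paper
]
for key, entity_id in sorted(build_entity_ids(records).items()):
    print(key, "->", entity_id)
```

Because search results carry the identifier rather than the raw name, triggering the author attribute of a paper by one "Li Wei" returns only that author's papers.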
Step 1 includes:
Step 1.1: use text processing and analysis techniques to extract the citation information from the local paper data; the citation information records which papers each paper in the collection cites.
Step 1.2: construct the paper citation network.
Step 1.3: compare the citation information in the resulting paper citation network and remove duplicates, store it and build an index using database software, and store the citation relationships between papers in the database as key-value pairs.
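As a rough illustration of step 1.3, the following Python sketch removes duplicate citation records and keeps the result as key-value pairs. An in-memory dictionary stands in for the database software, which the source does not name; `build_citation_store` and the paper identifiers are hypothetical.

```python
from typing import Dict, Iterable, List, Tuple


def build_citation_store(references: Iterable[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Key: identifier of the citing paper.
    Value: identifiers of the papers it cites, duplicates removed,
    in the order in which they were first seen."""
    store: Dict[str, List[str]] = {}
    seen = set()
    for citing, cited in references:
        if (citing, cited) in seen:
            continue  # duplicate citation record
        seen.add((citing, cited))
        store.setdefault(citing, []).append(cited)
    return store


extracted = [("A", "B"), ("A", "C"), ("A", "B"), ("B", "C")]
print(build_citation_store(extracted))
```

In a real deployment each key-value pair would be written to an indexed key-value table so that the papers cited by a given paper can be fetched with a single lookup.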
Further, step 2 includes:
Step 2.1: calculate a score based on link-structure analysis from the citation relationships in the paper citation network.
Step 2.2: traverse the subgraph of interest to the user with a breadth-first search and calculate, for each paper, the ratio of its longest path to its shortest path relative to the centre paper, as the score based on citation-level analysis:
score based on citation-level analysis = longest path / shortest path
Step 2.3: select the corresponding weights and combine the score based on link-structure analysis and the score based on citation-level analysis into the final importance of each other paper relative to the centre paper:
importance = weight of link-structure analysis × score based on link-structure analysis + weight of citation-level analysis × score based on citation-level analysis
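A possible reading of steps 2.2 and 2.3 is sketched below in Python. Several points are assumptions rather than statements from the source: the citation graph is taken to be acyclic so that a longest path is well defined, the shortest path comes from a breadth-first search while the longest path comes from a pass over a topological order, the link-structure scores are supplied precomputed in `link_scores`, and the weights of 0.5 are placeholders. All function names are hypothetical.

```python
from collections import deque
from typing import Dict, List


def shortest_path_lengths(graph: Dict[str, List[str]], source: str) -> Dict[str, int]:
    """Breadth-first search: fewest citation hops from `source`."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nxt in graph.get(node, []):
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    return dist


def longest_path_lengths(graph: Dict[str, List[str]], source: str) -> Dict[str, int]:
    """Most citation hops from `source`; assumes the graph has no cycles."""
    order, visited = [], set()

    def visit(node: str) -> None:
        visited.add(node)
        for nxt in graph.get(node, []):
            if nxt not in visited:
                visit(nxt)
        order.append(node)  # post-order; reversed, it is a topological order

    visit(source)
    longest = {source: 0}
    for node in reversed(order):
        for nxt in graph.get(node, []):
            longest[nxt] = max(longest.get(nxt, 0), longest[node] + 1)
    return longest


def importance(graph, center, link_scores, w_link=0.5, w_level=0.5):
    """Weighted sum of the link-structure score and the citation-level score."""
    shortest = shortest_path_lengths(graph, center)
    longest = longest_path_lengths(graph, center)
    return {
        paper: w_link * link_scores.get(paper, 0.0)
        + w_level * (longest[paper] / shortest[paper])
        for paper in shortest
        if paper != center
    }


citations = {"C": ["A", "B"], "A": ["B"], "B": []}
print(importance(citations, "C", {"A": 0.4, "B": 0.2}))
```

In this example paper B is reached both directly and through A, so its longest-to-shortest ratio is 2 and it outranks A despite the lower link-structure score.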
Step 3 includes:
Step 3.1: convert the one-to-one citation relationships between papers in the database into a mapping set in the citing direction and a mapping set in the cited direction.
Step 3.2: perform a preliminary analysis of the citation relationships between papers, using the dictionary data structure of the Python programming language.
Step 3.3: extract the information on the path between two papers.
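Steps 3.1 to 3.3 can be sketched with Python dictionaries as follows. The two mapping sets are modelled as dictionaries of sets, and the path between two papers is found with a breadth-first search; the source does not say which search is used, so that choice, like the names `build_mappings` and `find_path`, is an assumption.

```python
from collections import defaultdict, deque
from typing import Dict, Iterable, List, Optional, Set, Tuple


def build_mappings(pairs: Iterable[Tuple[str, str]]):
    """Turn one-to-one (citing, cited) pairs into two dictionaries:
    `cites[p]` holds the papers p cites, `cited_by[p]` the papers citing p."""
    cites: Dict[str, Set[str]] = defaultdict(set)
    cited_by: Dict[str, Set[str]] = defaultdict(set)
    for citing, cited in pairs:
        cites[citing].add(cited)
        cited_by[cited].add(citing)
    return dict(cites), dict(cited_by)


def find_path(mapping: Dict[str, Set[str]], start: str, goal: str) -> Optional[List[str]]:
    """Shortest chain of papers from `start` to `goal`, or None if none exists."""
    parents = {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            path = []
            while node is not None:  # walk back to the start
                path.append(node)
                node = parents[node]
            return path[::-1]
        for nxt in sorted(mapping.get(node, ())):
            if nxt not in parents:
                parents[nxt] = node
                queue.append(nxt)
    return None


pairs = [("P3", "P2"), ("P2", "P1"), ("P3", "P1")]
cites, cited_by = build_mappings(pairs)
print(find_path(cites, "P3", "P1"))     # following citations backwards in time
print(find_path(cited_by, "P1", "P3"))  # following citations forwards in time
```

Keeping both directions means a development path can be traced from an older paper to the work that builds on it, or from a recent paper back to its sources, without rescanning the pair list.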
Another object of the present invention is to provide a computer program that implements the method for extracting unstructured data from power technology journal papers.
Another object of the present invention is to provide an information data processing terminal that implements the method for extracting unstructured data from power technology journal papers.
Another object of the present invention is to provide a computer-readable storage medium including instructions that, when run on a computer, cause the computer to execute the method for extracting unstructured data from power technology journal papers.
Another object of the present invention is to provide a system for extracting unstructured data from power technology journal papers that implements the method, including:
an input module, connected to the search module, for inputting the title and path of a paper;
a search module, connected to the input module and the big data analysis module, for searching for key data information of the paper content;
a big data analysis module, connected to the search module and the extraction module, for analysing the paper content and related papers;
an extraction module, connected to the big data analysis module and the data storage module, for extracting the author information, abstract, keywords and other core data of the paper;
a data storage module, connected to the extraction module, for storing the extracted author information, abstract, keywords and other core data.
Another object of the present invention is to provide a data extraction device equipped with the system for extracting unstructured data from power technology journal papers.
The advantages and positive effects of the present invention are as follows.
Through the search module, the invention first receives a search term and then generates and provides multiple search results according to it, where each search result includes multiple attribute parameters, at least some of which have corresponding entity identifiers. When an attribute parameter in a search result is triggered, new search results are generated and provided according to the corresponding entity identifier. Because an attribute parameter can be treated as an entity, converting the entity into an entity identifier and relying on the uniqueness of that identifier yields the corresponding search results. This resolves problems such as entities with the same name and partial matching of long search terms, and improves both the accuracy of search results and the efficiency of data extraction. Meanwhile, the big data analysis module shows more effectively the contribution and relative importance of other papers to the paper queried by the user, so that staff can more easily find other related papers starting from a paper of interest. The distributed processing method of the big data analysis helps to improve the speed of academic big data analysis; when a paper query system is built using the invention, the distributed processing method effectively reduces query response time, allows paper data elements to be analysed quickly, and improves the data extraction speed.
Compared with the MOPSO, SPEA2 and PESA2 algorithms, the feature extraction method of the invention performs well in accuracy, convergence speed and uniformity of the result distribution. It converges quickly, its approximate Pareto front points are evenly distributed, and it approximates the true Pareto front well. The new algorithm is therefore feasible and effective for solving multi-objective optimization problems.
The important feature subsets extracted by the multi-objective feature selection of the invention show good clustering at SNR = 4 dB and above: the signals are clearly separable, with clear boundaries and no overlap, which simplifies the design of the sorter, improves the sorting recognition rate and favours practical application. Finally, 100 independent experiments were run on the signal feature subsets using the traditional FCM clustering algorithm; the average clustering accuracies obtained by the MPSO, NSGAII and SPEA2 algorithms are 99%, 82% and 78%, respectively, which shows that the proposed algorithm has a higher extraction rate.
Brief description of the drawings
Fig. 1 is a flow chart of the method for extracting unstructured data from power technology journal papers provided by an embodiment of the present invention.
Fig. 2 is a structural block diagram of the system for extracting unstructured data from power technology journal papers provided by an embodiment of the present invention.
In Fig. 2: 1, input module; 2, search module; 3, big data analysis module; 4, extraction module; 5, data storage module.
Detailed description of the embodiments
To make the objects, technical solutions and advantages of the present invention clearer, the invention is described in further detail below with reference to embodiments. It should be understood that the specific embodiments described here only explain the invention and are not intended to limit it.
The application principle of the invention is further described below with reference to the drawings and specific embodiments.
As shown in Fig. 1, the method for extracting unstructured data from power technology journal papers provided by the present invention includes the following steps:
S101: input the title and path of a paper through the input module;
S102: search for key data information of the paper content through the search module;
S103: analyse the paper content and related papers through the big data analysis module;
S104: extract the author information, abstract, keywords and other core data of the paper through the extraction module;
S105: store the extracted author information, abstract, keywords and other core data through the data storage module.
As shown in Fig. 2, the system for extracting unstructured data from power technology journal papers provided by an embodiment of the present invention includes: an input module 1, a search module 2, a big data analysis module 3, an extraction module 4 and a data storage module 5.
The input module 1 is connected to the search module 2 and is used to input the title and path of a paper.
The search module 2 is connected to the input module 1 and the big data analysis module 3 and is used to search for key data information of the paper content.
The big data analysis module 3 is connected to the search module 2 and the extraction module 4 and is used to analyse the paper content and related papers.
The extraction module 4 is connected to the big data analysis module 3 and the data storage module 5 and is used to extract the author information, abstract, keywords and other core data of the paper.
The data storage module 5 is connected to the extraction module 4 and is used to store the extracted author information, abstract, keywords and other core data.
The invention is further described below with reference to a specific analysis.
The method for extracting unstructured data from power technology journal papers provided by an embodiment of the present invention includes the following.
The extraction method of the extraction module specifically includes:
Step 1: calculate the Pareto front points. The fitness of each radar signal feature individual is calculated according to the relevance and redundancy objective functions, and the Pareto front points among the current character individuals are found; the time complexity is O(N^2).
Step 2: initialize the data storage module. If the number of Pareto front points is less than a preset value R, all points are stored directly in the data storage module. If the number of Pareto front points is greater than the preset value, the crowding distances of all Pareto front points are calculated according to formula (5) and points are deleted one by one, starting from the point with the smallest crowding distance, until the number of Pareto front points to be stored in the data storage module equals the preset value; these front points are then stored in the data storage module.

$$d_i = \sum_{m=1}^{n} \frac{f_m^{i+1} - f_m^{i-1}}{f_m^{\max} - f_m^{\min}} \tag{5}$$

where $n$ is the number of objective functions, $d_i$ is the crowding distance of the $i$-th character object in the population, $f_m^{\max}$ and $f_m^{\min}$ are the maximum and minimum values of the $m$-th objective function in the population, and $f_m^{i+1}$ and $f_m^{i-1}$ are the $m$-th objective function values of the points nearest to the $i$-th character object on either side in dimension $m$.
Step 3: call the division rule to create elementary membranes. After the preparation is complete, division starts in the skin membrane and generates M elementary membranes; the number M of elementary membranes equals the number of Pareto front points in the data storage module. Each archived Pareto front point then serves as the best individual of the population in its elementary membrane. Finally, each remaining individual is placed in the elementary membrane containing the Pareto front point nearest to it; the time complexity is O(N × R).
Step 4: execute the particle swarm algorithm independently in each elementary membrane. In each elementary membrane, the Pareto front point first stored in the data storage module is taken as the best individual of the population, and the new velocity and position of each individual are calculated according to formula (3), $X_{t+1} = X_t + V_{t+1}$, the associated velocity formula, and formula (4), $\Pi = (V, T, C, \mu, \omega_1, \ldots, \omega_m, (R_1, \rho_1), \ldots, (R_m, \rho_m))$; the fitness is then recalculated at the new position. In formula (3), $V_t$ and $V_{t+1}$ are the velocities of the $t$-th and $(t+1)$-th flights, and $X_t$ and $X_{t+1}$ are the positions at which the particle lands after the $t$-th and $(t+1)$-th flights. In formula (4), $V$ is the alphabet, whose elements are character objects; it is an abstraction of the chemical elements and substances inside a cell. $T$ is the output alphabet. $C$ is the set of catalysts: these elements do not change during the evolution of the cell and do not generate new characters, but certain evolution rules can execute only with their participation, and without them those rules cannot be executed. $\mu$ is the membrane structure containing $m$ membranes; each membrane and the region it encloses are denoted by a label from the set $H = \{1, 2, \ldots, m\}$, where $m$ is called the degree of the membrane system. $\omega_i \in V^*$ $(1 \le i \le m)$ denotes the multiset of objects contained in region $i$ of the membrane structure $\mu$, where $V^*$ is the set of all character objects formed from the characters in $V$. $R_i$ $(1 \le i \le m)$ is a finite set of evolution rules, each $R_i$ being associated with region $i$ of $\mu$; $\rho_i$ is a partial order on $R_i$, called the priority relation, which specifies the priority with which the rules in $R_i$ are executed. An evolution rule in $R_i$ is a pair $(u, v)$, usually written $u \to v$. The characters in $v$ may or may not belong to $V$; when executing a rule produces a character object that does not belong to $V$, the membrane is dissolved after the rule executes. The length of $u$, that is, the number of character objects in $u$, is called the radius of the rule $u \to v$.
Step 5, dissolution: after each elementary membrane completes its own particle swarm algorithm, the membrane ruptures and the newly generated character individuals are released back into the skin membrane;
Step 6, compute the front points and put them into the data storage module: compute the Pareto front points of all characters released into the skin membrane in step 5, and store these points in the data storage module;
Step 7, perform non-dominated sorting and update the data storage module: judge whether the number of characters in the data storage module exceeds the limit; if it does, recompute the crowding distance of all characters in the archive and delete points one by one, starting from the point with the smallest crowding distance, until the number of characters in the data storage module equals the preset value; the time complexity is O(D × 2R × log(2R));
Step 8, iteration: judge whether the current state satisfies the loop termination condition; if not, continue from step 3; if so, output all characters in the external archive;
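Steps 6 and 7 can be illustrated with the following minimal Python sketch. It is not the patented implementation: it assumes that all objectives are minimized and that the crowding distance is the standard one used in NSGA-II style algorithms, and the function names `pareto_front`, `crowding_distances`, and `truncate_archive` are hypothetical.

```python
from typing import List, Sequence


def dominates(a: Sequence[float], b: Sequence[float]) -> bool:
    """True if a is no worse than b on every objective and better on at least one."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def pareto_front(points: List[Sequence[float]]) -> List[int]:
    """Indices of the non-dominated points, found by pairwise comparison."""
    return [i for i, p in enumerate(points)
            if not any(dominates(q, p) for j, q in enumerate(points) if j != i)]


def crowding_distances(points: List[Sequence[float]]) -> List[float]:
    """Crowding distance of every point (assumed NSGA-II style definition)."""
    n = len(points)
    dist = [0.0] * n
    if n == 0:
        return dist
    for m in range(len(points[0])):
        order = sorted(range(n), key=lambda i: points[i][m])
        lo, hi = points[order[0]][m], points[order[-1]][m]
        # Boundary points are always kept, so they get an infinite distance.
        dist[order[0]] = dist[order[-1]] = float("inf")
        if hi == lo:
            continue
        for k in range(1, n - 1):
            gap = points[order[k + 1]][m] - points[order[k - 1]][m]
            dist[order[k]] += gap / (hi - lo)
    return dist


def truncate_archive(points: List[Sequence[float]], limit: int) -> List[Sequence[float]]:
    """Delete the most crowded point repeatedly until the archive fits the limit."""
    archive = list(points)
    while len(archive) > limit:
        d = crowding_distances(archive)
        archive.pop(d.index(min(d)))
    return archive


candidates = [(1, 5), (2, 3), (3, 4), (4, 1), (5, 5)]
front = [candidates[i] for i in pareto_front(candidates)]
archive = truncate_archive(front, limit=2)
```

Recomputing the distances after every deletion mirrors the "delete one by one" wording of step 7; a single sort followed by a bulk cut would be cheaper but can remove neighbouring points that are no longer crowded once the first one is gone.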
Initialization and fitness evaluation are required before the Pareto front points are computed. N characters are generated in the skin membrane, where N denotes the number of extracted radar emitter signal feature sets; each character contains a D-dimensional variable, and the N characters are initialized in turn, subject to the constraints of the multi-objective optimization problem, using binary encoding. Each component of an individual x = {x1, x2, ..., xD} takes a value in {0, 1}; a value of 1 means that the corresponding feature is selected. During initialization, the variance of all sample values on each feature is computed, and the selection probability is then calculated according to the following formula;
vj denotes the variance of all sample values on the j-th feature; when P is greater than 0.5, the feature is likely to be selected.
The search method of the search module is as follows:
First, a search term is received;
Then, multiple search results are generated according to the search term and provided, where each search result includes multiple attribute parameters and at least some of those attribute parameters have corresponding entity identifiers; the attribute parameters that have entity identifiers include the author name and/or the publication venue;
Finally, when an attribute parameter in a search result is triggered, new search results are generated according to the entity identifier corresponding to that attribute parameter and provided;
The analysis method of the big data analysis module is as follows:
Step 1: analyze and process the local paper data set, and then construct a paper citation network in the database;
Step 2: design an analysis algorithm according to the citation relationships in the paper citation network, use that algorithm to obtain the importance of the nodes in the paper citation network and their relationships to one another, and obtain the importance of each paper relative to the center paper; the center paper is the particular paper that the user queries by input;
Step 3: convert the one-to-one citation relationships between papers into a mapping set in the citing direction and a mapping set in the cited direction, obtain the development paths between specified papers in the paper citation network, and compute the importance of each path from the paper importance obtained in step 2.
Before the search term is received, the method further includes:
obtaining multiple papers;
extracting, from each of the papers, the name of each author and the institution to which the author belongs;
if an author name is unique, generating the entity identifier from the author name; and
if an author name is not unique, generating the entity identifier from the author name and the institution to which the author belongs.
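A minimal sketch of this identifier rule is given below. It assumes that a name counts as "unique" when it appears with exactly one institution in the collected papers, and it uses a truncated SHA-1 digest as the identifier; both choices, and the function name `build_entity_ids`, are illustrative assumptions rather than details from the source.

```python
import hashlib
from collections import defaultdict
from typing import Dict, List, Tuple


def build_entity_ids(records: List[Tuple[str, str]]) -> Dict[Tuple[str, str], str]:
    """Map each (author name, institution) pair to an entity identifier."""
    institutions = defaultdict(set)
    for name, institution in records:
        institutions[name].add(institution)

    ids = {}
    for name, institution in records:
        if len(institutions[name]) == 1:
            key = name                        # unique name: the name alone suffices
        else:
            key = f"{name}|{institution}"     # ambiguous name: add the institution
        ids[(name, institution)] = hashlib.sha1(key.encode("utf-8")).hexdigest()[:12]
    return ids


ids = build_entity_ids([
    ("Li Wei", "University A"),
    ("Li Wei", "Institute B"),
    ("Zhang San", "University A"),
])
```

Here the two authors named Li Wei receive different identifiers, while Zhang San is identified by name alone.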
Step 1 includes:
Step 1.1: use text processing and analysis techniques to extract the citation information from the local paper data; the citation information records which papers each paper in the collection cites;
Step 1.2: construct the paper citation network;
Step 1.3: compare the citation information in the resulting paper citation network and remove duplicate content, store it with database software and build an index, and store the citation relationships between papers in the database as key-value pairs.
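Step 1.3 could look like the following sketch, which uses Python's built-in `sqlite3` module as a stand-in for the unnamed database software. The table name `citation`, the column names, and the helper functions are assumptions made for illustration; the composite primary key removes duplicate citation pairs, and the extra index supports lookups in the cited direction.

```python
import sqlite3
from typing import Iterable, List, Tuple


def store_citations(conn: sqlite3.Connection, pairs: Iterable[Tuple[str, str]]) -> None:
    """Store (citing, cited) pairs, silently dropping duplicates."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS citation ("
        "citing TEXT NOT NULL, cited TEXT NOT NULL, "
        "PRIMARY KEY (citing, cited))"
    )
    # The primary key already indexes the citing direction; add the reverse one.
    conn.execute("CREATE INDEX IF NOT EXISTS idx_citation_cited ON citation (cited)")
    conn.executemany(
        "INSERT OR IGNORE INTO citation (citing, cited) VALUES (?, ?)", pairs
    )
    conn.commit()


def references_of(conn: sqlite3.Connection, paper_id: str) -> List[str]:
    """Papers cited by the given paper."""
    rows = conn.execute(
        "SELECT cited FROM citation WHERE citing = ? ORDER BY cited", (paper_id,)
    )
    return [row[0] for row in rows]


conn = sqlite3.connect(":memory:")
store_citations(conn, [("A", "B"), ("A", "B"), ("A", "C"), ("B", "C")])
```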
Further, step 2 includes:
Step 2.1: compute a score based on link structure from the citation relationships in the paper citation network;
Step 2.2: traverse the subgraph of interest to the user with a breadth-first search and compute, for each paper, the ratio of its longest path to its shortest path relative to the center paper, which serves as the score based on citation hierarchy analysis; the formula is as follows:
score based on citation hierarchy analysis = longest path / shortest path;
Step 2.3: select the corresponding weights and combine the score based on link structure with the score based on citation hierarchy analysis to obtain the final importance of each other paper relative to the center paper; the formula is as follows:
importance = weight of link structure × score based on link structure + weight of citation hierarchy analysis × score based on citation hierarchy analysis;
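The two formulas of steps 2.2 and 2.3 can be sketched as below. Several points are assumptions, not statements from the source: paths are followed from the center paper along citation edges, the citation graph is assumed to be acyclic so that a longest path is well defined, and the link-structure score is passed in as a plain number because the source does not say how it is computed.

```python
from collections import deque
from functools import lru_cache
from typing import Dict, List, Optional

Graph = Dict[str, List[str]]  # paper -> papers it cites


def shortest_path_len(graph: Graph, src: str, dst: str) -> Optional[int]:
    """Breadth-first search; returns the number of edges or None if unreachable."""
    seen, queue = {src}, deque([(src, 0)])
    while queue:
        node, depth = queue.popleft()
        if node == dst:
            return depth
        for nxt in graph.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, depth + 1))
    return None


def longest_path_len(graph: Graph, src: str, dst: str) -> Optional[int]:
    """Longest path by memoized depth-first search (valid only for acyclic graphs)."""
    @lru_cache(maxsize=None)
    def longest(node: str) -> Optional[int]:
        if node == dst:
            return 0
        best = None
        for nxt in graph.get(node, ()):
            sub = longest(nxt)
            if sub is not None and (best is None or sub + 1 > best):
                best = sub + 1
        return best
    return longest(src)


def hierarchy_score(graph: Graph, center: str, paper: str) -> float:
    """Score based on citation hierarchy analysis = longest path / shortest path."""
    shortest = shortest_path_len(graph, center, paper)
    longest = longest_path_len(graph, center, paper)
    if not shortest or longest is None:
        return 0.0
    return longest / shortest


def importance(link_score: float, hier_score: float,
               w_link: float, w_hier: float) -> float:
    """Weighted sum of the two scores, as in step 2.3."""
    return w_link * link_score + w_hier * hier_score


graph = {"C": ["A", "B"], "A": ["B"], "B": []}
score = hierarchy_score(graph, "C", "B")
```

In this small graph, paper B is reachable from the center paper C both directly and through A, so its longest path has two edges and its shortest path has one.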
Step 3 includes:
Step 3.1: convert the one-to-one citation relationships between papers in the database into a mapping set in the citing direction and a mapping set in the cited direction;
Step 3.2: perform a preliminary analysis of the citation relationships between papers, using the dictionary data structure of the Python programming language;
Step 3.3: extract the path information between two papers.
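Steps 3.1 to 3.3 can be illustrated with plain Python dictionaries, as step 3.2 suggests. The sketch below is an assumption about how the two mapping sets and the path extraction might be written; the names `build_maps` and `all_paths` are hypothetical, and the enumeration of every simple path is only practical on small subgraphs.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Optional, Set, Tuple


def build_maps(pairs: Iterable[Tuple[str, str]]):
    """Turn one-to-one (citing, cited) pairs into two dictionaries of sets."""
    cites: Dict[str, Set[str]] = defaultdict(set)      # citing direction
    cited_by: Dict[str, Set[str]] = defaultdict(set)   # cited direction
    for citing, cited in pairs:
        cites[citing].add(cited)
        cited_by[cited].add(citing)
    return cites, cited_by


def all_paths(cites: Dict[str, Set[str]], start: str, end: str,
              path: Optional[List[str]] = None) -> List[List[str]]:
    """Every simple path from start to end along citation edges."""
    path = (path or []) + [start]
    if start == end:
        return [path]
    found: List[List[str]] = []
    for nxt in sorted(cites.get(start, ())):
        if nxt not in path:  # skip papers already on the path
            found.extend(all_paths(cites, nxt, end, path))
    return found


cites, cited_by = build_maps([("C", "A"), ("C", "B"), ("A", "B")])
paths = all_paths(cites, "C", "B")
```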
The above embodiments may be implemented wholly or partly in software, hardware, firmware, or any combination thereof. When implemented wholly or partly as a computer program product, the computer program product includes one or more computer instructions. When the computer program instructions are loaded or executed on a computer, the processes or functions described in the embodiments of the present invention are produced in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable device. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another; for example, the computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center by wired means (such as coaxial cable, optical fiber, or digital subscriber line (DSL)) or wireless means (such as infrared, radio, or microwave). The computer-readable storage medium may be any available medium that a computer can access, or a data storage device, such as a server or data center, that integrates one or more available media. The available medium may be a magnetic medium (for example, a floppy disk, hard disk, or magnetic tape), an optical medium (for example, a DVD), or a semiconductor medium (for example, a solid state disk (SSD)).
The foregoing describes only preferred embodiments of the present invention and is not intended to limit it; any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within its scope of protection.
Claims (10)
1. A method for extracting unstructured data from power technology journal papers, characterized in that the method comprises:
searching for key data information in the paper content by a search module;
analyzing and processing the local paper data set by a big data analysis module, then constructing a paper citation network in the database, and analyzing the paper content and related papers;
initializing a data storage module by an extraction module: if the number of Pareto front points is less than a preset value R, all points are stored directly in the data storage module; if the number of Pareto front points is greater than the preset value, the crowding distance of all Pareto front points is computed according to formula (5), $d_i=\sum_{m=1}^{n}\dfrac{f_m^{i+1}-f_m^{i-1}}{f_m^{\max}-f_m^{\min}}$, and points are deleted one by one, starting from the point with the smallest crowding distance, until the number of candidate Pareto front points to be stored in the data storage module equals the preset value; these front points are then stored in the data storage module; the author information, abstract, and keyword core data information of the paper are then extracted;
in the formula, n denotes the number of objective functions, $d_i$ denotes the crowding distance of the i-th character object in the population, $f_m^{\max}$ denotes the maximum value of the m-th objective function in the population, $f_m^{\min}$ denotes the minimum value of the m-th objective function in the population, and $f_m^{i+1}$ and $f_m^{i-1}$ are the m-th objective function values of the points nearest to the i-th character object on either side in dimension m, wherein
storing the extracted author information, abstract, and keyword core data information by the data storage module.
2. The method for extracting unstructured data from power technology journal papers according to claim 1, characterized in that the extraction method of the extraction module specifically comprises:
Step 1, compute the Pareto front points: compute the fitness of each radar signal feature individual according to the correlation and redundancy objective functions, and find the Pareto front points among the current character individuals; the time complexity is O(N²);
Step 2, initialize the data storage module: if the number of Pareto front points is less than a preset value R, all points are stored directly in the data storage module; if the number of Pareto front points is greater than the preset value, the crowding distance of all Pareto front points is computed according to formula (5), $d_i=\sum_{m=1}^{n}\dfrac{f_m^{i+1}-f_m^{i-1}}{f_m^{\max}-f_m^{\min}}$, and points are deleted one by one, starting from the point with the smallest crowding distance, until the number of candidate Pareto front points to be stored in the data storage module equals the preset value; these front points are then stored in the data storage module;
in the formula, n denotes the number of objective functions, $d_i$ denotes the crowding distance of the i-th character object in the population, $f_m^{\max}$ denotes the maximum value of the m-th objective function in the population, $f_m^{\min}$ denotes the minimum value of the m-th objective function in the population, and $f_m^{i+1}$ and $f_m^{i-1}$ are the m-th objective function values of the points nearest to the i-th character object on either side in dimension m, wherein
Step 3, call the division rule to create elementary membranes: after the preparation is complete, division starts in the skin membrane and produces M elementary membranes; the number M of elementary membranes equals the number of Pareto front points in the data storage module; each archived Pareto front point then serves as the best individual of the population in its elementary membrane; finally, every remaining individual is placed in the elementary membrane that holds the Pareto front point nearest to it; the time complexity is O(N × R);
Step 4, execute the particle swarm algorithm independently in each elementary membrane: in each elementary membrane, the Pareto front point first stored in the data storage module is taken as the best individual of the population, and the new velocity and position of each individual are computed using formula (3), $X_{t+1}=X_t+V_{t+1}$, and its accompanying formula, together with formula (4), Π = (V, T, C, μ, ω1, …, ωm, (R1, ρ1), …, (Rm, ρm)); the fitness is then recomputed from the latest position; in formula (3), $V_t$ and $V_{t+1}$ are the velocities of the t-th and (t+1)-th flights, respectively, and $X_t$ and $X_{t+1}$ are the positions at which the particle lands after the t-th and (t+1)-th flights, respectively; in formula (4), V is the alphabet, whose elements are character objects and which is an abstraction of the metabolic elements and substances inside a cell; T is the output alphabet; C is the set of catalysts, elements that do not change during cell evolution and do not generate new characters, but whose participation is required for certain evolution rules to execute, so that without them such a rule cannot be executed; μ is a membrane structure comprising m membranes, where each membrane and the region it encloses are labeled by the set H = {1, 2, ..., m}, and m is called the degree of the membrane system; ωi ∈ V* (1 ≤ i ≤ m) denotes the multiset of objects contained in region i of the membrane structure μ, where V* is the set of all strings of character objects formed from the characters in V; Ri (1 ≤ i ≤ m) is a finite set of evolution rules, each Ri being associated with region i of the membrane structure μ, and ρi is a partial order on Ri, called the priority relation, which specifies the priority with which the rules in Ri are executed. Each evolution rule in Ri is a pair (u, v), usually written u → v; the characters in v may or may not belong to V, but when executing a rule produces a character object that does not belong to V, the membrane is dissolved after the rule has executed; the length of u, that is, the number of character objects in u, is called the radius of the rule u → v;
Step 5, dissolution: after each elementary membrane completes its own particle swarm algorithm, the membrane ruptures and the newly generated character individuals are released back into the skin membrane;
Step 6, compute the front points and put them into the data storage module: compute the Pareto front points of all characters released into the skin membrane in step 5, and store these points in the data storage module;
Step 7, perform non-dominated sorting and update the data storage module: judge whether the number of characters in the data storage module exceeds the limit; if it does, recompute the crowding distance of all characters in the archive and delete points one by one, starting from the point with the smallest crowding distance, until the number of characters in the data storage module equals the preset value; the time complexity is O(D × 2R × log(2R));
Step 8, iteration: judge whether the current state satisfies the loop termination condition; if not, continue from step 3; if so, output all characters in the external archive;
initialization and fitness evaluation are required before the Pareto front points are computed: N characters are generated in the skin membrane, where N denotes the number of extracted radar emitter signal feature sets; each character contains a D-dimensional variable, and the N characters are initialized in turn, subject to the constraints of the multi-objective optimization problem, using binary encoding; each component of an individual x = {x1, x2, ..., xD} takes a value in {0, 1}, and a value of 1 means that the corresponding feature is selected; during initialization, the variance of all sample values on each feature is computed, and the selection probability is then calculated according to the following formula;
vj denotes the variance of all sample values on the j-th feature; when P is greater than 0.5, the feature is likely to be selected.
3. The method for extracting unstructured data from power technology journal papers according to claim 1, characterized in that the search method of the search module is as follows:
first, a search term is received;
then, multiple search results are generated according to the search term and provided, where each search result includes multiple attribute parameters and at least some of those attribute parameters have corresponding entity identifiers; the attribute parameters that have entity identifiers include the author name and/or the publication venue;
finally, when an attribute parameter in a search result is triggered, new search results are generated according to the entity identifier corresponding to that attribute parameter and provided;
the analysis method of the big data analysis module is as follows:
Step 1: analyze and process the local paper data set, and then construct a paper citation network in the database;
Step 2: design an analysis algorithm according to the citation relationships in the paper citation network, use that algorithm to obtain the importance of the nodes in the paper citation network and their relationships to one another, and obtain the importance of each paper relative to the center paper; the center paper is the particular paper that the user queries by input;
Step 3: convert the one-to-one citation relationships between papers into a mapping set in the citing direction and a mapping set in the cited direction, obtain the development paths between specified papers in the paper citation network, and compute the importance of each path from the paper importance obtained in step 2.
4. The method for extracting unstructured data from power technology journal papers according to claim 3, characterized in that, before the search term is received, the method further comprises:
obtaining multiple papers;
extracting, from each of the papers, the name of each author and the institution to which the author belongs;
if an author name is unique, generating the entity identifier from the author name; and
if an author name is not unique, generating the entity identifier from the author name and the institution to which the author belongs;
step 1 includes:
Step 1.1: use text processing and analysis techniques to extract the citation information from the local paper data; the citation information records which papers each paper in the collection cites;
Step 1.2: construct the paper citation network;
Step 1.3: compare the citation information in the resulting paper citation network and remove duplicate content, store it with database software and build an index, and store the citation relationships between papers in the database as key-value pairs.
5. The method for extracting unstructured data from power technology journal papers according to claim 3, characterized in that step 2 includes:
Step 2.1: compute a score based on link structure from the citation relationships in the paper citation network;
Step 2.2: traverse the subgraph of interest to the user with a breadth-first search and compute, for each paper, the ratio of its longest path to its shortest path relative to the center paper, which serves as the score based on citation hierarchy analysis; the formula is as follows:
score based on citation hierarchy analysis = longest path / shortest path;
Step 2.3: select the corresponding weights and combine the score based on link structure with the score based on citation hierarchy analysis to obtain the final importance of each other paper relative to the center paper; the formula is as follows:
importance = weight of link structure × score based on link structure + weight of citation hierarchy analysis × score based on citation hierarchy analysis;
step 3 includes:
Step 3.1: convert the one-to-one citation relationships between papers in the database into a mapping set in the citing direction and a mapping set in the cited direction;
Step 3.2: perform a preliminary analysis of the citation relationships between papers, using the dictionary data structure of the Python programming language;
Step 3.3: extract the path information between two papers.
6. A computer program implementing the method for extracting unstructured data from power technology journal papers according to any one of claims 1 to 5.
7. An information data processing terminal implementing the method for extracting unstructured data from power technology journal papers according to any one of claims 1 to 5.
8. A computer-readable storage medium comprising instructions that, when run on a computer, cause the computer to execute the method for extracting unstructured data from power technology journal papers according to any one of claims 1 to 5.
9. A system for extracting unstructured data from power technology journal papers that implements the method according to claim 1, characterized in that the system comprises:
an input module, connected to the search module, for inputting the title and path of a paper;
a search module, connected to the input module and the big data analysis module, for searching for key data information in the paper content;
a big data analysis module, connected to the search module and the extraction module, for analyzing the paper content and related papers;
an extraction module, connected to the big data analysis module and the data storage module, for extracting core data information such as the author information, abstract, and keywords of a paper;
a data storage module, connected to the extraction module, for storing the extracted core data information such as the author information, abstract, and keywords.
10. A data extraction device equipped with the system for extracting unstructured data from power technology journal papers according to claim 9.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810600133.0A CN108874990A (en) | 2018-06-12 | 2018-06-12 | A kind of method and system extracted based on power technology journal article unstructured data |
Publications (1)
Publication Number | Publication Date |
---|---|
CN108874990A true CN108874990A (en) | 2018-11-23 |
Family
ID=64337997
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201810600133.0A Pending CN108874990A (en) | 2018-06-12 | 2018-06-12 | A kind of method and system extracted based on power technology journal article unstructured data |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN108874990A (en) |
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109557529A (en) * | 2018-11-28 | 2019-04-02 | 中国人民解放军国防科技大学 | Radar target detection method based on generalized Pareto distribution clutter statistical modeling |
CN110188928A (en) * | 2019-05-15 | 2019-08-30 | 张会丽 | A kind of the formative optimization system and method for cloud data education training process |
CN110245208A (en) * | 2019-04-30 | 2019-09-17 | 广东省智能制造研究所 | A kind of retrieval analysis method, apparatus and medium based on big data storage |
CN111177323A (en) * | 2019-12-31 | 2020-05-19 | 国网安徽省电力有限公司安庆供电公司 | Power failure plan unstructured data extraction and identification method based on artificial intelligence |
CN112580305A (en) * | 2019-09-11 | 2021-03-30 | 陈涛 | Method for providing writing guide for writing and word processing equipment |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20100100562A1 (en) * | 2008-10-01 | 2010-04-22 | Jerry Millsap | Fully Parameterized Structured Query Language |
CN103279506A (en) * | 2013-05-15 | 2013-09-04 | 云南电力试验研究院(集团)有限公司电力研究院 | Method for extracting journal paper unstructured data based on electric power technology |
CN104239570A (en) * | 2014-09-30 | 2014-12-24 | 百度在线网络技术(北京)有限公司 | Method and device for searching for paper |
CN104376038A (en) * | 2014-09-12 | 2015-02-25 | 中国人民解放军信息工程大学 | Position associated text information visualization method based on label cloud |
CN105808729A (en) * | 2016-03-08 | 2016-07-27 | 上海交通大学 | Academic big data analysis method based on reference relationship among pieces of thesis |
CN107590436A (en) * | 2017-08-10 | 2018-01-16 | 云南财经大学 | Radar emitter signal feature selection approach based on peplomer subgroup multi-objective Algorithm |
Cited By (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109557529A (en) * | 2018-11-28 | 2019-04-02 | 中国人民解放军国防科技大学 | Radar target detection method based on generalized Pareto distribution clutter statistical modeling |
CN109557529B (en) * | 2018-11-28 | 2019-08-30 | 中国人民解放军国防科技大学 | Radar target detection method based on generalized Pareto distribution clutter statistical modeling |
CN110245208A (en) * | 2019-04-30 | 2019-09-17 | 广东省智能制造研究所 | A kind of retrieval analysis method, apparatus and medium based on big data storage |
CN110188928A (en) * | 2019-05-15 | 2019-08-30 | 张会丽 | A kind of the formative optimization system and method for cloud data education training process |
CN112580305A (en) * | 2019-09-11 | 2021-03-30 | 陈涛 | Method for providing writing guide for writing and word processing equipment |
CN111177323A (en) * | 2019-12-31 | 2020-05-19 | 国网安徽省电力有限公司安庆供电公司 | Power failure plan unstructured data extraction and identification method based on artificial intelligence |
CN111177323B (en) * | 2019-12-31 | 2022-04-01 | 国网安徽省电力有限公司安庆供电公司 | Power failure plan unstructured data extraction and identification method based on artificial intelligence |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN108874990A (en) | A kind of method and system extracted based on power technology journal article unstructured data | |
Liu et al. | Text features extraction based on TF-IDF associating semantic | |
US10459971B2 (en) | Method and apparatus of generating image characteristic representation of query, and image search method and apparatus | |
CN111104511B (en) | Method, device and storage medium for extracting hot topics | |
WO2022156328A1 (en) | Restful-type web service clustering method fusing service cooperation relationships | |
Ren et al. | Heterogeneous graph-based intent learning with queries, web pages and wikipedia concepts | |
Chatterjee et al. | Single document extractive text summarization using genetic algorithms | |
CN109033314A (en) | The Query method in real time and system of extensive knowledge mapping in the case of memory-limited | |
Bounabi et al. | A comparison of text classification methods using different stemming techniques | |
CN110309234A (en) | A kind of client of knowledge based map holds position method for early warning, device and storage medium | |
Tang et al. | Efficient Processing of Hamming-Distance-Based Similarity-Search Queries Over MapReduce. | |
Sutanto et al. | Fine-grained document clustering via ranking and its application to social media analytics | |
Zaw et al. | Web document clustering by using PSO-based cuckoo search clustering algorithm | |
CN109739984A (en) | A kind of parallel KNN network public-opinion sorting algorithm of improvement based on Hadoop platform | |
Yin et al. | Sentence-bert and k-means based clustering technology for scientific and technical literature | |
Li et al. | Extracting core questions in community question answering based on particle swarm optimization | |
Anupama et al. | A novel approach using incremental oversampling for data stream mining | |
Weston et al. | Latent structured ranking | |
CN106971011A (en) | A kind of big data analysis method based on cloud platform | |
CN113157915A (en) | Naive Bayes text classification method based on cluster environment | |
Rouba et al. | Weighted clustering ensemble: Towards learning the weights of the base clusterings | |
CN108090182B (en) | A kind of distributed index method and system of extensive high dimensional data | |
Lu et al. | Influence model of paper citation networks with integrated pagerank and HITS | |
CN117725555B (en) | Multi-source knowledge tree association fusion method and device, electronic equipment and storage medium | |
Hasanpour et al. | Optimal selection of ensemble classifiers using particle swarm optimization and diversity measures |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | Application publication date: 20181123 |