CN108874990A - Method and system for extracting unstructured data from power technology journal papers - Google Patents

Method and system for extracting unstructured data from power technology journal papers

Info

Publication number
CN108874990A
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201810600133.0A
Other languages
Chinese (zh)
Inventor
亓富军
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Individual
Original Assignee
Individual
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Individual
Priority to CN201810600133.0A
Publication of CN108874990A
Legal status: Pending

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/20: Natural language analysis
    • G06F40/279: Recognition of textual entities
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/20: Natural language analysis
    • G06F40/205: Parsing
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/004: Artificial life, i.e. computing arrangements simulating life
    • G06N3/006: Artificial life, i.e. computing arrangements simulating life based on simulated virtual individual or collective life forms, e.g. social simulations or particle swarm optimisation [PSO]

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Evolutionary Computation (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Biophysics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention belongs to the technical field of data processing and discloses a method and system for extracting unstructured data from power technology journal papers. The data extraction system comprises an input module, a search module, a big data analysis module, an extraction module and a data storage module. When an attribute parameter in a search result is triggered, the search module generates and provides a new search result according to the entity identifier corresponding to that attribute parameter. Because an attribute parameter can be treated as an entity, converting the entity into an entity identifier and relying on the uniqueness of that identifier yields the corresponding search result. This thoroughly resolves problems such as entities sharing the same name and partial matching of long search terms, improves the accuracy of search results and improves data extraction efficiency. Meanwhile, the big data analysis module effectively reduces query response time and allows paper data elements to be analysed quickly, which improves data extraction speed.

Description

Method and system for extracting unstructured data from power technology journal papers
Technical field
The invention belongs to the technical field of data processing, and in particular relates to a method and system for extracting unstructured data from power technology journal papers.
Background
The prior art commonly used in the industry is as follows:
With the rapid development of computer technology and the Internet, querying, retrieving and downloading professional data online has become an important means of scientific and technical information retrieval. For online full-text databases and abstract databases, the index of paper abstracts is an important tool for readers searching the literature, and it facilitates the construction and maintenance of scientific and technical literature retrieval systems. An abstract is a comprehensive introduction to a paper that lets the reader understand its main content. After a paper is published, abstract journals and databases can use the abstract directly with little or no modification, so that readers grasp the main content of the paper quickly and the limitations of the title are compensated. This avoids the misreadings, omissions and even errors that can arise when someone else writes the abstract. The quality of the abstract therefore directly affects how often the paper is retrieved and cited. However, traditional search relies on keyword matching and cannot solve the partial matching problem, so search results may be inaccurate, which affects data extraction efficiency. It also cannot provide much information about the lines of development between papers, so researchers cannot obtain research resources efficiently and paper data is analysed slowly.
In summary, the problems of the prior art are:
Traditional search relies on keyword matching and cannot solve the partial matching problem, so search results may be inaccurate, which affects data extraction efficiency.
It also cannot provide much information about the lines of development between papers, so researchers cannot obtain research resources efficiently and paper data is analysed slowly.
In a complex data environment, the dimension of the fused features keeps increasing, so the initial feature set obtained after feature extraction may have a very high dimension. Information redundancy then necessarily exists between features and classification performance degrades. Existing feature extraction techniques all add minimising the size of the feature subset as a further optimisation objective on top of existing single-objective optimisation. The size of the feature subset is a discrete objective, however, and the solution set obtained usually contains only one solution for each feature size, so other feature subsets of the same size but with different specific features cannot be found, even though these feature subsets are also useful for signal feature extraction. In addition, a multi-objective feature selection algorithm finally yields a series of compromise solutions from which a well-performing solution must be chosen, but few unsupervised methods are currently available for this. The main difficulties are: the designed feature subset evaluation function and search strategy fail to consider the redundancy and correlation of feature subsets; the evaluation criterion does not consider the influence of the choice of feature subset dimension on classification validity; and extracting the feature dimension and the importance ranking of feature subsets from the Pareto solution set of a multi-objective optimisation algorithm in an unsupervised manner remains unsolved.
Summary of the invention
In view of the problems of the prior art, the present invention provides a method for extracting unstructured data from power technology journal papers.
The invention is realised as a method for extracting unstructured data from power technology journal papers, including:
Inputting the title and the file path of a paper through the input module;
Searching for the key data information of the paper content through the search module;
Through the big data analysis module, analysing and processing the local paper data set, constructing a paper citation network in the database, and analysing the paper content and related papers;
Initialising the data storage module through the extraction module: if the number of Pareto front points is less than a preset value R, all points are stored directly in the data storage module; if the number of Pareto front points is greater than the preset value, the crowding distance of every Pareto front point is calculated according to formula (5), $d_i = \sum_{m=1}^{n} \frac{f_m^{i+1} - f_m^{i-1}}{f_m^{\max} - f_m^{\min}}$, and points are deleted one by one, starting from the point with the smallest crowding distance, until the number of Pareto front points that are candidates for the data storage module equals the preset value; these front points are then stored in the data storage module; the core data information of the paper, namely author information, abstract and keywords, is then extracted;
In the formula, n is the number of objective functions, $d_i$ is the crowding distance of the i-th character object in the population, $f_m^{\max}$ is the maximum value of the m-th objective function in the population, $f_m^{\min}$ is the minimum value of the m-th objective function in the population, and $f_m^{i+1}$ and $f_m^{i-1}$ are the m-th objective function values of the points closest to the i-th character object on either side in dimension m, wherein [formula missing in source];
Storing the extracted core data information (author information, abstract and keywords) through the data storage module.
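The archive truncation described above can be illustrated with a minimal Python sketch. It assumes the standard crowding distance used in NSGA-II style algorithms, and it assumes that boundary points in each objective receive an infinite distance, which the source does not state. The function names `crowding_distances` and `truncate_archive` are hypothetical.

```python
import math


def crowding_distances(points):
    """Crowding distance of each point; points is a list of objective tuples."""
    n_points = len(points)
    if n_points == 0:
        return []
    n_obj = len(points[0])
    dist = [0.0] * n_points
    for m in range(n_obj):
        # Sort point indices by their value in objective m.
        order = sorted(range(n_points), key=lambda i: points[i][m])
        f_min = points[order[0]][m]
        f_max = points[order[-1]][m]
        # Assumption: boundary points are always kept (infinite distance).
        dist[order[0]] = dist[order[-1]] = math.inf
        if f_max == f_min:
            continue
        for k in range(1, n_points - 1):
            gap = points[order[k + 1]][m] - points[order[k - 1]][m]
            dist[order[k]] += gap / (f_max - f_min)
    return dist


def truncate_archive(points, capacity):
    """Delete the most crowded point one at a time until capacity is met."""
    archive = list(points)
    while len(archive) > capacity:
        d = crowding_distances(archive)
        worst = min(range(len(archive)), key=lambda i: d[i])
        del archive[worst]
    return archive


front = [(0, 10), (1, 9), (1.1, 8.9), (5, 5), (10, 0)]
kept = truncate_archive(front, capacity=4)
```

In this sketch the distances are recomputed after every deletion, which matches the "delete one by one" wording; computing them once and deleting in sorted order would be cheaper but can give a less even spread.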
Further, the extraction method of the extraction module specifically includes:
Step 1: calculate the Pareto front points. The fitness of each radar signal feature individual is calculated according to the correlation and redundancy objective functions, and the Pareto front points among the current character individuals are found; the time complexity is O(N²);
Step 2: initialise the data storage module. If the number of Pareto front points is less than a preset value R, all points are stored directly in the data storage module. If the number of Pareto front points is greater than the preset value, the crowding distance of every Pareto front point is calculated according to formula (5), $d_i = \sum_{m=1}^{n} \frac{f_m^{i+1} - f_m^{i-1}}{f_m^{\max} - f_m^{\min}}$, and points are deleted one by one, starting from the point with the smallest crowding distance, until the number of Pareto front points that are candidates for the data storage module equals the preset value; these front points are then stored in the data storage module;
In the formula, n is the number of objective functions, $d_i$ is the crowding distance of the i-th character object in the population, $f_m^{\max}$ is the maximum value of the m-th objective function in the population, $f_m^{\min}$ is the minimum value of the m-th objective function in the population, and $f_m^{i+1}$ and $f_m^{i-1}$ are the m-th objective function values of the points closest to the i-th character object on either side in dimension m, wherein [formula missing in source];
Step 3: call the division rule to create elementary membranes. After preparation is complete, division starts in the skin membrane and generates M elementary membranes, where M equals the number of Pareto front points in the data storage module. Each archived Pareto front point then becomes the best individual of the population in its elementary membrane. Finally, each remaining individual is placed in the elementary membrane that contains the Pareto front point nearest to it; the time complexity is O(N × R);
Step 4: execute the particle swarm algorithm independently in each elementary membrane. In each elementary membrane, the Pareto front point first stored in the data storage module is taken as the best individual of the population, and the new speed and position of each individual are calculated from formula (3), $X_{t+1} = X_t + V_{t+1}$, the formula [formula missing in source], and formula (4), $\Pi = (V, T, C, \mu, \omega_1, \ldots, \omega_m, (R_1, \rho_1), \ldots, (R_m, \rho_m))$. The fitness is then recalculated from the new position. In formula (3), $V_t$ and $V_{t+1}$ are the speeds of the t-th and (t+1)-th flights respectively, and $X_t$ and $X_{t+1}$ are the positions at which the particle lands after the t-th and (t+1)-th flights respectively. In formula (4), V is the alphabet, whose elements are character objects; it is an abstraction of the chemical elements and substances metabolised inside a cell. T is the output alphabet. C is the set of catalysts; these elements do not change during cell evolution and do not generate new characters, but certain evolution rules can be executed only with their participation, and such a rule cannot be executed if they are absent. μ is a membrane structure containing m membranes; each membrane and the region it encloses are identified by a label from the set H = {1, 2, ..., m}, where m is called the degree of the membrane system. $\omega_i \in V^*$ (1 ≤ i ≤ m) denotes the multiset of objects contained in region i of the membrane structure μ, and $V^*$ is the set of all character objects that can be formed from the characters in V. $R_i$ (1 ≤ i ≤ m) is a finite set of evolution rules; each $R_i$ is associated with region i of the membrane structure μ, and $\rho_i$ is a partial order relation on $R_i$, called the priority relation, which specifies the priority with which the rules in $R_i$ are executed. An evolution rule in $R_i$ is a pair (u, v), usually written u → v. The characters in v may or may not belong to V; when executing a rule produces a character object that does not belong to V, the membrane is dissolved after the rule is executed. The length of u, that is, the number of character objects contained in u, is called the radius of the rule u → v;
Step 5: dissolution. After its particle swarm algorithm completes, each elementary membrane ruptures and releases the newly generated character individuals back into the skin membrane;
Step 6: calculate the front points and store them in the data storage module. The Pareto front points among all characters released into the skin membrane in step 5 are calculated, and these points are stored in the data storage module;
Step 7: perform non-dominated sorting and update the data storage module. Determine whether the number of characters in the data storage module exceeds the limit. If it does, recalculate the crowding distances of all characters in the archive and delete characters one by one, starting from the point with the smallest crowding distance, until the number of characters in the data storage module equals the preset value; the time complexity is O(D × 2R × log(2R));
Step 8: iterate. Determine whether the current state satisfies the loop termination condition. If not, continue from step 3; if so, output all characters in the external archive;
Initialisation and fitness evaluation are required before the Pareto front points are calculated. N characters are generated in the skin membrane, where N is the number of extracted radar emitter signal feature sets. Each character contains a D-dimensional variable, and the N characters are initialised in turn subject to the constraints of the multi-objective optimisation problem, using binary encoding. Each component of an individual x = {x_1, x_2, ..., x_D} takes a value in {0, 1}; a value of 1 means the corresponding feature is selected. At initialisation, the variance of all sample values on each feature is calculated, and the selection probability is then calculated according to the following formula: [formula missing in source];
$v_j$ is the variance of all sample values on the j-th feature; when P is greater than 0.5, the feature is likely to be selected.
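The pairwise search for Pareto front points in step 1 can be sketched as follows. This is an illustrative sketch only: it assumes that every objective is to be minimised, which the source does not specify, and the names `dominates` and `pareto_front` are hypothetical. The double loop over individuals is what gives the O(N²) cost quoted in step 1.

```python
def dominates(a, b):
    """True if a is no worse than b everywhere and strictly better somewhere.

    Assumption: all objectives are minimised.
    """
    no_worse = all(x <= y for x, y in zip(a, b))
    better = any(x < y for x, y in zip(a, b))
    return no_worse and better


def pareto_front(fitness):
    """Indices of non-dominated individuals, found by pairwise comparison."""
    front = []
    for i, fi in enumerate(fitness):
        dominated = any(
            dominates(fj, fi) for j, fj in enumerate(fitness) if j != i
        )
        if not dominated:
            front.append(i)
    return front


# Each tuple is a hypothetical (objective 1, objective 2) fitness pair.
fitness = [(1, 5), (2, 2), (3, 3), (5, 1), (2, 2)]
front_indices = pareto_front(fitness)
```

Individuals with identical fitness values do not dominate each other here, so both copies of (2, 2) stay on the front; whether duplicates should be collapsed is a design choice the source leaves open.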
Further, the search method of the search module is as follows:
First, a search term is received;
Then, multiple search results are generated according to the search term and provided, wherein each search result includes multiple attribute parameters and at least some of the attribute parameters have a corresponding entity identifier; the attribute parameters that have an entity identifier include the author's name and/or the place of publication;
Finally, when an attribute parameter in a search result is triggered, a new search result is generated according to the entity identifier corresponding to that attribute parameter and provided;
The analysis method of the big data analysis module is as follows:
Step 1: analyse and process the local paper data set, then construct a paper citation network in the database;
Step 2: create an analysis algorithm based on the citation relations in the paper citation network, use it to obtain the importance of the nodes in the network and the relations between them, and obtain the importance of each paper relative to the centre paper; the centre paper is the particular paper that the user queries by input;
Step 3: convert the one-to-one citation relations between papers into a mapping set in the citing direction and a mapping set in the cited direction, obtain the development paths between specified papers in the paper citation network, and calculate the importance of each path from the paper importance obtained in step 2.
Further, before the search term is received, the method also includes:
Obtaining multiple papers;
Extracting from each of the papers the corresponding author's name and the institution to which the author belongs;
If the author's name of a paper is unique, generating the entity identifier from the author's name; and
If the author's name of a paper is not unique, generating the entity identifier from the author's name and the institution to which the author belongs.
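The identifier rule above (name only when the name is unique, name plus institution otherwise) might be sketched as follows. The source does not say how the identifier itself is formed, so the truncated SHA-1 digest is purely an assumption, as are the function name `build_entity_ids` and the sample author data.

```python
import hashlib
from collections import Counter


def build_entity_ids(records):
    """Map each (author_name, institution) pair to an entity identifier.

    Assumption: the identifier is a truncated SHA-1 digest of the key text.
    """
    pairs = set(records)
    # Count how many distinct institutions share each author name.
    name_counts = Counter(name for name, _ in pairs)
    ids = {}
    for name, institution in pairs:
        if name_counts[name] == 1:
            key = name                      # unique name: name alone suffices
        else:
            key = f"{name}|{institution}"   # shared name: add the institution
        ids[(name, institution)] = hashlib.sha1(key.encode("utf-8")).hexdigest()[:12]
    return ids


# Hypothetical sample data.
records = [
    ("Wang Lei", "University A"),
    ("Wang Lei", "Institute B"),
    ("Li Na", "Institute B"),
    ("Li Na", "Institute B"),
]
entity_ids = build_entity_ids(records)
```

Two authors named "Wang Lei" at different institutions receive different identifiers, while "Li Na" is identified by name alone. Two same-named authors at the same institution would still collide under this rule, which the source does not address.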
Step 1 of the big data analysis method includes:
Step 1.1: using text processing and analysis techniques, extract the citation information from the local paper data; the citation information records which papers each paper in the collection cites;
Step 1.2: construct the paper citation network;
Step 1.3: compare the citation information in the resulting paper citation network and remove duplicate content, use database software to store it and build an index, and store the citation relations between papers in the database as key-value pairs.
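Steps 1.2 and 1.3 can be illustrated with a small sketch that stores citation pairs, removes duplicates and builds an index. The source names no particular database software; SQLite (via Python's standard `sqlite3` module) is used here only as a stand-in, and the table name `cites` and the helper names are hypothetical.

```python
import sqlite3


def build_citation_store(citations):
    """Store (citing_id, cited_id) pairs, dropping duplicates.

    Assumption: an in-memory SQLite database stands in for the
    unspecified database software.
    """
    conn = sqlite3.connect(":memory:")
    # The composite primary key rejects duplicate citation pairs.
    conn.execute(
        "CREATE TABLE cites (citing TEXT, cited TEXT, PRIMARY KEY (citing, cited))"
    )
    # Extra index so that lookups in the cited direction are also fast.
    conn.execute("CREATE INDEX idx_cited ON cites (cited)")
    conn.executemany("INSERT OR IGNORE INTO cites VALUES (?, ?)", list(citations))
    conn.commit()
    return conn


def references_of(conn, paper_id):
    """Papers cited by paper_id, in alphabetical order."""
    rows = conn.execute(
        "SELECT cited FROM cites WHERE citing = ? ORDER BY cited", (paper_id,)
    )
    return [row[0] for row in rows]


conn = build_citation_store([("A", "B"), ("A", "B"), ("A", "C"), ("B", "C")])
refs = references_of(conn, "A")
total = conn.execute("SELECT COUNT(*) FROM cites").fetchone()[0]
```

Each row acts as one key-value pair (citing paper, cited paper); a dedicated key-value store would serve equally well if that is what the text intends.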
Further, step 2 includes:
Step 2.1: calculate a score based on link structure analysis from the citation relations in the paper citation network;
Step 2.2: use a breadth-first search algorithm to traverse the subgraph that the user is interested in, and calculate for each paper the ratio of its longest path to its shortest path relative to the centre paper as the score based on citation level analysis, using the following formula:
Score based on citation level analysis = longest path / shortest path;
Step 2.3: select the corresponding weights and combine the score based on link structure analysis with the score based on citation level analysis to give the final importance of each other paper relative to the centre paper, using the following formula:
Importance = weight of the link structure score × score based on link structure analysis + weight of the citation level score × score based on citation level analysis;
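The two formulas above can be worked through with a short sketch. Several points are assumptions rather than statements from the source: the graph maps each paper to the papers it cites, paths run from a paper to the centre paper, the citation graph is acyclic (so a longest path is well defined), the link structure score is supplied by the caller, and both weights default to 0.5. All function names are hypothetical.

```python
from collections import deque
from functools import lru_cache


def shortest_path_len(graph, src, dst):
    """Breadth-first hop count from src to dst, or None if unreachable."""
    seen, queue = {src}, deque([(src, 0)])
    while queue:
        node, hops = queue.popleft()
        if node == dst:
            return hops
        for nxt in graph.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, hops + 1))
    return None


def longest_path_len(graph, src, dst):
    """Longest hop count from src to dst; assumes the graph is acyclic."""
    @lru_cache(maxsize=None)
    def walk(node):
        if node == dst:
            return 0
        best = None
        for nxt in graph.get(node, ()):
            sub = walk(nxt)
            if sub is not None and (best is None or sub + 1 > best):
                best = sub + 1
        return best
    return walk(src)


def importance(link_score, graph, paper, centre, w_link=0.5, w_level=0.5):
    """Weighted sum of the link structure score and the citation level score."""
    shortest = shortest_path_len(graph, paper, centre)
    if not shortest:  # unreachable, or paper is the centre itself
        level_score = 0.0
    else:
        level_score = longest_path_len(graph, paper, centre) / shortest
    return w_link * link_score + w_level * level_score


# Hypothetical graph: P cites A and C, A cites B, B cites C.
graph = {"P": ["A", "C"], "A": ["B"], "B": ["C"], "C": []}
score = importance(0.2, graph, "P", "C")
```

Here the shortest path from P to C has 1 hop and the longest has 3, so the citation level score is 3.0 and the combined importance is 0.5 × 0.2 + 0.5 × 3.0.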
Step 3 includes:
Step 3.1: convert the one-to-one citation relations between papers in the database into a mapping set in the citing direction and a mapping set in the cited direction;
Step 3.2: for a preliminary analysis of the citation relations between papers, use the dictionary data structure of the Python programming language;
Step 3.3: extract the information on the paths between two papers.
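Steps 3.1 to 3.3 can be sketched with Python dictionaries, as step 3.2 suggests. This is a minimal illustration: the function names `build_mappings` and `all_paths` are hypothetical, and enumerating every simple path by depth-first search is one assumed reading of "extract the information on the paths".

```python
from collections import defaultdict


def build_mappings(pairs):
    """Turn (citing, cited) pairs into citing-direction and cited-direction dicts."""
    cites = defaultdict(set)      # paper -> papers it cites
    cited_by = defaultdict(set)   # paper -> papers that cite it
    for citing, cited in pairs:
        cites[citing].add(cited)
        cited_by[cited].add(citing)
    return cites, cited_by


def all_paths(cites, start, end, path=None):
    """All simple citation paths from start to end, by depth-first search."""
    path = (path or []) + [start]
    if start == end:
        return [path]
    found = []
    for nxt in sorted(cites.get(start, ())):
        if nxt not in path:  # skip papers already on this path
            found.extend(all_paths(cites, nxt, end, path))
    return found


pairs = [("P", "A"), ("A", "C"), ("P", "C")]
cites, cited_by = build_mappings(pairs)
paths = all_paths(cites, "P", "C")
```

Swapping `cites` for `cited_by` traverses the network in the cited direction, which is why both mapping sets are kept. How paper importance is aggregated into a path importance is not specified in the source, so it is left out of this sketch.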
Another object of the present invention is to provide a computer program that implements the method for extracting unstructured data from power technology journal papers.
Another object of the present invention is to provide an information data processing terminal that implements the method for extracting unstructured data from power technology journal papers.
Another object of the present invention is to provide a computer-readable storage medium including instructions which, when run on a computer, cause the computer to execute the method for extracting unstructured data from power technology journal papers.
Another object of the present invention is to provide a system for extracting unstructured data from power technology journal papers that implements the method, including:
An input module, connected to the search module, for inputting the title and the file path of a paper;
A search module, connected to the input module and the big data analysis module, for searching for the key data information of the paper content;
A big data analysis module, connected to the search module and the extraction module, for analysing the paper content and related papers;
An extraction module, connected to the big data analysis module and the data storage module, for extracting core data information such as the paper's author information, abstract and keywords;
A data storage module, connected to the extraction module, for storing the extracted core data information such as author information, abstract and keywords.
Another object of the present invention is to provide a data extraction device equipped with the system for extracting unstructured data from power technology journal papers.
The advantages and positive effects of the present invention are:
Through the search module, the present invention first receives a search term and then generates and provides multiple search results according to it. Each search result includes multiple attribute parameters, and at least some of these have a corresponding entity identifier. When an attribute parameter in a search result is triggered, a new search result is generated according to the corresponding entity identifier and provided. Because an attribute parameter can be treated as an entity, converting the entity into an entity identifier and relying on the uniqueness of that identifier yields the corresponding search result. This thoroughly resolves problems such as entities sharing the same name and partial matching of long search terms, improves the accuracy of search results and improves data extraction efficiency. Meanwhile, the big data analysis module shows more effectively the contribution and relative importance of other papers to the paper queried by the user, so that staff can more easily start from one paper of interest and find other related papers. The big data analysis uses a distributed processing method, which helps improve the speed of academic big data analysis. When a paper query system is built using the present invention, the distributed processing method effectively reduces query response time and allows paper data elements to be analysed quickly, which improves data extraction speed.
Compared with the MOPSO, SPEA2 and PESA2 algorithms, the feature extraction method of the invention performs well in terms of accuracy, convergence speed and uniformity of the result distribution. It converges quickly, its approximate Pareto front points are evenly distributed, and it approaches the true Pareto front well. This demonstrates that the new algorithm is feasible and effective for solving multi-objective optimisation problems.
The important feature subsets extracted by the multi-objective feature selection of the invention cluster well at SNR = 4 dB and above: the signals are clearly separable, with sharp boundaries and no overlap, which simplifies the design of the sorter, improves the sorting recognition rate and benefits practical application. Finally, 100 independent tests were run on the signal feature subsets using the traditional FCM clustering algorithm; the average clustering accuracies obtained by the MPSO, NSGAII and SPEA2 algorithms were 99%, 82% and 78% respectively. This shows that the proposed algorithm has a high extraction rate.
Brief description of the drawings
Fig. 1 is a flow chart of the method for extracting unstructured data from power technology journal papers provided by an embodiment of the present invention.
Fig. 2 is a structural block diagram of the system for extracting unstructured data from power technology journal papers provided by an embodiment of the present invention.
In Fig. 2: 1, input module; 2, search module; 3, big data analysis module; 4, extraction module; 5, data storage module.
Detailed description of the embodiments
To make the objectives, technical solutions and advantages of the present invention clearer, the invention is described in further detail below with reference to embodiments. It should be understood that the specific embodiments described here only explain the present invention and are not intended to limit it.
The application principle of the invention is further described below with reference to the accompanying drawings and specific embodiments.
As shown in Fig. 1, the method for extracting unstructured data from power technology journal papers provided by the invention includes the following steps:
S101: input the title and the file path of the paper through the input module;
S102: search for the key data information of the paper content through the search module;
S103: analyse the paper content and related papers through the big data analysis module;
S104: extract core data information such as the paper's author information, abstract and keywords through the extraction module;
S105: store the extracted core data information such as author information, abstract and keywords through the data storage module.
As shown in Fig. 2, the system for extracting unstructured data from power technology journal papers provided by an embodiment of the present invention includes: an input module 1, a search module 2, a big data analysis module 3, an extraction module 4 and a data storage module 5.
The input module 1 is connected to the search module 2 and is used to input the title and the file path of a paper;
The search module 2 is connected to the input module 1 and the big data analysis module 3 and is used to search for the key data information of the paper content;
The big data analysis module 3 is connected to the search module 2 and the extraction module 4 and is used to analyse the paper content and related papers;
The extraction module 4 is connected to the big data analysis module 3 and the data storage module 5 and is used to extract core data information such as the paper's author information, abstract and keywords;
The data storage module 5 is connected to the extraction module 4 and is used to store the extracted core data information such as author information, abstract and keywords.
The invention is further described below with reference to a specific analysis.
The method for extracting unstructured data from power technology journal papers provided by an embodiment of the present invention includes:
The extraction method of the extraction module specifically includes:
Step 1 calculates the forward position Pareto point, calculates radar signal feature with redundancy objective function according to the degree of correlation The fitness of body, and the forward position the Pareto point in current character individual is found out, time complexity is O (N2);
Step 2, initialization data memory module, the forward position Pareto point quantity are less than default value R, then will directly own In point deposit data memory module;The forward position Pareto point quantity is greater than default value, according to formula (5) The crowding distance for calculating all forward position Pareto points is deleted one by one since the smallest point of crowding distance, until alternative deposit number It is equal with default value according to the forward position the Pareto point quantity of memory module;Then by these forward positions, point is stored in data memory module In;
In formula, n indicates the number of objective function, diIndicate the crowding distance in population of i-th of character object, Indicate the maximum value that m-th of objective function obtains in population,Indicate the minimum value that m-th of objective function obtains in population,WithIt is m-th target function value of i-th of character object in the dimension two sides m closest to point, wherein
Step 3 calls splitting rule to create underlying membrane, after completing preparation, starts division in the film of surface layer and generates M Underlying membrane;It is equal with the forward position the Pareto point quantity of data memory module to divide underlying membrane quantity M;Then these are achieved Optimum individual of the forward position the Pareto point as population in the underlying membrane;Finally, being put into remaining each individual apart from itself recently The forward position Pareto point where in underlying membrane, time complexity is O (N × R);
Step 4, independently executes particle swarm algorithm in underlying membrane, in each underlying membrane, to be stored in data memory module at first The interior forward position Pareto point is population optimum individual, formula (3) Xt+1=Xt+Vt+1With and formulaWith formula (4) Π=(V, T, C, μ, ω1,…,ωm,(R11), (Rmm)), calculate new individual speed and position.And fitness is recalculated according to newest position;Wherein, in formula (3), Vt, Vt+1It is the speed of t and the t+1 times flight respectively;Xt, Xt+1It is that particle is fallen in after t and the t+1 times flight respectively Position;In formula (4), V is alphabet, and included element is character object;It is to intracellular metabolic element, object Matter is abstracted;For output alphabet;For catalyst, these elements do not become during Cellular evolution Change, does not also generate new character;But must have its participation in certain evolutionary rules could execute, will if there is no rule It can not be performed;μ is the membrane structure comprising m film, and each film and its region enclosed are indicated with label set H, H=1,2 ..., M }, wherein m is known as the degree of the membranous system;ωi∈ V* (1≤i≤m) indicates to contain the more of object inside the region i in membrane structure μ Collect again, V* is the set for any character object that character forms in V;Ri(1≤i≤m) is the finite aggregate of evolutionary rule, each RiIt is, ρ associated with the region i in membrane structure uiIt is RiIn partial ordering relation, referred to as dominance relation indicates rule RiIt executes Dominance relation.RiEvolutionary rule be binary group (u, v), be generally written into u → v,In v character may belong to V can also To be not belonging to V, but when producing the character object for being not belonging to V after certain rule executes, it is dissolved to execute the rule caudacoria;U's The number of character object contained by length, that is, u is known as 
the radius of rule u → v;
Step 5: dissolution. After its particle swarm algorithm finishes, each basic membrane ruptures and releases the newly generated characters back into the surface membrane;
Step 6: calculate the front points and store them in the data storage module. The Pareto front points of all characters released into the surface membrane in Step 5 are calculated and stored in the data storage module;
Step 7: perform non-dominated sorting and update the data storage module. Check whether the number of characters in the data storage module exceeds the limit; if it does, recalculate the crowding distance of all characters in the archive and delete them one by one, starting from the point with the smallest crowding distance, until the number of characters in the data storage module equals the preset value. The time complexity is $O(D \times 2R \times \log(2R))$;
Step 8: iteration. Check whether the current state satisfies the termination condition. If it does not, continue from Step 3; if it does, output all characters in the external archive;
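The archive update of Steps 6 and 7 can be sketched in Python as follows. This is only an illustration under stated assumptions: every objective is minimized, the crowding distance takes its usual NSGA-II form, $d_i = \sum_m (f_m^{i+1} - f_m^{i-1}) / (f_m^{\max} - f_m^{\min})$, and boundary points receive an infinite distance. The function names are hypothetical and do not come from the source.

```python
from typing import List, Sequence

Point = Sequence[float]  # one objective vector per character (assumed: minimisation)


def dominates(a: Point, b: Point) -> bool:
    """True if a is no worse than b on every objective and strictly better on one."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def pareto_front(points: List[Point]) -> List[int]:
    """Indices of the non-dominated points (quadratic pairwise comparison)."""
    return [
        i for i, p in enumerate(points)
        if not any(dominates(q, p) for j, q in enumerate(points) if j != i)
    ]


def crowding_distances(points: List[Point]) -> List[float]:
    """Crowding distance of each point, summed over all objectives."""
    n = len(points)
    dist = [0.0] * n
    if n == 0:
        return dist
    for m in range(len(points[0])):
        order = sorted(range(n), key=lambda i: points[i][m])
        low, high = points[order[0]][m], points[order[-1]][m]
        # Assumption: boundary points are always kept, so they get infinity.
        dist[order[0]] = dist[order[-1]] = float("inf")
        if high == low:
            continue
        for k in range(1, n - 1):
            gap = points[order[k + 1]][m] - points[order[k - 1]][m]
            dist[order[k]] += gap / (high - low)
    return dist


def truncate_archive(points: List[Point], limit: int) -> List[Point]:
    """Delete the most crowded point repeatedly until at most `limit` remain."""
    archive = list(points)
    while len(archive) > limit:
        dist = crowding_distances(archive)
        archive.pop(min(range(len(archive)), key=dist.__getitem__))
    return archive
```

The distances are recomputed after every deletion, which matches the "delete one by one" wording of Step 7; a cheaper variant would compute them once and sort, at the cost of less even spacing.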
Initialization and fitness evaluation must be carried out before the Pareto front points are calculated. N characters are generated in the surface membrane, where N is the number of extracted radar emitter signal feature sets, and each character contains a D-dimensional variable. Provided the constraints of the multi-objective optimization problem are satisfied, the N characters are initialized in turn using binary coding. Each component of an individual $x = \{x_1, x_2, \ldots, x_D\}$ takes a value in {0, 1}; a value of 1 means that the corresponding feature is selected. During initialization, the variance of all sample values on each feature is calculated, and the selection probability is then calculated according to the following formula;
$v_j$ denotes the variance of all sample values on the $j$-th feature; when $P$ is greater than 0.5, the feature is likely to be selected.
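For Step 4, the text gives only the position update of formula (3). The sketch below pairs it with the standard inertia-weight velocity rule, which is an assumption and not taken from the source; the parameter names `w`, `c1`, `c2` and the function name `pso_step` are likewise illustrative. It works on real-valued positions and leaves out any mapping back to the binary coding described above.

```python
import random
from typing import List, Tuple


def pso_step(
    x: List[float],
    v: List[float],
    personal_best: List[float],
    swarm_best: List[float],
    w: float = 0.7,
    c1: float = 1.5,
    c2: float = 1.5,
    rng: random.Random = random.Random(0),
) -> Tuple[List[float], List[float]]:
    """One flight of a single particle.

    Velocity (assumed standard form):
        V_{t+1} = w*V_t + c1*r1*(pbest - X_t) + c2*r2*(gbest - X_t)
    Position (formula (3) of the text):
        X_{t+1} = X_t + V_{t+1}
    """
    new_x, new_v = [], []
    for xi, vi, pi, gi in zip(x, v, personal_best, swarm_best):
        r1, r2 = rng.random(), rng.random()
        vi_next = w * vi + c1 * r1 * (pi - xi) + c2 * r2 * (gi - xi)
        new_v.append(vi_next)
        new_x.append(xi + vi_next)
    return new_x, new_v
```

In the scheme described here, `swarm_best` would be the Pareto front point assigned to the particle's basic membrane, so each membrane runs its own swarm around a different archived point.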
The search method of the search module is as follows:
First, a search term is received;
Then, multiple search results are generated from the search term and provided. Each search result includes multiple attribute parameters, at least some of which have a corresponding entity identifier; the attribute parameters that have entity identifiers include the author's name and/or the place of publication;
Finally, when an attribute parameter in a search result is triggered, new search results are generated from the entity identifier corresponding to that attribute parameter and provided;
The analysis method of the big data analysis module is as follows:
Step 1: analyze and process the local paper data set, then build the paper citation network in the database;
Step 2: create an analysis algorithm from the citation relationships in the paper citation network, use it to obtain the importance of the nodes in the network and the relationships between them, and obtain the importance of each paper relative to the center paper. The center paper is the particular paper that the user queries by input;
Step 3: convert the one-to-one citation relationships between papers into a mapping set in the citing direction and a mapping set in the cited direction, obtain the development path between specified papers in the paper citation network, and calculate the importance of the path from the paper importance obtained in Step 2.
Before the search term is received, the method further includes:
Obtaining multiple papers;
Extracting, from each of the papers, the name of each author and the institution to which the author belongs;
If the author name corresponding to a paper is unique, generating the entity identifier from the author name; and
If the author name corresponding to a paper is not unique, generating the entity identifier from the author name and the institution to which the author belongs.
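The identifier rule above, together with the "triggered attribute" lookup of the search method, can be sketched as follows. The sketch assumes that a name counts as "unique" when it occurs with only one institution in the collection, and that the combined identifier is written `name@institution`; both choices, the record layout, and the sample names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Author = Tuple[str, str]  # (author name, institution)


def build_entity_ids(papers: List[dict]) -> Dict[Author, str]:
    """Map every (name, institution) pair to an entity identifier."""
    institutions_by_name = defaultdict(set)
    for paper in papers:
        for name, institution in paper["authors"]:
            institutions_by_name[name].add(institution)

    ids: Dict[Author, str] = {}
    for name, institutions in institutions_by_name.items():
        for institution in institutions:
            if len(institutions) == 1:
                ids[(name, institution)] = name  # unique name: name alone
            else:
                ids[(name, institution)] = f"{name}@{institution}"  # disambiguate
    return ids


def index_by_entity(papers: List[dict], ids: Dict[Author, str]) -> Dict[str, List[str]]:
    """Entity identifier -> titles, used when an author attribute is triggered."""
    index = defaultdict(list)
    for paper in papers:
        for author in paper["authors"]:
            index[ids[author]].append(paper["title"])
    return dict(index)


# Made-up sample data for illustration only.
papers = [
    {"title": "A", "authors": [("Li Wei", "Univ X")]},
    {"title": "B", "authors": [("Li Wei", "Univ Y"), ("Zhang San", "Univ X")]},
]
ids = build_entity_ids(papers)
index = index_by_entity(papers, ids)
```

Clicking the author of paper "A" would then look up `index["Li Wei@Univ X"]` and return only that author's papers, not those of the namesake at another institution.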
Step 1 includes:
Step 1.1: use text processing and analysis techniques to extract the citation information from the local paper data; the citation information records which papers each paper in the collection cites;
Step 1.2: build the paper citation network;
Step 1.3: compare the citation information in the resulting paper citation network and remove duplicates, store it with database software and build an index, and store the citation relationships between papers in the database as key-value pairs.
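Steps 1.2 and 1.3 can be illustrated with an in-memory stand-in for the database. The sketch assumes that Step 1.1 has already produced `(citing, cited)` pairs; the function name and the paper IDs are hypothetical, and a plain Python dictionary stands in for whatever key-value store is actually used.

```python
from typing import Dict, Iterable, List, Tuple


def build_citation_network(citations: Iterable[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Return {paper: [papers it cites]} with duplicate citation pairs removed."""
    network: Dict[str, List[str]] = {}
    seen = set()
    for citing, cited in citations:
        # Every paper becomes a key, even if it cites nothing itself.
        network.setdefault(citing, [])
        network.setdefault(cited, [])
        if (citing, cited) in seen:
            continue  # Step 1.3: drop repeated citation records
        seen.add((citing, cited))
        network[citing].append(cited)
    return network


raw = [("P1", "P2"), ("P1", "P3"), ("P1", "P2"), ("P3", "P2")]
network = build_citation_network(raw)
```

Each key-value pair here corresponds to one record of the kind Step 1.3 stores, with the citing paper as the key and its reference list as the value.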
Further, Step 2 includes:
Step 2.1: calculate a score based on link-structure analysis from the citation relationships in the paper citation network;
Step 2.2: traverse the subgraph the user is interested in using breadth-first search, and calculate for each paper the ratio of its longest path to its shortest path relative to the center paper, as the score based on citation-level analysis. The formula is:
Score based on citation-level analysis = longest path / shortest path;
Step 2.3: select appropriate weights and combine the score based on link-structure analysis with the score based on citation-level analysis to give the final importance of each other paper relative to the center paper. The formula is:
Importance = weight of link-structure analysis × score based on link-structure analysis + weight of citation-level analysis × score based on citation-level analysis;
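The two formulas of Step 2 can be sketched as below. Several points are assumptions: the citation graph is taken to be acyclic so that a longest path is well defined, paths are followed from the center paper along its citations, and the link-structure score of Step 2.1 is passed in as a precomputed dictionary because the text does not say how it is obtained. All names are illustrative.

```python
from collections import deque
from typing import Dict, List

Graph = Dict[str, List[str]]  # paper -> papers it cites


def shortest_path_lengths(graph: Graph, center: str) -> Dict[str, int]:
    """Breadth-first search from the center paper."""
    dist = {center: 0}
    queue = deque([center])
    while queue:
        node = queue.popleft()
        for nxt in graph.get(node, []):
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    return dist


def longest_path_lengths(graph: Graph, center: str) -> Dict[str, int]:
    """Longest path from the center to each reachable paper (assumes no cycles)."""
    order, visited = [], set()

    def visit(node: str) -> None:
        visited.add(node)
        for nxt in graph.get(node, []):
            if nxt not in visited:
                visit(nxt)
        order.append(node)  # post-order; reversed, it is a topological order

    visit(center)
    longest = {center: 0}
    for node in reversed(order):
        for nxt in graph.get(node, []):
            longest[nxt] = max(longest.get(nxt, 0), longest[node] + 1)
    return longest


def importance(graph: Graph, center: str, link_scores: Dict[str, float],
               w_link: float = 0.5, w_level: float = 0.5) -> Dict[str, float]:
    """Importance = w_link * link score + w_level * (longest path / shortest path)."""
    shortest = shortest_path_lengths(graph, center)
    longest = longest_path_lengths(graph, center)
    return {
        paper: w_link * link_scores.get(paper, 0.0) + w_level * longest[paper] / s
        for paper, s in shortest.items() if paper != center
    }


graph = {"C": ["A", "B"], "A": ["B"], "B": []}
scores = importance(graph, "C", {"A": 0.2, "B": 0.4})
```

Paper "B" is reachable both directly and through "A", so its ratio is 2 / 1, while "A" has a ratio of 1; a paper reached by many layered citation chains therefore scores higher on the citation-level term.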
Step 3 includes:
Step 3.1: convert the one-to-one citation relationships between papers in the database into a mapping set in the citing direction and a mapping set in the cited direction;
Step 3.2: for a preliminary analysis of the citation relationships, use the dictionary data structure of the Python programming language;
Step 3.3: extract the path information between two papers.
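Step 3 can be sketched with two Python dictionaries, one per direction, and a breadth-first search for the path. How the importance of a path is aggregated from the paper importances is not specified in the text; summing them is an assumption made here, and the function names and paper IDs are hypothetical.

```python
from collections import deque
from typing import Dict, Iterable, List, Optional, Set, Tuple


def build_mappings(pairs: Iterable[Tuple[str, str]]):
    """Step 3.1: citing-direction and cited-direction mapping sets."""
    cites: Dict[str, Set[str]] = {}
    cited_by: Dict[str, Set[str]] = {}
    for citing, cited in pairs:
        cites.setdefault(citing, set()).add(cited)
        cited_by.setdefault(cited, set()).add(citing)
    return cites, cited_by


def find_path(cites: Dict[str, Set[str]], start: str, goal: str) -> Optional[List[str]]:
    """Step 3.3: one shortest citation path from start to goal, or None."""
    parent = {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            path = []
            while node is not None:
                path.append(node)
                node = parent[node]
            return path[::-1]
        for nxt in sorted(cites.get(node, ())):  # sorted for a repeatable result
            if nxt not in parent:
                parent[nxt] = node
                queue.append(nxt)
    return None


def path_importance(path: List[str], paper_importance: Dict[str, float]) -> float:
    """Assumed aggregation: sum of the importances of the papers on the path."""
    return sum(paper_importance.get(p, 0.0) for p in path)


cites, cited_by = build_mappings([("P1", "P2"), ("P2", "P3"), ("P1", "P4")])
path = find_path(cites, "P1", "P3")
```

Keeping the cited-direction dictionary alongside the citing-direction one lets the same search run backwards, from an older paper toward the later work that builds on it, by passing `cited_by` instead of `cites`.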
The above embodiments may be implemented wholly or partly in software, hardware, firmware, or any combination thereof. When implemented wholly or partly as a computer program product, the computer program product includes one or more computer instructions. When the computer program instructions are loaded or executed on a computer, the processes or functions described in the embodiments of the present invention are produced in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable device. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another; for example, they may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center by wired means (such as coaxial cable, optical fiber, or digital subscriber line (DSL)) or wireless means (such as infrared, radio, or microwave). The computer-readable storage medium may be any available medium that a computer can access, or a data storage device such as a server or data center that integrates one or more available media. The available medium may be a magnetic medium (for example, a floppy disk, hard disk, or magnetic tape), an optical medium (for example, a DVD), or a semiconductor medium (for example, a solid state disk (SSD)).
The foregoing describes only preferred embodiments of the present invention and is not intended to limit it; any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within its scope of protection.

Claims (10)

1. A method for extracting unstructured data from power technology journal papers, characterized in that the method comprises:
Searching for key data information on paper content by means of a search module;
Analyzing and processing the local paper data set by means of a big data analysis module, then building a paper citation network in the database and analyzing the paper content and related papers;
Initializing a data storage module by means of an extraction module: if the number of Pareto front points is less than the preset value R, all points are stored directly in the data storage module; if the number of Pareto front points is greater than the preset value, the crowding distance of all Pareto front points is calculated according to formula (5) and points are deleted one by one, starting from the point with the smallest crowding distance, until the number of Pareto front points to be stored in the data storage module equals the preset value; these front points are then stored in the data storage module; the author information, abstract, and keyword core data of the paper are then extracted;
In the formula, $n$ is the number of objective functions and $d_i$ is the crowding distance of the $i$-th character object in the population; the remaining terms are the maximum and the minimum value attained by the $m$-th objective function in the population, and the $m$-th objective function values of the points nearest to the $i$-th character object on either side in dimension $m$;
Storing the extracted core data (author information, abstract, and keywords) by means of the data storage module.
2. The method for extracting unstructured data from power technology journal papers according to claim 1, characterized in that the extraction method of the extraction module specifically comprises:
Step 1: calculate the Pareto front points. The fitness of each radar signal feature individual is calculated from the correlation and redundancy objective functions, and the Pareto front points among the current characters are found; the time complexity is $O(N^2)$;
Step 2: initialize the data storage module. If the number of Pareto front points is less than the preset value R, all points are stored directly in the data storage module; if the number of Pareto front points is greater than the preset value, the crowding distance of all Pareto front points is calculated according to formula (5) and points are deleted one by one, starting from the point with the smallest crowding distance, until the number of Pareto front points to be stored in the data storage module equals the preset value; these front points are then stored in the data storage module;
In the formula, $n$ is the number of objective functions and $d_i$ is the crowding distance of the $i$-th character object in the population; the remaining terms are the maximum and the minimum value attained by the $m$-th objective function in the population, and the $m$-th objective function values of the points nearest to the $i$-th character object on either side in dimension $m$;
Step 3: apply the division rule to create basic membranes. After the preparation is complete, the surface membrane starts dividing to produce M basic membranes, where the number M of basic membranes equals the number of Pareto front points in the data storage module. Each archived Pareto front point then serves as the best individual of the swarm in its basic membrane. Finally, each remaining individual is placed in the basic membrane of the Pareto front point nearest to it; the time complexity is $O(N \times R)$;
Step 4: run the particle swarm algorithm independently in each basic membrane. In each basic membrane, the Pareto front point that was stored in the data storage module first is taken as the best individual of the swarm. The new velocity and position of each individual are calculated using formula (3), $X_{t+1} = X_t + V_{t+1}$, the accompanying velocity formula, and formula (4), $\Pi = (V, T, C, \mu, \omega_1, \ldots, \omega_m, (R_1, \rho_1), \ldots, (R_m, \rho_m))$, and the fitness is then recalculated at the new position. In formula (3), $V_t$ and $V_{t+1}$ are the velocities of the $t$-th and $(t+1)$-th flights, and $X_t$ and $X_{t+1}$ are the positions at which the particle lands after the $t$-th and $(t+1)$-th flights. In formula (4), $V$ is the alphabet, whose elements are character objects; these abstract the metabolic elements and substances inside a cell. $T$ is the output alphabet. $C$ is the set of catalysts: these elements do not change during the evolution of the cell and do not generate new characters, but certain evolution rules can execute only when they take part, and without them those rules cannot be executed. $\mu$ is the membrane structure containing $m$ membranes; each membrane and the region it encloses are identified by the label set $H = \{1, 2, \ldots, m\}$, where $m$ is called the degree of the membrane system. $\omega_i \in V^*$ ($1 \le i \le m$) is the multiset of objects contained in region $i$ of the membrane structure $\mu$, where $V^*$ is the set of all character objects that can be formed from the characters in $V$. $R_i$ ($1 \le i \le m$) is a finite set of evolution rules, each $R_i$ being associated with region $i$ of $\mu$, and $\rho_i$ is a partial order on $R_i$, called the priority relation, which determines the order in which the rules of $R_i$ are executed. An evolution rule in $R_i$ is a pair $(u, v)$, usually written $u \to v$. The characters in $v$ may or may not belong to $V$; when executing a rule produces a character object that does not belong to $V$, the membrane dissolves after the rule has executed. The length of $u$, that is, the number of character objects it contains, is called the radius of the rule $u \to v$;
Step 5: dissolution. After its particle swarm algorithm finishes, each basic membrane ruptures and releases the newly generated characters back into the surface membrane;
Step 6: calculate the front points and store them in the data storage module. The Pareto front points of all characters released into the surface membrane in Step 5 are calculated and stored in the data storage module;
Step 7: perform non-dominated sorting and update the data storage module. Check whether the number of characters in the data storage module exceeds the limit; if it does, recalculate the crowding distance of all characters in the archive and delete them one by one, starting from the point with the smallest crowding distance, until the number of characters in the data storage module equals the preset value. The time complexity is $O(D \times 2R \times \log(2R))$;
Step 8: iteration. Check whether the current state satisfies the termination condition. If it does not, continue from Step 3; if it does, output all characters in the external archive;
Initialization and fitness evaluation must be carried out before the Pareto front points are calculated. N characters are generated in the surface membrane, where N is the number of extracted radar emitter signal feature sets, and each character contains a D-dimensional variable. Provided the constraints of the multi-objective optimization problem are satisfied, the N characters are initialized in turn using binary coding. Each component of an individual $x = \{x_1, x_2, \ldots, x_D\}$ takes a value in {0, 1}; a value of 1 means that the corresponding feature is selected. During initialization, the variance of all sample values on each feature is calculated, and the selection probability is then calculated according to the following formula;
$v_j$ denotes the variance of all sample values on the $j$-th feature; when $P$ is greater than 0.5, the feature is likely to be selected.
3. The method for extracting unstructured data from power technology journal papers according to claim 1, characterized in that the search method of the search module is as follows:
First, a search term is received;
Then, multiple search results are generated from the search term and provided. Each search result includes multiple attribute parameters, at least some of which have a corresponding entity identifier; the attribute parameters that have entity identifiers include the author's name and/or the place of publication;
Finally, when an attribute parameter in a search result is triggered, new search results are generated from the entity identifier corresponding to that attribute parameter and provided;
The analysis method of the big data analysis module is as follows:
Step 1: analyze and process the local paper data set, then build the paper citation network in the database;
Step 2: create an analysis algorithm from the citation relationships in the paper citation network, use it to obtain the importance of the nodes in the network and the relationships between them, and obtain the importance of each paper relative to the center paper. The center paper is the particular paper that the user queries by input;
Step 3: convert the one-to-one citation relationships between papers into a mapping set in the citing direction and a mapping set in the cited direction, obtain the development path between specified papers in the paper citation network, and calculate the importance of the path from the paper importance obtained in Step 2.
4. The method for extracting unstructured data from power technology journal papers according to claim 3, characterized in that,
Before the search term is received, the method further includes:
Obtaining multiple papers;
Extracting, from each of the papers, the name of each author and the institution to which the author belongs;
If the author name corresponding to a paper is unique, generating the entity identifier from the author name; and
If the author name corresponding to a paper is not unique, generating the entity identifier from the author name and the institution to which the author belongs.
Step 1 includes:
Step 1.1: use text processing and analysis techniques to extract the citation information from the local paper data; the citation information records which papers each paper in the collection cites;
Step 1.2: build the paper citation network;
Step 1.3: compare the citation information in the resulting paper citation network and remove duplicates, store it with database software and build an index, and store the citation relationships between papers in the database as key-value pairs.
5. The method for extracting unstructured data from power technology journal papers according to claim 3, characterized in that,
Step 2 includes:
Step 2.1: calculate a score based on link-structure analysis from the citation relationships in the paper citation network;
Step 2.2: traverse the subgraph the user is interested in using breadth-first search, and calculate for each paper the ratio of its longest path to its shortest path relative to the center paper, as the score based on citation-level analysis. The formula is:
Score based on citation-level analysis = longest path / shortest path;
Step 2.3: select appropriate weights and combine the score based on link-structure analysis with the score based on citation-level analysis to give the final importance of each other paper relative to the center paper. The formula is:
Importance = weight of link-structure analysis × score based on link-structure analysis + weight of citation-level analysis × score based on citation-level analysis;
Step 3 includes:
Step 3.1: convert the one-to-one citation relationships between papers in the database into a mapping set in the citing direction and a mapping set in the cited direction;
Step 3.2: for a preliminary analysis of the citation relationships, use the dictionary data structure of the Python programming language;
Step 3.3: extract the path information between two papers.
6. A computer program implementing the method for extracting unstructured data from power technology journal papers according to any one of claims 1 to 5.
7. An information data processing terminal implementing the method for extracting unstructured data from power technology journal papers according to any one of claims 1 to 5.
8. A computer-readable storage medium comprising instructions which, when run on a computer, cause the computer to perform the method for extracting unstructured data from power technology journal papers according to any one of claims 1 to 5.
9. A system for extracting unstructured data from power technology journal papers that implements the method according to claim 1, characterized in that the system comprises:
An input module, connected to the search module, for inputting the title and the path of a paper;
A search module, connected to the input module and the big data analysis module, for searching for key data information on paper content;
A big data analysis module, connected to the search module and the extraction module, for analyzing paper content and related papers;
An extraction module, connected to the big data analysis module and the data storage module, for extracting core data such as the author information, abstract, and keywords of a paper;
A data storage module, connected to the extraction module, for storing the extracted core data such as author information, abstract, and keywords.
10. A data extraction device equipped with the system for extracting unstructured data from power technology journal papers according to claim 9.
CN201810600133.0A 2018-06-12 2018-06-12 A kind of method and system extracted based on power technology journal article unstructured data Pending CN108874990A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810600133.0A CN108874990A (en) 2018-06-12 2018-06-12 A kind of method and system extracted based on power technology journal article unstructured data

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810600133.0A CN108874990A (en) 2018-06-12 2018-06-12 A kind of method and system extracted based on power technology journal article unstructured data

Publications (1)

Publication Number Publication Date
CN108874990A true CN108874990A (en) 2018-11-23

Family

ID=64337997

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810600133.0A Pending CN108874990A (en) 2018-06-12 2018-06-12 A kind of method and system extracted based on power technology journal article unstructured data

Country Status (1)

Country Link
CN (1) CN108874990A (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109557529A (en) * 2018-11-28 2019-04-02 中国人民解放军国防科技大学 Radar target detection method based on generalized Pareto distribution clutter statistical modeling
CN110188928A (en) * 2019-05-15 2019-08-30 张会丽 A kind of the formative optimization system and method for cloud data education training process
CN110245208A (en) * 2019-04-30 2019-09-17 广东省智能制造研究所 A kind of retrieval analysis method, apparatus and medium based on big data storage
CN111177323A (en) * 2019-12-31 2020-05-19 国网安徽省电力有限公司安庆供电公司 Power failure plan unstructured data extraction and identification method based on artificial intelligence
CN112580305A (en) * 2019-09-11 2021-03-30 陈涛 Method for providing writing guide for writing and word processing equipment

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20100100562A1 (en) * 2008-10-01 2010-04-22 Jerry Millsap Fully Parameterized Structured Query Language
CN103279506A (en) * 2013-05-15 2013-09-04 云南电力试验研究院(集团)有限公司电力研究院 Method for extracting journal paper unstructured data based on electric power technology
CN104239570A (en) * 2014-09-30 2014-12-24 百度在线网络技术(北京)有限公司 Method and device for searching for paper
CN104376038A (en) * 2014-09-12 2015-02-25 中国人民解放军信息工程大学 Position associated text information visualization method based on label cloud
CN105808729A (en) * 2016-03-08 2016-07-27 上海交通大学 Academic big data analysis method based on reference relationship among pieces of thesis
CN107590436A (en) * 2017-08-10 2018-01-16 云南财经大学 Radar emitter signal feature selection approach based on peplomer subgroup multi-objective Algorithm

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109557529A (en) * 2018-11-28 2019-04-02 中国人民解放军国防科技大学 Radar target detection method based on generalized Pareto distribution clutter statistical modeling
CN109557529B (en) * 2018-11-28 2019-08-30 中国人民解放军国防科技大学 Radar target detection method based on generalized Pareto distribution clutter statistical modeling
CN110245208A (en) * 2019-04-30 2019-09-17 广东省智能制造研究所 A kind of retrieval analysis method, apparatus and medium based on big data storage
CN110188928A (en) * 2019-05-15 2019-08-30 张会丽 A kind of the formative optimization system and method for cloud data education training process
CN112580305A (en) * 2019-09-11 2021-03-30 陈涛 Method for providing writing guide for writing and word processing equipment
CN111177323A (en) * 2019-12-31 2020-05-19 国网安徽省电力有限公司安庆供电公司 Power failure plan unstructured data extraction and identification method based on artificial intelligence
CN111177323B (en) * 2019-12-31 2022-04-01 国网安徽省电力有限公司安庆供电公司 Power failure plan unstructured data extraction and identification method based on artificial intelligence

Similar Documents

Publication Publication Date Title
CN108874990A (en) A kind of method and system extracted based on power technology journal article unstructured data
Liu et al. Text features extraction based on TF-IDF associating semantic
US10459971B2 (en) Method and apparatus of generating image characteristic representation of query, and image search method and apparatus
CN111104511B (en) Method, device and storage medium for extracting hot topics
WO2022156328A1 (en) Restful-type web service clustering method fusing service cooperation relationships
Ren et al. Heterogeneous graph-based intent learning with queries, web pages and wikipedia concepts
Chatterjee et al. Single document extractive text summarization using genetic algorithms
CN109033314A (en) The Query method in real time and system of extensive knowledge mapping in the case of memory-limited
Bounabi et al. A comparison of text classification methods using different stemming techniques
CN110309234A (en) A kind of client of knowledge based map holds position method for early warning, device and storage medium
Tang et al. Efficient Processing of Hamming-Distance-Based Similarity-Search Queries Over MapReduce.
Sutanto et al. Fine-grained document clustering via ranking and its application to social media analytics
Zaw et al. Web document clustering by using PSO-based cuckoo search clustering algorithm
CN109739984A (en) A kind of parallel KNN network public-opinion sorting algorithm of improvement based on Hadoop platform
Yin et al. Sentence-bert and k-means based clustering technology for scientific and technical literature
Li et al. Extracting core questions in community question answering based on particle swarm optimization
Anupama et al. A novel approach using incremental oversampling for data stream mining
Weston et al. Latent structured ranking
CN106971011A (en) A kind of big data analysis method based on cloud platform
CN113157915A (en) Naive Bayes text classification method based on cluster environment
Rouba et al. Weighted clustering ensemble: Towards learning the weights of the base clusterings
CN108090182B (en) A kind of distributed index method and system of extensive high dimensional data
Lu et al. Influence model of paper citation networks with integrated pagerank and HITS
CN117725555B (en) Multi-source knowledge tree association fusion method and device, electronic equipment and storage medium
Hasanpour et al. Optimal selection of ensemble classifiers using particle swarm optimization and diversity measures

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (application publication date: 20181123)