CN109815315A - Method for comprehensively analyzing polluted plot information based on literature - Google Patents
- Publication number
- CN109815315A (application number CN201910084134.9A)
- Authority
- CN
- China
- Prior art keywords
- document
- polluted plot
- analysis method
- comprehensive analysis
- contents
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Landscapes
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention provides a literature-based method for comprehensively analyzing polluted plot information. The method comprises the following steps: retrieving relevant literature from open platforms using characteristic keywords; reading the digitized content of the retrieved literature and dividing it into multiple content fragments; extracting structured data from the original content of each fragment using the analysis system; and querying the extracted structured polluted plot information with a database description language. For the massive body of literature in the field of soil environment supervision, the method uses computer assistance to extract structured information on the pollution status of polluted plots and stores it in a dedicated database, making the process efficient and accurate.
Description
Technical field
The invention belongs to the field of soil pollution supervision, and in particular relates to a literature-based method for comprehensively analyzing polluted plot information.
Background art
In the field of soil environment supervision, researchers and technicians often need to obtain information such as the soil pollution status of a given plot quickly and accurately from a massive body of written material, so that it can be analyzed in depth or used for further knowledge discovery. Obtaining structured descriptions from literature that is large in volume and complex in content is therefore an important problem facing researchers.
In existing practice, extracting the relevant research content from soil investigation literature relies mainly on reading the literature manually and then annotating and extracting the content. This approach requires many experienced sorters with specialist knowledge to do a large amount of work. It is costly, inefficient, and error-prone.
Summary of the invention
In view of the above problems, the object of the present invention is to provide a literature-based method for comprehensively analyzing polluted plot information. By using computer assistance, standardizing the scope and content of literature identification, and supplementing this with secondary verification, the method addresses the above problems well.
The object of the present invention is achieved through the following technical solutions:
A literature-based system for comprehensively analyzing polluted plot information comprises a literature basic information module, a plot basic information module, a research object information module, and a pollutant monitoring and evaluation data module.
The literature basic information module obtains literature-related information, including the literature title, source journal, and publication time.
The plot basic information module determines information about the target polluted plot, including geographic location, production information, polluted area, and planned land use.
The research object information module obtains information about the target plot, including control time and detection method.
The pollutant monitoring and evaluation data module obtains descriptive information on the pollution status of pollutants in the target polluted plot.
A literature-based method for comprehensively analyzing polluted plot information comprises the following steps:
S1: Retrieve relevant literature from open platforms using characteristic keywords, manage and classify the basic information of the retrieved literature with literature management software, and establish a research literature database.
The characteristic keywords are selected at multiple macro and/or micro levels according to the research direction. They are related Chinese and English keywords covering site, pollution medium, industry type, organic pollutant category, and pollutant.
S2: Read the digitized content of the retrieved investigation-type literature and preprocess all natural paragraphs in it. Using structure-item feature words, determine what each natural paragraph describes and divide the content into multiple content fragments, including plot basic information, research object basic information, and pollutant monitoring and evaluation information. A feature word is a phrase used in the literature to describe the content of the corresponding section.
S3: Extract structured data from the original content of each content fragment obtained by the division.
S4: Query the extracted structured polluted plot information with a database description language.
Further, the literature basic information in S1 includes the title, journal name, authors, abstract, keywords, publication time, and affiliation.
Further, the pollution medium in S1 includes the soil medium and the groundwater medium of the affected plot.
Further, the structure-item feature words included in the plot basic information, research object basic information, and pollutant monitoring and evaluation information in S2 are shown in the following table:
Further, extracting structured data from each content fragment in S3 specifically comprises:
S31: converting the original content of each content fragment into a predefined standard format;
S32: organizing the converted standard-format content into structured data;
S33: verifying the data by means including manual spot checks and machine learning.
Further, the predefined standard format in S31 includes unified geographic locations or coordinates, unified units of measurement, and unified pollutant names.
Compared with the prior art, the present invention has the following beneficial effects:
For the massive body of literature in the field of soil environment supervision, the literature-based method for comprehensively analyzing polluted plot information uses computer assistance to extract structured information on the pollution status of polluted plots and stores it in a dedicated database, making the process efficient and accurate.
The present invention is described in further detail below with reference to the drawing and an embodiment.
Brief description of the drawings
Fig. 1 is a flow chart of the literature-based method for comprehensively analyzing polluted plot information according to the present invention.
Detailed description of the embodiments
The invention is further described below with reference to a specific embodiment, but the scope of protection of the present invention is not limited thereto.
Referring to Fig. 1, this embodiment proposes a method for obtaining structured information on organochlorine-polluted construction land plots from literature, described further as follows:
S1. Retrieve research literature related to polluted plots from databases including Web of Science and CNKI using specific keywords; manage and classify the basic information of the retrieved literature (title, journal name, authors, abstract, keywords, publication time, and affiliation) with literature management software; and establish a research literature database.
The characteristic keywords are selected at multiple macro and/or micro levels according to the research direction. They are related Chinese and English keywords covering site, pollution medium (the soil medium and groundwater medium of the affected plot), industry type, organic pollutant category, and pollutant.
Taking organic pollution of construction land as an example, the specific keywords include:
Macro level 1: site
Macro level 2: pollution medium (soil and groundwater)
2.1 Soil
2.2 Groundwater
Macro level 3: industry classification (11 industries)
Macro level 4: organic pollutants
4.1 General categories (broad classes)
4.2 Volatile organic pollutants (specific)
4.3 Semi-volatile organic pollutants (specific)
4.4 Polychlorinated biphenyls (specific)
4.5 Polycyclic aromatic hydrocarbons (specific)
4.6 Benzene homologues (specific)
4.7 Total petroleum hydrocarbons (specific)
4.8 Persistent organic pollutants (specific)
Macro level 5: pollutant-related descriptions
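The keyword levels listed above can be combined into boolean search strings. The following is a minimal sketch in Python, assuming a generic AND/OR query syntax rather than the syntax of any particular literature database. The function name, group names, and example terms are illustrative placeholders, not the keyword list used in this embodiment.

```python
from typing import Dict, List


def build_query(groups: Dict[str, List[str]]) -> str:
    """Join terms within one level with OR and join the levels with AND."""
    clauses = []
    for terms in groups.values():
        if not terms:
            continue  # skip levels for which no keyword was chosen
        clauses.append("(" + " OR ".join(f'"{t}"' for t in terms) + ")")
    return " AND ".join(clauses)


# Hypothetical example terms for three of the levels listed above.
example_groups = {
    "site": ["contaminated site"],
    "medium": ["soil", "groundwater"],
    "pollutant": ["polychlorinated biphenyls", "polycyclic aromatic hydrocarbons"],
}

query = build_query(example_groups)
```

Keeping each level as a separate group makes it easy to narrow or widen a search by editing one level without touching the others.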
S2. Read the digitized content of the retrieved investigation-type literature and preprocess all natural paragraphs in it. Using structure-item feature words, determine what each natural paragraph describes and divide the content into multiple content fragments, including plot basic information, research object basic information, and pollutant monitoring and evaluation information. A feature word is a phrase used in the literature to describe the content of the corresponding section.
For example, based on the feature word "study area overview", the natural paragraph containing that word and its related natural paragraphs can be grouped into one content fragment, which describes the overview information of the target polluted plot. The preset feature word list is searched for in the XML information of the electronic literature, and content fragments carrying different descriptive information are marked off. The common content fragments are "plot basic information, research object basic information, pollutant monitoring and evaluation information".
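The division of natural paragraphs into content fragments can be sketched as follows. This is an illustration only: it assumes the paragraphs have already been read into a list of strings (the XML handling is omitted), and apart from "study area overview" the feature words and fragment labels are hypothetical, since the feature word table is not reproduced here.

```python
from collections import defaultdict
from typing import Dict, List, Optional

# Feature word -> fragment label. Only the first entry comes from the text;
# the other two are assumed examples.
FEATURE_WORDS = {
    "study area overview": "plot_basic_info",
    "sampling and analysis": "research_object_info",
    "results and discussion": "pollutant_evaluation_info",
}


def split_into_fragments(paragraphs: List[str]) -> Dict[str, List[str]]:
    """Assign each paragraph to the fragment opened by the latest feature word."""
    fragments: Dict[str, List[str]] = defaultdict(list)
    current: Optional[str] = None
    for paragraph in paragraphs:
        lowered = paragraph.lower()
        for word, label in FEATURE_WORDS.items():
            if word in lowered:
                current = label  # a feature word starts a new fragment
                break
        if current is not None:  # text before the first feature word is skipped
            fragments[current].append(paragraph)
    return dict(fragments)
```

A paragraph that contains a feature word opens a fragment, and the paragraphs that follow it are treated as its related paragraphs until the next feature word appears.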
The structure-item feature words included in the plot basic information, research object basic information, and pollutant monitoring and evaluation information are shown in the following table:
S3. For the original content of each content fragment obtained by the division, use the preset analysis system to extract structured data from each fragment: convert the information from its original format into a predefined standard format, and organize the converted standard-format content into structured data. Verify the data by means including manual spot checks and machine learning.
Further, the predefined standard format includes unified geographic locations or coordinates, unified units of measurement, and unified pollutant names. The specific processing is as follows:
1. Processing of geographic locations or coordinates
Geographic locations are given in the literature mainly in one of the following ways: the latitude and longitude of the study site or region are given directly; the administrative division containing the study region is given; or the geographic location of the study region is not given explicitly and is referred to only as "a certain factory/site".
The three cases that occur in the literature are handled differently. For literature that explicitly gives latitude and longitude, the stated values are recorded accurately and converted to decimal form for storage. For literature that gives the administrative division of the study region, the original information is stored by administrative division and converted into more reliable latitude and longitude information by querying the publicly available "China 5-level administrative region database".
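Converting a stated latitude or longitude to decimal form can be sketched as below. The sketch assumes coordinates are written in degree/minute/second notation followed by a hemisphere letter, which is one common convention and not necessarily the only one found in the literature; the function name is hypothetical.

```python
import re

# Degrees, optional minutes, optional seconds, then a hemisphere letter.
_DMS = re.compile(
    r"""
    (\d+(?:\.\d+)?)\s*°\s*              # degrees
    (?:(\d+(?:\.\d+)?)\s*[′']\s*)?      # optional minutes
    (?:(\d+(?:\.\d+)?)\s*[″"]\s*)?      # optional seconds
    ([NSEW])                            # hemisphere letter
    """,
    re.VERBOSE,
)


def dms_to_decimal(text: str) -> float:
    """Convert a coordinate such as 31°14′24″N to signed decimal degrees."""
    match = _DMS.search(text)
    if match is None:
        raise ValueError(f"unrecognized coordinate: {text!r}")
    degrees, minutes, seconds, hemisphere = match.groups()
    value = float(degrees) + float(minutes or 0) / 60 + float(seconds or 0) / 3600
    # South and west are stored as negative values.
    return -value if hemisphere in "SW" else value
```

Storing signed decimal degrees keeps every plot in one numeric format, whichever notation the source document used.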
2. Processing of units for pollution evaluation factors
During extraction, the units reported differ from one document to another, so units such as mg/kg, ng/g, pg/g, and μg/g end up mixed during collation, which hinders later processing and analysis. All values are therefore converted to mg/kg, the unit text is removed from the pollution evaluation factor, and the corresponding cells hold pure numbers.
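The unit unification described above amounts to multiplying each value by a fixed factor. A minimal sketch, assuming the four units named in the text are the only ones encountered (the table and function names are hypothetical):

```python
# Factors for converting mass-per-mass concentrations to mg/kg.
# 1 ug/g = 1 mg/kg, 1 ng/g = 1e-3 mg/kg, 1 pg/g = 1e-6 mg/kg.
TO_MG_PER_KG = {
    "mg/kg": 1.0,
    "ug/g": 1.0,
    "ng/g": 1e-3,
    "pg/g": 1e-6,
}


def to_mg_per_kg(value: float, unit: str) -> float:
    """Return the value expressed in mg/kg as a plain number."""
    # Accept both the Greek letter mu and the micro sign for micrograms.
    key = unit.strip().lower().replace("μ", "u").replace("µ", "u")
    if key not in TO_MG_PER_KG:
        raise ValueError(f"unsupported unit: {unit!r}")
    return value * TO_MG_PER_KG[key]
```

For instance, a reported 250 ng/g becomes 0.25 once expressed in mg/kg. Rejecting unknown units, rather than guessing a factor, leaves those entries for manual review.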
3. Processing of pollutant names
During extraction, pollutant names cannot be kept consistent because of differences among the documents themselves: some documents give the Chinese name and others give the English abbreviation, which complicates later collation. With reference to the "Soil Pollution Risk Screening Standard for Construction Land (third draft for comments)" issued by the national Ministry of Environmental Protection, each pollutant that appears is unified under a single name. Take benzo[a]anthracene as an example: during collation, different documents used the following representations: "2,3-benzanthracene (2,3-BA)", "BaA", "benzo(a)anthracene", and "Benz(a)anthracene". During standardization these are unified into the single name "benzo[a]anthracene".
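Name unification can be implemented as a lookup from known variants to one canonical name. The sketch below uses only the benzo[a]anthracene variants quoted above; the matching rule (ignore case, spaces, and bracket style) and all identifiers are assumptions made for illustration.

```python
import re
from typing import Optional

# Canonical name -> variants seen in the literature (from the example above).
ALIASES = {
    "benzo[a]anthracene": [
        "2,3-benzanthracene (2,3-BA)",
        "BaA",
        "benzo(a)anthracene",
        "Benz(a)anthracene",
    ],
}


def _key(name: str) -> str:
    """Reduce a name to a comparison key: no spaces or brackets, lower case."""
    return re.sub(r"[\s\[\]()]", "", name).lower()


# Build the lookup once; the canonical name also maps to itself.
_LOOKUP = {}
for canonical, variants in ALIASES.items():
    for variant in [canonical, *variants]:
        _LOOKUP[_key(variant)] = canonical


def normalize_pollutant(name: str) -> Optional[str]:
    """Return the canonical name, or None when the variant is unknown."""
    return _LOOKUP.get(_key(name))
```

Returning `None` for an unknown variant, instead of passing the raw string through, makes unmatched names easy to collect for manual checking.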
The preset analysis system comprises:
The literature basic information module, which obtains literature-related information, including the literature title, source journal, and publication time;
The plot basic information module, which determines information about the target polluted plot, including geographic location, production information, polluted area, and planned land use;
The research object information module, which obtains information about the target plot, including control time and detection method;
The pollutant monitoring and evaluation data module, which obtains descriptive information on the pollution status of pollutants in the target polluted plot.
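The four modules listed above can be pictured as record types that together describe one processed document. The following sketch is an assumed data layout for illustration only; the class and field names are hypothetical and simply mirror the items each module is said to obtain.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class LiteratureInfo:
    title: str
    source_journal: str
    publication_time: str


@dataclass
class PlotInfo:
    geographic_location: str
    production_info: Optional[str] = None
    polluted_area: Optional[str] = None
    planned_use: Optional[str] = None


@dataclass
class ResearchObjectInfo:
    control_time: Optional[str] = None
    detection_method: Optional[str] = None


@dataclass
class PollutantEvaluation:
    pollutant: str
    description: str  # descriptive information on the pollution status


@dataclass
class ExtractedDocument:
    """One document's output, with one part per module."""
    literature: LiteratureInfo
    plot: PlotInfo
    research_object: ResearchObjectInfo
    pollutants: List[PollutantEvaluation] = field(default_factory=list)
```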
During the initial extraction and collation, the information from different documents may contain various omissions or flaws. Excel's built-in data validation function is combined with dedicated computer scripts so that, according to established decision logic, the computer assists in judging the confidence of the extracted data. Document entries with low confidence undergo a secondary check to prevent major extraction problems. Likewise, when different documents report different pollution investigation and evaluation results for the same polluted plot, the result with the higher confidence is selected after comprehensive consideration.
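The decision logic is not specified in detail, so the following is only a sketch of how a script might score entries. It assumes a rule-based score equal to the fraction of simple plausibility checks an entry passes; the checks, the 0.75 threshold, and all names are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Entry:
    plot_id: str
    pollutant: str
    value_mg_per_kg: Optional[float]
    latitude: Optional[float]
    longitude: Optional[float]


def confidence(entry: Entry) -> float:
    """Fraction of plausibility checks passed, between 0.0 and 1.0."""
    checks = [
        bool(entry.pollutant.strip()),
        entry.value_mg_per_kg is not None and entry.value_mg_per_kg >= 0,
        entry.latitude is not None and -90 <= entry.latitude <= 90,
        entry.longitude is not None and -180 <= entry.longitude <= 180,
    ]
    return sum(checks) / len(checks)


def needs_second_check(entry: Entry, threshold: float = 0.75) -> bool:
    """Flag low-confidence entries for the secondary check."""
    return confidence(entry) < threshold


def pick_most_confident(entries: List[Entry]) -> Entry:
    """Among conflicting entries for one plot, keep the highest-scoring one."""
    return max(entries, key=confidence)
```

An entry with a missing coordinate pair would score 0.5 under these checks and be routed to the secondary check.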
S4. Finally, query the extracted structured polluted plot information with a database description language to obtain the structured information on organochlorine-polluted construction land plots, and store and manage the structured information of the relevant polluted plots in a dedicated database. Specifically, database software can be used to build separate data tables for literature description information, plot description information, and pollutant description information, linked to one another through foreign keys.
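The three linked tables can be sketched with SQLite through Python's standard `sqlite3` module. The patent does not name a database product, so SQLite, the table and column names, and the sample rows are all assumptions made for illustration.

```python
import sqlite3

SCHEMA = """
CREATE TABLE literature (
    id INTEGER PRIMARY KEY,
    title TEXT NOT NULL,
    journal TEXT,
    year INTEGER
);
CREATE TABLE plot (
    id INTEGER PRIMARY KEY,
    literature_id INTEGER NOT NULL REFERENCES literature(id),
    name TEXT,
    latitude REAL,
    longitude REAL
);
CREATE TABLE pollutant_record (
    id INTEGER PRIMARY KEY,
    plot_id INTEGER NOT NULL REFERENCES plot(id),
    pollutant TEXT NOT NULL,
    value_mg_per_kg REAL
);
"""


def build_demo_db() -> sqlite3.Connection:
    """Create an in-memory database with one made-up row per table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")  # enforce the foreign-key links
    conn.executescript(SCHEMA)
    conn.execute("INSERT INTO literature VALUES (1, 'Example survey', 'Example Journal', 2018)")
    conn.execute("INSERT INTO plot VALUES (1, 1, 'Example factory site', 31.24, 121.47)")
    conn.execute("INSERT INTO pollutant_record VALUES (1, 1, 'benzo[a]anthracene', 0.25)")
    conn.commit()
    return conn


def query_pollutant(conn: sqlite3.Connection, pollutant: str) -> list:
    """Return (title, plot name, value) rows for one pollutant."""
    sql = """
        SELECT l.title, p.name, r.value_mg_per_kg
        FROM pollutant_record AS r
        JOIN plot AS p ON p.id = r.plot_id
        JOIN literature AS l ON l.id = p.literature_id
        WHERE r.pollutant = ?
    """
    return conn.execute(sql, (pollutant,)).fetchall()
```

Calling `query_pollutant(build_demo_db(), "benzo[a]anthracene")` follows the foreign keys from a pollutant record back to its plot and source document.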
The above is merely a specific embodiment of the present invention, but the scope of protection is not limited thereto. Any change or replacement that a person skilled in the art could readily conceive within the technical scope disclosed herein shall fall within the scope of protection of the present invention. Therefore, the scope of protection of the present invention shall be determined by the scope of protection of the claims.
Claims (6)
1. A literature-based method for comprehensively analyzing polluted plot information, characterized in that the method comprises the following steps:
S1: retrieving relevant literature from open platforms using characteristic keywords, managing and classifying the basic information of the retrieved literature with literature management software, screening out investigation-type literature, and establishing a research literature database; the characteristic keywords being selected at multiple macro and/or micro levels according to the research direction, and being related Chinese and English keywords covering site, pollution medium, industry type, organic pollutant category, and pollutant;
S2: reading the digitized content of the retrieved investigation-type literature, preprocessing all natural paragraphs in it, determining what each natural paragraph describes using structure-item feature words, and dividing the content into multiple content fragments including plot basic information, research object basic information, and pollutant monitoring and evaluation information, a feature word being a phrase used in the literature to describe the content of the corresponding section;
S3: extracting structured data from the original content of each content fragment obtained by the division;
S4: querying the extracted structured polluted plot information with a database description language.
2. The literature-based method for comprehensively analyzing polluted plot information according to claim 1, characterized in that the literature basic information in S1 includes the title, journal name, authors, abstract, keywords, publication time, and affiliation.
3. The literature-based method for comprehensively analyzing polluted plot information according to claim 1, characterized in that the pollution medium in S1 includes the soil medium and the groundwater medium of the affected plot.
4. The literature-based method for comprehensively analyzing polluted plot information according to claim 1, characterized in that the structure-item feature words included in the plot basic information, research object basic information, and pollutant monitoring and evaluation information in S2 are shown in the following table:
5. The literature-based method for comprehensively analyzing polluted plot information according to claim 1, characterized in that extracting structured data from each content fragment in S3 specifically comprises:
S31: converting the original content of each content fragment into a predefined standard format;
S32: organizing the converted standard-format content into structured data;
S33: verifying the data by means including manual spot checks and machine learning.
6. The literature-based method for comprehensively analyzing polluted plot information according to claim 5, characterized in that the predefined standard format in S31 includes unified geographic locations or coordinates, unified units of measurement, and unified pollutant names.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910084134.9A CN109815315B (en) | 2019-01-29 | 2019-01-29 | Method for comprehensively analyzing polluted plot information based on literature |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910084134.9A CN109815315B (en) | 2019-01-29 | 2019-01-29 | Method for comprehensively analyzing polluted plot information based on literature |
Publications (2)
Publication Number | Publication Date |
---|---|
CN109815315A true CN109815315A (en) | 2019-05-28 |
CN109815315B CN109815315B (en) | 2020-09-22 |
Family
ID=66605550
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910084134.9A Expired - Fee Related CN109815315B (en) | 2019-01-29 | 2019-01-29 | Method for comprehensively analyzing polluted plot information based on literature |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109815315B (en) |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111582734A (en) * | 2020-05-12 | 2020-08-25 | 上海海洋大学 | Ocean pollution comparative analysis and risk assessment intelligent method based on python crawler system and SVM |
CN112240869A (en) * | 2020-11-16 | 2021-01-19 | 内蒙古自治区农牧业科学院 | Grassland plot information extraction method based on high-resolution remote sensing image |
CN112860735A (en) * | 2020-12-17 | 2021-05-28 | 北京航空航天大学 | Online database query analysis system and method for persistent organic pollutant exposure |
CN118210779A (en) * | 2024-03-20 | 2024-06-18 | 中国农业科学院农业环境与可持续发展研究所 | Construction method and device of agricultural pollution database, electronic equipment and storage medium |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103491116A (en) * | 2012-06-12 | 2014-01-01 | 深圳市世纪光速信息技术有限公司 | Method and device for processing text-related structural data |
CN104331438A (en) * | 2014-10-24 | 2015-02-04 | 北京奇虎科技有限公司 | Method and device for selectively extracting content of novel webpage |
CN105631055A (en) * | 2016-03-11 | 2016-06-01 | 中国环境科学研究院 | Method and device for displaying water environment quality research data of drainage basin |
CN106844671A (en) * | 2017-01-22 | 2017-06-13 | 北京理工大学 | medical literature intelligent processing method and system |
CN106933846A (en) * | 2015-12-30 | 2017-07-07 | 中国医学科学院医学信息研究所 | The destructuring confluence analysis method of tumour related science document and science data |
CN107876421A (en) * | 2017-10-17 | 2018-04-06 | 安徽草帽网络有限公司 | A kind of agricultural material product intellectuality method for sorting based on information analysis |
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111582734A (en) * | 2020-05-12 | 2020-08-25 | 上海海洋大学 | Ocean pollution comparative analysis and risk assessment intelligent method based on python crawler system and SVM |
CN112240869A (en) * | 2020-11-16 | 2021-01-19 | 内蒙古自治区农牧业科学院 | Grassland plot information extraction method based on high-resolution remote sensing image |
CN112860735A (en) * | 2020-12-17 | 2021-05-28 | 北京航空航天大学 | Online database query analysis system and method for persistent organic pollutant exposure |
CN112860735B (en) * | 2020-12-17 | 2022-06-14 | 北京航空航天大学 | Online database query analysis system and method for persistent organic pollutant exposure |
CN118210779A (en) * | 2024-03-20 | 2024-06-18 | 中国农业科学院农业环境与可持续发展研究所 | Construction method and device of agricultural pollution database, electronic equipment and storage medium |
Also Published As
Publication number | Publication date |
---|---|
CN109815315B (en) | 2020-09-22 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
CF01 | Termination of patent right due to non-payment of annual fee | Granted publication date: 20200922; termination date: 20220129 |