CN109815315A - Method for comprehensively analyzing polluted plot information based on literature - Google Patents

Method for comprehensively analyzing polluted plot information based on literature

Info

Publication number
CN109815315A
CN109815315A CN201910084134.9A CN201910084134A CN109815315A CN 109815315 A CN109815315 A CN 109815315A CN 201910084134 A CN201910084134 A CN 201910084134A CN 109815315 A CN109815315 A CN 109815315A
Authority
CN
China
Prior art keywords
document
polluted plot
analysis method
comprehensive analysis
contents
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910084134.9A
Other languages
Chinese (zh)
Other versions
CN109815315B (en)
Inventor
马妍
阮子渊
运晓彤
谢云峰
杜晓明
史怡
谷庆宝
王佳琪
张梦頔
张美娟
周生坤
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China University of Mining and Technology Beijing CUMTB
Original Assignee
China University of Mining and Technology Beijing CUMTB
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China University of Mining and Technology Beijing CUMTB
Priority to CN201910084134.9A
Publication of CN109815315A
Application granted
Publication of CN109815315B
Expired - Fee Related
Anticipated expiration


Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present invention provides a literature-based method for comprehensively analyzing polluted plot information. The method includes the following steps: retrieving relevant documents from open platforms using characteristic keywords; reading the digital content of the retrieved documents and dividing it into multiple content segments; extracting structured data from the original content of each content segment using the analysis system; and querying the extracted structured polluted-plot information with a database description language. For the massive body of literature in the field of soil environmental supervision, the method uses computer assistance to efficiently extract structured information on the pollution status of polluted plots and store it in a dedicated database, and is both efficient and accurate.

Description

Method for comprehensively analyzing polluted plot information based on literature
Technical field
The invention belongs to the field of soil pollution supervision, and in particular relates to a literature-based method for comprehensively analyzing polluted plot information.
Background
In the field of soil environmental supervision, researchers and technical staff often need to obtain information such as the soil pollution status of specific plots quickly and accurately from a massive amount of written material, so that they can analyze it in depth or carry out further knowledge discovery. Obtaining structured descriptions from a large body of documents with complex content is therefore an important problem facing researchers.
In current practice, extracting research content from soil investigation literature relies mainly on reading each document closely by hand and then annotating and extracting the relevant content. This approach requires many experienced staff with specialist knowledge to do a large amount of work; it is costly, inefficient, and error-prone.
Summary of the invention
In view of the above problems, the object of the present invention is to provide a literature-based method for comprehensively analyzing polluted plot information. Using computer assistance, the method standardizes the scope and content of document identification and supplements it with secondary verification, which addresses the problems above.
The object of the present invention is achieved through the following technical solutions:
A literature-based system for comprehensively analyzing polluted plot information, comprising a document basic information module, a plot basic information module, a research object information module, and a pollutant monitoring and evaluation data module;
The document basic information module is used to obtain document-related information, including the document title, source journal, and publication date;
The plot basic information module is used to determine information about the target polluted plot, including its geographical location, production information, polluted area, and planned land use;
The research object information module is used to obtain information about the target plot, including the control time and detection method;
The pollutant monitoring and evaluation data module is used to obtain descriptive information on the pollution status of the pollutants in the target polluted plot.
A literature-based method for comprehensively analyzing polluted plot information, comprising the following steps:
S1: retrieve relevant documents from open platforms using characteristic keywords, use document management software to manage and classify the basic information of the retrieved documents, and establish a research literature database;
Characteristic keywords are selected at multiple macro and/or micro levels according to the research direction; they are related Chinese and English keywords covering site, pollution medium, industry type, organic pollutant category, and pollutant;
S2: read the digital content of the retrieved investigation-type documents, preprocess all natural paragraphs in the document content, determine what each natural paragraph describes according to structural feature words, and divide the content into multiple content segments, including plot basic information, research object basic information, and pollutant monitoring and evaluation information; a feature word is a phrase used in a document to describe the content of the corresponding section;
S3: extract structured data from the original content of each content segment after division;
S4: query the extracted structured polluted-plot information using a database description language.
Further, the document basic information in S1 includes the title, journal name, authors, abstract, keywords, publication date, and affiliation.
Further, the pollution medium in S1 includes the soil medium and the groundwater medium of the affected plot.
Further, the structural feature words included in the plot basic information, research object basic information, and pollutant monitoring and evaluation information in S2 are shown in the following table:
Further, the extraction of structured data from each content segment in S3 specifically comprises:
S31: convert the original content of each content segment into a predefined standard format;
S32: organize the converted standard-format content into structured data;
S33: verify the data by means including manual spot checks and machine learning.
Further, the predefined standard format in S31 includes unified geographical locations or coordinates, unified units of measurement, and unified pollutant names.
Compared with the prior art, the present invention has the following beneficial effects:
For the massive body of literature in the field of soil environmental supervision, the literature-based method for comprehensively analyzing polluted plot information uses computer assistance to efficiently extract structured information on the pollution status of polluted plots and store it in a dedicated database, and is both efficient and accurate.
The present invention is described in further detail below with reference to the drawing and embodiments.
Brief description of the drawings
Fig. 1 is a flow chart of the literature-based method for comprehensively analyzing polluted plot information according to the present invention.
Detailed description of the embodiments
The invention is further described below with reference to specific embodiments, but the protection scope of the invention is not limited thereto.
Referring to Fig. 1, this embodiment proposes a method for obtaining structured information on organochlorine-contaminated construction land plots from the literature, described as follows:
S1: search databases including Web of Science and CNKI with specific keywords for research literature on polluted plots, use document management software to manage and classify the basic information of the retrieved documents (title, journal name, authors, abstract, keywords, publication date, and affiliation), and establish a research literature database;
Characteristic keywords are selected at multiple macro and/or micro levels according to the research direction; they are related Chinese and English keywords covering site, pollution medium (including the soil medium and the groundwater medium of the affected plot), industry type, organic pollutant category, and pollutant.
Taking organic contamination of construction land as an example, the specific keywords cover:
Macro level 1: site
Macro level 2: pollution medium (soil and groundwater)
2.1 soil
2.2 groundwater
Macro level 3: industry classification (11 industries)
Macro level 4: organic pollutants
4.1 general categories (broad classes)
4.2 volatile organic contaminants (specific)
4.3 semi-volatile organic contaminants (specific)
4.4 polychlorinated biphenyls (specific)
4.5 polycyclic aromatic hydrocarbons (specific)
4.6 benzene homologues (specific)
4.7 total petroleum hydrocarbons (specific)
4.8 persistent organic pollutants (specific)
Macro level 5: pollutant-related descriptions
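As a rough illustration of how keyword levels like these might be combined into a single search expression, the following Python sketch ORs the terms within each level and ANDs the levels together. The concrete terms, the `build_query` helper, and the generic boolean syntax are all assumptions made for illustration; they are not taken from the patent and do not reflect the query syntax of Web of Science or CNKI.

```python
# Hypothetical keyword levels; the actual Chinese and English terms used by
# the inventors are not listed in the text, so these are placeholders.
KEYWORD_LEVELS = {
    "site": ["contaminated site", "brownfield"],
    "medium": ["soil", "groundwater"],
    "pollutant": ["polycyclic aromatic hydrocarbons", "PAHs"],
}


def build_query(levels):
    """Build a generic boolean query: OR inside a level, AND across levels."""
    clauses = []
    for terms in levels.values():
        quoted = " OR ".join(f'"{term}"' for term in terms)
        clauses.append(f"({quoted})")
    return " AND ".join(clauses)


if __name__ == "__main__":
    print(build_query(KEYWORD_LEVELS))
```

Keeping each level as its own list makes it easy to add a specific pollutant class (for example, level 4.5) without touching the other levels.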
S2: read the digital content of the retrieved investigation-type documents, preprocess all natural paragraphs in the document content, determine what each natural paragraph describes according to structural feature words, and divide the content into multiple content segments, including plot basic information, research object basic information, and pollutant monitoring and evaluation information;
A feature word is a phrase used in a document to describe the content of the corresponding section. For example, using the feature word "study area overview", the natural paragraph containing that phrase and its related natural paragraphs can be grouped into one content segment, which describes the general situation of the target polluted plot. The configured feature word list is searched for in the XML information of the electronic document, and content segments carrying different descriptive information are marked off; the common content segments are "plot basic information", "research object basic information", and "pollutant monitoring and evaluation information".
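The feature-word division described above can be sketched roughly as follows. This is a minimal sketch under stated assumptions: the feature words in `FEATURE_WORDS` are hypothetical English stand-ins, paragraphs are assumed to be already extracted from the document XML as plain strings, and "related paragraphs" is interpreted as the paragraphs that follow a feature-word paragraph until the next feature word appears.

```python
# Hypothetical feature words per segment type; illustrative placeholders only.
FEATURE_WORDS = {
    "plot_basic_info": ["study area", "site description"],
    "research_object_info": ["sampling", "materials and methods"],
    "pollutant_monitoring_evaluation": ["results", "risk assessment"],
}


def segment_paragraphs(paragraphs, feature_words=FEATURE_WORDS):
    """Group natural paragraphs into content segments by feature word.

    A paragraph containing a feature word opens (or continues) the matching
    segment; paragraphs without any feature word are attached to the most
    recently opened segment. Paragraphs before the first match are skipped.
    """
    segments = {label: [] for label in feature_words}
    current = None
    for para in paragraphs:
        lowered = para.lower()
        for label, words in feature_words.items():
            if any(word in lowered for word in words):
                current = label
                break
        if current is not None:
            segments[current].append(para)
    return segments
```

Substring matching is the simplest possible rule; a real feature-word list would likely need section-heading detection to avoid false matches in running text.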
The structural feature words included in the plot basic information, research object basic information, and pollutant monitoring and evaluation information are shown in the following table:
S3: for the original content of each content segment after division, use the preset analysis system to extract structured data: convert the information from its original format into the predefined standard format, organize the converted standard-format content into structured data, and verify the data by means including manual spot checks and machine learning.
Further, the predefined standard format includes unified geographical locations or coordinates, unified units of measurement, and unified pollutant names, handled as follows:
1. Geographical location or coordinates
Documents give geographical location in three main ways: they directly give the longitude and latitude of the study site or region; they give the administrative division in which the study area lies; or they give no explicit location and refer to the study area only as "a certain factory/site".
These cases are handled differently. For documents that state longitude and latitude explicitly, the coordinates are recorded exactly and converted to decimal form for storage. For documents that give the administrative division of the study area, the original information is stored by administrative division and converted into more reliable longitude and latitude information by querying the "China five-level administrative region database" published online.
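The conversion of recorded coordinates to decimal form could look something like the sketch below. The accepted input notation (degrees, optional minutes and seconds, optional hemisphere letter) and the helper name `dms_to_decimal` are assumptions; the text does not say which coordinate notations the documents use.

```python
import re

# Degrees, optional minutes, optional seconds, optional hemisphere letter.
DMS_PATTERN = re.compile(
    r"""(\d+(?:\.\d+)?)\s*°\s*          # degrees
        (?:(\d+(?:\.\d+)?)\s*[′']\s*)?  # optional minutes
        (?:(\d+(?:\.\d+)?)\s*[″"]\s*)?  # optional seconds
        ([NSEW])?                       # optional hemisphere letter
    """,
    re.VERBOSE | re.IGNORECASE,
)


def dms_to_decimal(text):
    """Convert a degrees/minutes/seconds string to signed decimal degrees."""
    match = DMS_PATTERN.search(text)
    if match is None:
        raise ValueError(f"no coordinate found in {text!r}")
    degrees, minutes, seconds, hemisphere = match.groups()
    value = float(degrees) + float(minutes or 0) / 60 + float(seconds or 0) / 3600
    # Southern and western hemispheres are stored as negative values.
    if hemisphere and hemisphere.upper() in ("S", "W"):
        value = -value
    return value
```

Storing signed decimal degrees keeps each coordinate in a single numeric column, which simplifies later range checks and mapping.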
2. Units of pollution evaluation factors
The units reported in different documents are not uniform: mg/kg, ng/g, pg/g, μg/g, and other units appear mixed together during collation, which hinders later processing and analysis. All values are therefore converted to mg/kg, the unit text is removed from the pollution evaluation factor fields, and each cell stores a pure number.
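The unit unification can be sketched as a small lookup of conversion factors to mg/kg. The factors follow directly from the unit definitions (1 μg/g = 1 mg/kg, 1 ng/g = 0.001 mg/kg, 1 pg/g = 0.000001 mg/kg); the function name and the ASCII alias "ug/g" are assumptions added for illustration.

```python
# Multiplicative factors that convert each unit named in the text to mg/kg.
TO_MG_PER_KG = {
    "mg/kg": 1.0,
    "μg/g": 1.0,   # 1 μg/g equals 1 mg/kg
    "ug/g": 1.0,   # ASCII spelling of μg/g (assumed alias)
    "ng/g": 1e-3,
    "pg/g": 1e-6,
}


def to_mg_per_kg(value, unit):
    """Return the concentration as a pure number in mg/kg."""
    # Map the micro sign (U+00B5) to the Greek letter mu (U+03BC).
    key = unit.strip().replace("\u00b5", "\u03bc")
    try:
        factor = TO_MG_PER_KG[key]
    except KeyError:
        raise ValueError(f"unsupported unit: {unit!r}") from None
    return value * factor
```

Raising an error for an unknown unit, rather than passing the value through, means an unrecognized unit surfaces during collation instead of silently entering the table.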
3. Pollutant names
During extraction, pollutant names are not uniform across documents: some documents give the Chinese name and others the English abbreviation, which complicates later collation. Existing pollutant names are unified to a single name with reference to the "Construction Land Soil Pollution Risk Screening Standard (third exposure draft)" issued by the Ministry of Environmental Protection. For example, for benzo[a]anthracene, different documents used the following representations during collation: "2,3-benzanthracene (2,3-BA)", "BaA", "benzo(a)anthracene", and "Benz(a)anthracene". During standardization they are all unified to the single name "benzo[a]anthracene".
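One way to sketch this name unification is an alias table keyed on a whitespace-insensitive, case-insensitive form of each name. Only the benzo[a]anthracene variants quoted in the text are included; the helper names are hypothetical, and a real table would be built from the referenced standard.

```python
import re

CANONICAL = "benzo[a]anthracene"

# Variants quoted in the text for this one pollutant.
ALIASES = [
    "2,3-benzanthracene (2,3-BA)",
    "BaA",
    "benzo(a)anthracene",
    "Benz(a)anthracene",
]


def _key(name):
    """Lookup key that ignores whitespace and letter case."""
    return re.sub(r"\s+", "", name).lower()


ALIAS_INDEX = {_key(alias): CANONICAL for alias in ALIASES}
ALIAS_INDEX[_key(CANONICAL)] = CANONICAL


def normalize_pollutant(name):
    """Return the unified pollutant name, or None if the name is unknown."""
    return ALIAS_INDEX.get(_key(name))
```

Returning `None` for unknown names lets a caller route those entries to manual review instead of guessing.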
The preset analysis system should include:
The document basic information module, used to obtain document-related information, including the document title, source journal, and publication date;
The plot basic information module, used to determine information about the target polluted plot, including its geographical location, production information, polluted area, and planned land use;
The research object information module, used to obtain information about the target plot, including the control time and detection method;
The pollutant monitoring and evaluation data module, used to obtain descriptive information on the pollution status of the pollutants in the target polluted plot.
During the initial extraction and collation, the information from different documents may have various omissions or flaws. Excel's built-in data validation, combined with dedicated scripts, is used to judge the confidence of the extracted data by computer according to established decision logic. Document entries with low confidence are checked a second time to prevent major extraction errors. Likewise, when different documents report different pollution survey and evaluation results for the same polluted plot, the result with the higher confidence is selected.
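The text does not spell out the decision logic used to score confidence, so the following is only a sketch under assumptions: confidence is taken to be the fraction of simple plausibility checks that a record passes, entries below a threshold are flagged for the second check, and the highest-scoring record is kept per plot. The field names, checks, and threshold are all hypothetical.

```python
def confidence(record):
    """Fraction of hypothetical plausibility checks that the record passes."""
    lat = record.get("latitude")
    lon = record.get("longitude")
    conc = record.get("concentration_mg_kg")
    checks = [
        bool(record.get("plot_name")),
        lat is not None and -90 <= lat <= 90,
        lon is not None and -180 <= lon <= 180,
        conc is not None and conc >= 0,
    ]
    return sum(checks) / len(checks)


def needs_recheck(record, threshold=0.75):
    """Flag low-confidence entries for a second, manual check."""
    return confidence(record) < threshold


def best_per_plot(records):
    """Keep the highest-confidence record for each named plot."""
    best = {}
    for record in records:
        name = record.get("plot_name")
        if not name:
            continue
        if name not in best or confidence(record) > confidence(best[name]):
            best[name] = record
    return best
```

In practice the checks would mirror whatever validation rules are configured in the spreadsheet, so that the script and the manual review apply the same criteria.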
S4: finally, query the extracted structured polluted-plot information using a database description language to obtain structured information on organochlorine-contaminated construction land plots, and store and manage the structured information of the relevant polluted plots in a dedicated database. Specifically, database software can be used to build separate data tables for document description information, plot description information, and pollutant description information, linked to one another through foreign keys.
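A three-table layout linked by foreign keys, as described in S4, might be sketched with Python's built-in `sqlite3` module as follows. The table names, column names, and sample rows are assumptions for illustration only; the patent does not name a database product or a schema, and the inserted values are made-up placeholders rather than data from any document.

```python
import sqlite3

# Hypothetical schema: one table each for documents, plots, and pollutants.
SCHEMA = """
CREATE TABLE document (
    doc_id    INTEGER PRIMARY KEY,
    title     TEXT NOT NULL,
    journal   TEXT,
    published TEXT
);
CREATE TABLE plot (
    plot_id   INTEGER PRIMARY KEY,
    doc_id    INTEGER NOT NULL REFERENCES document(doc_id),
    name      TEXT,
    latitude  REAL,
    longitude REAL
);
CREATE TABLE pollutant_record (
    record_id           INTEGER PRIMARY KEY,
    plot_id             INTEGER NOT NULL REFERENCES plot(plot_id),
    pollutant           TEXT NOT NULL,
    concentration_mg_kg REAL
);
"""


def build_demo_db():
    """Create an in-memory database holding placeholder rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when enabled
    conn.executescript(SCHEMA)
    conn.execute("INSERT INTO document VALUES (1, 'Example survey', 'Example Journal', '2015')")
    conn.execute("INSERT INTO plot VALUES (1, 1, 'Example plant', 39.91, 116.5)")
    conn.execute("INSERT INTO pollutant_record VALUES (1, 1, 'benzo[a]anthracene', 0.35)")
    conn.commit()
    return conn


def pollutants_for_plot(conn, plot_name):
    """Join the three tables to list pollutant records for one plot."""
    cursor = conn.execute(
        """SELECT d.title, r.pollutant, r.concentration_mg_kg
           FROM pollutant_record AS r
           JOIN plot AS p ON p.plot_id = r.plot_id
           JOIN document AS d ON d.doc_id = p.doc_id
           WHERE p.name = ?
           ORDER BY r.pollutant""",
        (plot_name,),
    )
    return cursor.fetchall()
```

Calling `pollutants_for_plot(build_demo_db(), "Example plant")` returns the joined rows for that plot. Splitting documents, plots, and pollutant records into separate tables lets several documents describe the same kind of information without repeating the document metadata on every row.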
The above is merely a specific embodiment of the present invention, but the protection scope of the invention is not limited thereto. Any change or replacement that a person skilled in the art could readily conceive within the technical scope disclosed by the invention shall fall within its protection scope. The protection scope of the invention shall therefore be determined by the protection scope of the claims.

Claims (6)

1. A literature-based method for comprehensively analyzing polluted plot information, characterized in that the method comprises the following steps:
S1: retrieve relevant documents from open platforms using characteristic keywords, use document management software to manage and classify the basic information of the retrieved documents, select the investigation-type documents, and establish a research literature database; characteristic keywords are selected at multiple macro and/or micro levels according to the research direction, and are related Chinese and English keywords covering site, pollution medium, industry type, organic pollutant category, and pollutant;
S2: read the digital content of the retrieved investigation-type documents, preprocess all natural paragraphs in the document content, determine what each natural paragraph describes according to structural feature words, and divide the content into multiple content segments, including plot basic information, research object basic information, and pollutant monitoring and evaluation information; a feature word is a phrase used in a document to describe the content of the corresponding section;
S3: extract structured data from the original content of each content segment after division;
S4: query the extracted structured polluted-plot information using a database description language.
2. The literature-based method for comprehensively analyzing polluted plot information according to claim 1, characterized in that the document basic information in S1 includes the title, journal name, authors, abstract, keywords, publication date, and affiliation.
3. The literature-based method for comprehensively analyzing polluted plot information according to claim 1, characterized in that the pollution medium in S1 includes the soil medium and the groundwater medium of the affected plot.
4. The literature-based method for comprehensively analyzing polluted plot information according to claim 1, characterized in that the structural feature words included in the plot basic information, research object basic information, and pollutant monitoring and evaluation information in S2 are shown in the following table:
5. The literature-based method for comprehensively analyzing polluted plot information according to claim 1, characterized in that the extraction of structured data from each content segment in S3 specifically comprises:
S31: convert the original content of each content segment into a predefined standard format;
S32: organize the converted standard-format content into structured data;
S33: verify the data by means including manual spot checks and machine learning.
6. The literature-based method for comprehensively analyzing polluted plot information according to claim 5, characterized in that the predefined standard format in S31 includes unified geographical locations or coordinates, unified units of measurement, and unified pollutant names.
CN201910084134.9A 2019-01-29 2019-01-29 Method for comprehensively analyzing polluted plot information based on literature Expired - Fee Related CN109815315B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910084134.9A CN109815315B (en) 2019-01-29 2019-01-29 Method for comprehensively analyzing polluted plot information based on literature

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910084134.9A CN109815315B (en) 2019-01-29 2019-01-29 Method for comprehensively analyzing polluted plot information based on literature

Publications (2)

Publication Number Publication Date
CN109815315A (en) 2019-05-28
CN109815315B (en) 2020-09-22

Family

ID=66605550

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910084134.9A Expired - Fee Related CN109815315B (en) 2019-01-29 2019-01-29 Method for comprehensively analyzing polluted plot information based on literature

Country Status (1)

Country Link
CN (1) CN109815315B (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111582734A (en) * 2020-05-12 2020-08-25 上海海洋大学 Ocean pollution comparative analysis and risk assessment intelligent method based on python crawler system and SVM
CN112240869A (en) * 2020-11-16 2021-01-19 内蒙古自治区农牧业科学院 Grassland plot information extraction method based on high-resolution remote sensing image
CN112860735A (en) * 2020-12-17 2021-05-28 北京航空航天大学 Online database query analysis system and method for persistent organic pollutant exposure
CN118210779A (en) * 2024-03-20 2024-06-18 中国农业科学院农业环境与可持续发展研究所 Construction method and device of agricultural pollution database, electronic equipment and storage medium

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103491116A (en) * 2012-06-12 2014-01-01 深圳市世纪光速信息技术有限公司 Method and device for processing text-related structural data
CN104331438A (en) * 2014-10-24 2015-02-04 北京奇虎科技有限公司 Method and device for selectively extracting content of novel webpage
CN105631055A (en) * 2016-03-11 2016-06-01 中国环境科学研究院 Method and device for displaying water environment quality research data of drainage basin
CN106844671A (en) * 2017-01-22 2017-06-13 北京理工大学 medical literature intelligent processing method and system
CN106933846A (en) * 2015-12-30 2017-07-07 中国医学科学院医学信息研究所 The destructuring confluence analysis method of tumour related science document and science data
CN107876421A (en) * 2017-10-17 2018-04-06 安徽草帽网络有限公司 A kind of agricultural material product intellectuality method for sorting based on information analysis

Also Published As

Publication number Publication date
CN109815315B (en) 2020-09-22

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20200922

Termination date: 20220129