CN110222057A

CN110222057A - A method for constructing a formatted aerosol literature database

Info

Publication number: CN110222057A
Application number: CN201910469969.6A
Authority: CN
Inventors: 张克俊; 郑俊; 黄小倚; 陈洁; 刘�东; 毕磊
Original assignee: Zhejiang University ZJU
Current assignee: Zhejiang University ZJU
Priority date: 2019-05-31
Filing date: 2019-05-31
Publication date: 2019-09-10

Abstract

The invention discloses a method for constructing a formatted aerosol literature database, comprising: (1) collecting aerosol literature from around the world and determining its literature-related attributes, text statistical information attributes, and data chart specific-value attributes; (2) extracting the literature-related attributes of each aerosol document to form a literature-related attribute table; (3) performing PDF-to-TXT conversion, text preprocessing, and regular-expression template matching on each aerosol document to form a text statistics table; (4) extracting data point coordinates from the data charts to form a data chart specific-value table; (5) establishing index relationships among the literature-related attribute table, the text statistics table, and the data chart specific-value table, and storing the corresponding literature-related attributes, text statistical information, and data chart specific values to form the formatted aerosol literature database. The method further includes providing a crowdsourcing platform and knowledge services.

Description

A method for constructing a formatted aerosol literature database

Technical field

The invention belongs to the field of database construction, and in particular relates to a method for constructing a formatted aerosol literature database.

Background art

In the aerosol field, besides field measurements, another important source of data for scholars is the vast body of historical literature. Obtaining a reasonably comprehensive view of such data usually depends on literature reviews. For review authors, this means reading a large volume of literature and collating it statistically, and a review often has to be updated every few years to keep its statistics current, which is laborious and tedious. For readers, obtaining such statistical knowledge quickly still requires searching for review articles and judging their authority, and the knowledge obtained is limited by the regions, time periods, and other aspects that the review authors chose to present.

The idea of structured literature analysis has already been applied in other fields, such as clinical medicine and socialized government affairs. Clinical medicine has the systematic review and meta-analysis. Unlike the narrative reviews in general use today, a systematic review concentrates on one specific question, collects all available published and unpublished literature as comprehensively and with as little bias as possible, and analyzes and evaluates it against criteria drawn up in advance; the authors periodically collect new original research data to update and supplement the review. Meta-analysis is the process, within a systematic review, of pooling data with appropriate statistical methods to reach an overall conclusion in the form of a quantified average effect; it is also called a quantitative systematic review. Systematic reviews overcome the incompleteness and lack of objectivity of narrative reviews, but they are very time-consuming and resource-intensive. One systematic review usually takes a team a year or longer and places high demands on the team's information retrieval skills and domain knowledge, and the need for updates and supplements makes the required effort grow without bound.

In the field of socialized government affairs, scholars proposed a structured literature analysis framework in 2015 and used it to summarize and analyze the existing international research literature in that field, making some progress and providing a basis and inspiration for further research. The framework mainly covers the research context of the literature (research topic, research region, unit of analysis), the research method (research type, research method, number of research methods, and data source), and the publication information (publication time, journal). The literature analysis was completed entirely by three people reading manually. It can thus be seen that, although some early work exists in other fields, comparable work in the aerosol field is rare, and building a formatted database in this way is very time-consuming and labor-intensive.

Summary of the invention

The object of the present invention is to provide a method for constructing a formatted aerosol literature database. The method collects the data scattered across the vast aerosol historical literature, stores it in a formatted way to build an electronic database, and sets up a website offering additional services such as visualization, so as to attract users to expand the data voluntarily through a crowdsourcing platform.

The technical solution of the present invention is as follows:

A method for constructing a formatted aerosol literature database, comprising the following steps:

(1) collecting aerosol literature from around the world, and determining the literature-related attributes, text statistical information attributes, and data chart specific-value attributes of the aerosol literature data;

(2) extracting the literature-related attributes of each aerosol document to form a literature-related attribute table;

(3) performing PDF-to-TXT conversion, text preprocessing, and regular-expression template matching on each aerosol document to extract the text statistical information and form a text statistics table;

(4) extracting data point coordinates from the data charts to obtain their specific values and form a data chart specific-value table;

(5) establishing index relationships among the literature-related attribute table, the text statistics table, and the data chart specific-value table, and storing the corresponding literature-related attributes, text statistical information, and data chart specific values to form the formatted aerosol literature database.

The method further includes providing a crowdsourcing platform and knowledge services.

Through the crowdsourcing platform, users enter literature data and text statistical data and upload images, thereby expanding the aerosol literature data;

The knowledge services provide data statistical analysis and visualization as well as literature guidance. The Echarts.js plug-in is used to visualize data as maps, line charts, and box plots, supporting statistical analysis by region, wavelength, height, time, and so on. Based on the formatted database, users can search by specifying the aerosol type, geographic location, value range, observation wavelength, observation time, and so on, and the corresponding literature information is returned.

Compared with the prior art, the invention constructs the formatted aerosol literature database by a simpler method. The database gives users aerosol literature that is easy to query; in addition, a website offering additional services such as visualization is set up to attract users to expand the data voluntarily through the crowdsourcing platform.

Brief description of the drawings

To explain the embodiments of the invention or the technical solutions of the prior art more clearly, the drawings needed to describe the embodiments or the prior art are briefly introduced below. The drawings described below are merely some embodiments of the invention; a person of ordinary skill in the art can derive other drawings from them without creative effort.

Fig. 1 is a block diagram of an embodiment of the method for constructing a formatted aerosol literature database provided by the invention;

Fig. 2 (a) is an original scatter plot, and Fig. 2 (b) shows the result of scatter point recognition on Fig. 2 (a);

Fig. 3 (a) is an original line chart, and Fig. 3 (b) shows the result of line recognition on Fig. 3 (a);

Fig. 4 (a) is an original bar chart, and Fig. 4 (b) shows the result of bar recognition on Fig. 4 (a);

Fig. 5 is a schematic diagram of the crowdsourcing platform.

Detailed description of the embodiments

To make the objectives, technical solutions, and advantages of the invention clearer, the invention is described in further detail below with reference to the drawings and embodiments. The specific embodiments described here only explain the invention and do not limit its scope of protection.

Against the background described above, the invention uses surveys and interviews to study the key parameters of the aerosol field and derives a structured recording schema for aerosol literature data comprising 15 attributes. Aerosol papers from around the world collected over the past 10 years serve as the data source. To reduce labor and time costs, text mining methods and image processing tools from computer science are combined to extract automatically, according to this schema, the important information contained in aerosol literature and to generate a formatted online database.

In the aerosol field, text is commonly used to describe statistical information, while figures visualize the specific data values. Text mining therefore involves knowledge extraction techniques from the NLP field, and image data mining involves techniques for recovering coordinate data.

The appearance of new papers is monitored by a crawler, and a crowdsourced data submission platform is opened as an online website, which keeps the format of newly added data uniform. The platform also provides services such as data visualization and literature guidance. Besides giving scholars in the field a convenient way to acquire knowledge, these services are intended to attract authors to supplement new data voluntarily and so reduce the cost of expanding the database, ultimately yielding a real-time, comprehensive, and extensible knowledge acquisition platform built on a global aerosol database.

Referring to Fig. 1, the method for constructing a formatted aerosol literature database provided in this embodiment comprises the following steps:

S101: Collect aerosol literature from around the world, and determine the literature-related attributes, text statistical information attributes, and data chart specific-value attributes of the aerosol literature data.

In this embodiment, more than 10000 aerosol documents from around the world were collected.

A questionnaire experiment was designed to determine the literature-related attributes, text statistical information attributes, and data chart specific-value attributes. The subjects were 30 graduate students of the optoelectronics college. The questionnaire had 5 parts: publication information, basic aerosol information, optical parameter information, spatio-temporal information, and particle microphysical properties. For each of the 5 parts, the subjects listed the important attributes that they thought a document containing aerosol data might include. The attributes in the 30 returned questionnaires were collated, merged, and counted, giving the preliminary candidate attributes of the schema shown in Table 1.

Table 1

Subsequently, experts and scholars in the aerosol field were interviewed and asked whether the attributes in this preliminary result were useful and whether anything had been omitted, and their suggested modifications were recorded. The interviews showed that aerosol data generally appear in the literature in two forms: text describes the original statistical values, and images present the corresponding specific data points. Taking into account the importance of the data and the clarity of the records, the attributes of the schema were finalized as 15. After a follow-up interview with the experts confirmed this result, the four classes of data other than publication information were merged and re-divided into two classes, text statistical data and image specific data. The attributes of the schema therefore fall into three major classes: literature attributes, text data attributes, and image data attributes. The result is shown in Table 2:

Table 2

The data in the aerosol field are scattered across text and images. To save manual effort, the invention uses computer techniques to extract the data automatically or to assist the extraction.

S102: Extract the literature-related attributes of each aerosol document to form the literature-related attribute table.

When the historical literature is collected, the files are downloaded with a Python crawler. By analyzing the HTML structure of the target web pages, the crawler also records the related information alongside the PDF file, including the journal, DOI, title, authors, and affiliations, thereby producing the literature-related attribute table.

The relevant crawler code is as follows:
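The listing itself is not included in this text. The following is only a minimal sketch of the parsing half of such a crawler, under stated assumptions: it supposes that the target pages expose their metadata in `citation_*` meta tags (a convention used by many journal sites, but not confirmed for the pages crawled here), and the names `FIELDS`, `MetaParser`, `parse_article_page`, and `SAMPLE_HTML` are hypothetical. Downloading the page and the PDF is left out.

```python
from html.parser import HTMLParser

# Assumed meta-tag names; the real target pages and their HTML structure
# are not described in the text, so this mapping is purely illustrative.
FIELDS = {
    "citation_journal_title": "journal",
    "citation_doi": "doi",
    "citation_title": "title",
    "citation_author": "authors",
    "citation_author_institution": "affiliations",
    "citation_pdf_url": "pdf_url",
}


class MetaParser(HTMLParser):
    """Collects literature-related attributes from <meta> tags."""

    def __init__(self):
        super().__init__()
        self.record = {"authors": [], "affiliations": []}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attrs = dict(attrs)
        key = FIELDS.get(attrs.get("name", ""))
        value = attrs.get("content")
        if key is None or value is None:
            return
        if key in ("authors", "affiliations"):
            self.record[key].append(value)  # repeated tags become a list
        else:
            self.record[key] = value


def parse_article_page(html):
    """Return one row of the literature-related attribute table."""
    parser = MetaParser()
    parser.feed(html)
    return parser.record


# Made-up page used only to exercise the parser.
SAMPLE_HTML = """
<html><head>
<meta name="citation_title" content="Example aerosol lidar study">
<meta name="citation_journal_title" content="Example Journal">
<meta name="citation_doi" content="10.0000/example">
<meta name="citation_author" content="A. Author">
<meta name="citation_author_institution" content="Example University">
<meta name="citation_pdf_url" content="https://example.org/paper.pdf">
</head></html>
"""

record = parse_article_page(SAMPLE_HTML)
print(record)
```

Each returned dictionary corresponds to one row of the literature-related attribute table; a real crawler would adapt the tag mapping to the HTML structure of each journal site.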

S103: Perform PDF-to-TXT conversion, text preprocessing, and regular-expression template matching on each aerosol document to extract the text statistical information and form the text statistics table.

Specifically, for the extraction of text statistical information, the PDF files are converted to TXT with the Python library pdfminer, because TXT text is easier for a program to parse;

For example, a PDF is converted to TXT with pdfminer:
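The conversion listing is not included in this text. A minimal sketch, assuming the maintained `pdfminer.six` distribution and its `pdfminer.high_level.extract_text` function, could look as follows; the helper names `txt_path_for` and `pdf_to_txt` are hypothetical.

```python
from pathlib import Path


def txt_path_for(pdf_path):
    """Return the .txt path that sits next to the given PDF."""
    return Path(pdf_path).with_suffix(".txt")


def pdf_to_txt(pdf_path):
    """Convert one PDF to a UTF-8 TXT file and return the TXT path."""
    # Imported lazily so that txt_path_for works without pdfminer installed.
    from pdfminer.high_level import extract_text  # provided by pdfminer.six

    text = extract_text(str(pdf_path))
    out_path = txt_path_for(pdf_path)
    out_path.write_text(text, encoding="utf-8")
    return out_path
```

Calling `pdf_to_txt("papers/example.pdf")` would write `papers/example.txt`; scanned PDFs without a text layer would need OCR, which this sketch does not cover.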

Text preprocessing is implemented with the Python natural language processing library NLTK and consists of tokenization, stop-word removal, stemming, and named entity recognition.

For example, text preprocessing is carried out with the following procedure:

Regular-expression template matching comprises the following steps:

(a) Keyword positioning: string matching is used to search the text of the aerosol document and locate the position of the first character of each of the eight optical parameter names. The eight optical parameters are the lidar ratio, depolarization ratio, backscatter coefficient, extinction coefficient, optical thickness, spectral depolarization ratio, color ratio, and Angstrom exponent;

(b) Attribute entity recognition: the attributes of the schema include named entities such as addresses and optical parameters, and also regular-pattern entities such as time, wavelength, and parameter values. The two kinds are recognized by named entity recognition and regular-expression matching respectively: addresses are recognized with NLTK-based named entity recognition, while the time, place, wavelength, aerosol type, and optical parameter values are matched with regular expressions;

(c) Rule matching within a paragraph: each keyword is associated with the attribute entities by a nearest-within-paragraph rule, so that the structured information matched in the text is extracted automatically and displayed together with the paragraph content. The extraction results are then corrected and supplemented through manual reading and review.

For example, the regular expressions used for attribute entity recognition:
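The expressions themselves are not included in this text. The sketch below illustrates steps (a) to (c) with made-up patterns: the short keyword list, the `WAVELENGTH` and `VALUE` patterns, the `gap` distance, and the sample sentence are all assumptions chosen for illustration, and "nearest" is interpreted here as the smallest character gap between the keyword and a value inside the same paragraph.

```python
import re

# Illustrative keyword list and patterns; assumptions, not the patent's templates.
KEYWORDS = ["lidar ratio", "depolarization ratio", "extinction coefficient"]
WAVELENGTH = re.compile(r"\b(\d{3,4})\s?nm\b")
VALUE = re.compile(r"(\d+(?:\.\d+)?)(?:\s?±\s?(\d+(?:\.\d+)?))?\s?(sr|%)")


def gap(a, b):
    """Number of characters between two (start, end) spans, 0 if they overlap."""
    return max(a[0] - b[1], b[0] - a[1], 0)


def extract(paragraph):
    """Pair each optical-parameter keyword with the nearest value in the paragraph."""
    values = [(m.span(), m.groups()) for m in VALUE.finditer(paragraph)]
    wavelengths = [int(m.group(1)) for m in WAVELENGTH.finditer(paragraph)]
    records = []
    for keyword in KEYWORDS:
        # (a) keyword positioning by case-insensitive string matching
        for hit in re.finditer(re.escape(keyword), paragraph, re.IGNORECASE):
            if not values:
                continue
            # (c) nearest-within-paragraph association
            _, (mean, error, unit) = min(
                values, key=lambda v: gap(hit.span(), v[0]))
            records.append({
                "parameter": keyword,
                "value": float(mean),
                "uncertainty": float(error) if error else None,
                "unit": unit,
                "wavelengths_nm": wavelengths,
            })
    return records


TEXT = ("At 532 nm the lidar ratio of dust was 45 ± 5 sr, "
        "while the depolarization ratio reached 30 %.")
records = extract(TEXT)
for record in records:
    print(record)
```

A nearest-match rule of this kind can pair a keyword with the wrong number in dense sentences, which is consistent with the manual review that step (c) adds after automatic extraction.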

S104: Extract the data point coordinates from the data charts to obtain their specific values and form the data chart specific-value table.

The data charts are the scatter plots, line charts, and bar charts that present the data. Specifically, extracting the data point coordinates from a data chart includes:

S104-1: perform skew correction on the data chart and extract the data point region;

S104-2: extract the data points from the data point region, using a method specific to each type of chart;

S104-3: convert each data point into data point coordinates according to the preset coordinate attributes and coordinate thresholds.

The skew correction and data point region extraction specifically include:

Converting the data chart to a grayscale image by the mean value method, where the gray value of pixel (i, j) is calculated with formula (1):

Gray(i, j) = (R(i, j) + G(i, j) + B(i, j)) / 3    (1)

where R(i, j), G(i, j), and B(i, j) are the color values of the R, G, and B channels at pixel position (i, j);

Detecting the edges of the grayscale image with the Canny operator of OpenCV;

Detecting the set of straight lines among the edges with the Hough transform, and extracting the longest straight line L1;

Calculating the tilt angle of L1 and rotating the grayscale image by that angle to correct the skew;

Searching the image edges for the longest line L2 perpendicular to L1, taking the intersection of L1 and L2 as the origin and L1 and L2 as the boundaries, and extracting the data point region.
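The arithmetic in these steps can be sketched in plain Python. This is an illustration under assumptions, not the embodiment's code: the line segments are assumed to have already been produced by an edge detector and a Hough line detector and are given as hypothetical `(x1, y1, x2, y2)` tuples, and folding the angle into the range (-45, 45] is an added convention so that a nearly vertical axis also yields a small correction.

```python
import math


def to_gray(rgb_image):
    """Mean-value grayscale: average of the R, G, and B channels per pixel."""
    return [[(r + g + b) / 3 for (r, g, b) in row] for row in rgb_image]


def longest_segment(segments):
    """Pick the longest (x1, y1, x2, y2) segment, taken here as line L1."""
    return max(segments, key=lambda s: math.hypot(s[2] - s[0], s[3] - s[1]))


def tilt_angle_deg(segment):
    """Tilt of a segment relative to the nearest image axis, in degrees."""
    x1, y1, x2, y2 = segment
    angle = math.degrees(math.atan2(y2 - y1, x2 - x1))
    # Fold into (-45, 45] so horizontal and vertical axes are treated alike.
    while angle > 45:
        angle -= 90
    while angle <= -45:
        angle += 90
    return angle


gray = to_gray([[(30, 60, 90), (0, 0, 0)]])
l1 = longest_segment([(0, 0, 10, 0), (0, 0, 100, 2)])
tilt = tilt_angle_deg(l1)
print(gray, l1, round(tilt, 3))
```

The resulting angle would then be passed to an image rotation routine to straighten the chart before the data point region is cropped.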

To improve the extraction rate of the data point coordinates, different extraction methods are used for scatter plots, line charts, and bar charts when data points are extracted from the data point region, as follows:

When the data chart is a scatter plot, Hough circle detection is performed with the Hough gradient method to obtain the circular pixel points in the scatter plot; these circular pixel points are the data points;

The Hough circle detection method works on the points found by edge detection. The gradient of each edge point is computed with the Sobel function, and every point on the straight line through that edge point whose slope is given by the gradient is accumulated in a two-dimensional accumulator. Candidate centers are selected from the accumulator, and for each candidate all non-zero pixels are considered. If a candidate center is sufficiently supported by the non-zero pixels of the edge image and lies far enough from the centers already selected, it is taken as the center of a detected circle.

For example, Fig. 2 (a) is an original scatter plot; after Hough circle detection, the scatter point recognition result shown in Fig. 2 (b) is obtained.

When the data chart is a line chart, all pixels in the pixel list of each row are traversed from top to bottom along the height, and the median of the pixels in each row is taken as the data point;

Fig. 3 (a) is an original line chart; line recognition on Fig. 3 (a) gives the result shown in Fig. 3 (b). Comparing the curve fitted to the extraction results with the original shows that traversing the data points and taking the median of the valid data points works well and extracts the data in the line chart accurately.
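As a small illustration of the row-wise median rule for line charts, the sketch below assumes the curve has already been segmented into a binary mask (1 for curve pixels, 0 for background); the mask and the function name `line_chart_points` are hypothetical.

```python
from statistics import median


def line_chart_points(mask):
    """Return (row, column) data points, one per row that contains curve pixels.

    mask is a list of rows, each a list of 0/1 values, scanned top to bottom.
    """
    points = []
    for row_index, row in enumerate(mask):
        columns = [col for col, value in enumerate(row) if value]
        if columns:
            # The median column stands in for a stroke several pixels wide.
            points.append((row_index, median(columns)))
    return points


MASK = [
    [0, 0, 1, 1, 1],
    [0, 1, 1, 0, 0],
    [0, 0, 0, 0, 0],
    [1, 0, 0, 0, 0],
]
points = line_chart_points(MASK)
print(points)
```

Taking the median rather than the first or last pixel keeps the extracted point near the center of the stroke; rows without curve pixels contribute no data point.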

When the data chart is a bar chart, horizontal lines are identified by edge detection; when the area above a horizontal line is blank, the middle pixel of that line is taken as the data point.

For example, Fig. 4 (a) is an original bar chart; bar recognition on Fig. 4 (a) gives the result shown in Fig. 4 (b).
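The bar-top rule can be sketched on a binary mask as follows. This is one possible reading of the rule, with assumptions: bars are filled regions marked 1, a "horizontal line" is a run of filled pixels whose pixels directly above are all blank, and the hypothetical `min_width` parameter discards one-pixel runs such as gaps along the axis.

```python
def bar_tops(mask, min_width=2):
    """Return (row, column) midpoints of horizontal runs with blank space above."""
    points = []
    for r in range(1, len(mask)):
        row, above = mask[r], mask[r - 1]
        c = 0
        while c < len(row):
            if row[c] and not above[c]:
                start = c
                while c < len(row) and row[c] and not above[c]:
                    c += 1
                if c - start >= min_width:
                    # Midpoint of the run is taken as the data point.
                    points.append((r, (start + c - 1) / 2))
            else:
                c += 1
    return points


MASK = [
    [0, 0, 0, 0, 0, 0, 0],
    [0, 1, 1, 0, 0, 0, 0],
    [0, 1, 1, 0, 1, 1, 0],
    [0, 1, 1, 0, 1, 1, 0],
    [1, 1, 1, 1, 1, 1, 1],  # axis line
]
tops = bar_tops(MASK)
print(tops)
```

In this toy mask the two bars yield one data point each, while the axis row is ignored because its blank-topped runs are only one pixel wide.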

The conversion of each data point into data point coordinates according to the preset coordinate attributes and coordinate thresholds includes:

Converting the data points into data point coordinates with formula (2) and formula (3):

X = X1 + (X2 - X1) * b / width    (2)

Y = Y1 + (Y2 - Y1) * (height - a + 1) / height    (3)

where X1 and X2 are the preset minimum and maximum of the horizontal axis, Y1 and Y2 are the preset minimum and maximum of the vertical axis, (a, b) is the pixel coordinate of the data point, with a counted along the height and b along the width as the formulas imply, and width and height are the width and height of the data chart.
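The two conversion formulas translate directly into code. In this sketch the function name `pixel_to_data` and the example axis ranges are hypothetical; `a` is the pixel row counted from the top and `b` the pixel column, as in the text.

```python
def pixel_to_data(a, b, x_range, y_range, width, height):
    """Map pixel (a, b) to data coordinates using the preset axis ranges."""
    x1, x2 = x_range
    y1, y2 = y_range
    x = x1 + (x2 - x1) * b / width
    # Rows grow downward in an image, so the vertical axis is flipped.
    y = y1 + (y2 - y1) * (height - a + 1) / height
    return x, y


# Example: a 200 x 100 pixel plot area with x in [0, 10] and y in [0, 50].
point = pixel_to_data(a=51, b=100, x_range=(0, 10), y_range=(0, 50),
                      width=200, height=100)
print(point)
```

Here the pixel in row 51, column 100 maps to the middle of both axes, which is a quick sanity check on the orientation of the vertical axis.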

When a scatter plot or line chart contains several classes of data, the color value of each class is first determined by clustering. Specifically: edge analysis is performed on the image to extract the scatter points of the scatter plot or the lines of the line chart; for scatter points, the pixel set of each point is obtained by traversing the circular region defined by its detected center and radius; the pixel values of the scatter points or lines are clustered with the K-means method; the pixel values of the k cluster centers are recorded, and the pixel value of every pixel in a cluster is replaced by that of its cluster center. This yields the color value of each data class, and the cluster center colors are displayed as a color bar. Each class of data is then modified according to the color value of its class;

When the color value of a data class is received from the user, the data point coordinates of that class can be extracted and classified automatically.
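The color clustering step can be illustrated with a small pure-Python K-means over RGB tuples. This is a sketch under assumptions: the text does not say how the cluster centers are initialized, so a deterministic farthest-point initialization is used here, and the names `dist2`, `kmeans_colors`, and the sample pixels are hypothetical.

```python
def dist2(p, q):
    """Squared Euclidean distance between two RGB tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans_colors(pixels, k, iterations=20):
    """Cluster RGB pixels into k colors; return (centers, labels)."""
    # Farthest-point initialization (an assumption, chosen for determinism).
    centers = [pixels[0]]
    while len(centers) < k:
        centers.append(
            max(pixels, key=lambda p: min(dist2(p, c) for c in centers)))
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for p in pixels:
            nearest = min(range(k), key=lambda i: dist2(p, centers[i]))
            clusters[nearest].append(p)
        new_centers = [
            tuple(sum(ch) / len(cl) for ch in zip(*cl)) if cl else centers[i]
            for i, cl in enumerate(clusters)
        ]
        if new_centers == centers:
            break
        centers = new_centers
    labels = [min(range(k), key=lambda i: dist2(p, centers[i])) for p in pixels]
    return centers, labels


PIXELS = [(250, 10, 10), (240, 20, 20), (10, 10, 250), (20, 20, 240)]
centers, labels = kmeans_colors(PIXELS, k=2)
# Replace every pixel by the color of its cluster center.
quantized = [centers[label] for label in labels]
print(centers, labels)
```

The list `centers` plays the role of the color bar shown to the user, and filtering pixels by `labels` gives the data points of one class.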

S105: Establish index relationships among the literature-related attribute table, the text statistics table, and the data chart specific-value table, and store the corresponding literature-related attributes, text statistical information, and data chart specific values to form the formatted aerosol literature database.

Specifically, primary key ids and foreign key indexes are added when the data are stored in the formatted aerosol literature database, according to the following rules:

The literature attribute table gains the primary key paper_id and the index attributes text_data_ids and figure_numbers. Both index attributes are arrays; for example, a figure_numbers value of [2,5,6] means that Fig. 2, Fig. 5, and Fig. 6 of the document contain aerosol data;

The text attribute table gains the primary key text_data_id and the index attributes paper_id and figure_number. Both index values are of type int: paper_id records the identifier of the document that the text data come from, and figure_number records the figure that supports or further illustrates the text statistic;

The image attribute table gains the primary key image_data_id and the index attributes paper_id and figure_number.

Therefore, to obtain the data of figure i of a given document, it is only necessary to filter the image attribute table by paper_id = this_paper_id and figure_number = i.
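The index rules can be tried out with an in-memory SQLite database. This is a sketch, not the embodiment's storage layer: the database engine, the table names `paper`, `text_data`, and `image_data`, the extra columns (`title`, `parameter`, `value`, `x`, `y`), and the choice of storing the array attributes as JSON text are all assumptions; only the key and index attribute names come from the text.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE paper (
    paper_id       INTEGER PRIMARY KEY,
    title          TEXT,
    text_data_ids  TEXT,   -- JSON array of text_data_id values
    figure_numbers TEXT    -- JSON array, e.g. [2, 5, 6]
);
CREATE TABLE text_data (
    text_data_id  INTEGER PRIMARY KEY,
    paper_id      INTEGER REFERENCES paper(paper_id),
    figure_number INTEGER,
    parameter     TEXT,
    value         REAL
);
CREATE TABLE image_data (
    image_data_id INTEGER PRIMARY KEY,
    paper_id      INTEGER REFERENCES paper(paper_id),
    figure_number INTEGER,
    x             REAL,
    y             REAL
);
""")

# Made-up rows for one document with data in Fig. 2, Fig. 5, and Fig. 6.
conn.execute("INSERT INTO paper VALUES (?, ?, ?, ?)",
             (1, "Example paper", json.dumps([10]), json.dumps([2, 5, 6])))
conn.execute("INSERT INTO text_data VALUES (?, ?, ?, ?, ?)",
             (10, 1, 2, "lidar ratio", 45.0))
conn.executemany("INSERT INTO image_data VALUES (?, ?, ?, ?, ?)",
                 [(100, 1, 2, 1.0, 2.0),
                  (101, 1, 2, 3.0, 4.0),
                  (102, 1, 5, 9.0, 9.0)])


def figure_points(paper_id, figure_number):
    """Data points of one figure: filter by paper_id and figure_number."""
    rows = conn.execute(
        "SELECT x, y FROM image_data "
        "WHERE paper_id = ? AND figure_number = ? ORDER BY image_data_id",
        (paper_id, figure_number))
    return rows.fetchall()


points = figure_points(1, 2)
figures = json.loads(conn.execute(
    "SELECT figure_numbers FROM paper WHERE paper_id = 1").fetchone()[0])
print(points, figures)
```

The same pair of columns links a text statistic to the figure that supports it, so a join of `text_data` and `image_data` on `paper_id` and `figure_number` returns a statistic together with its underlying data points.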

In addition to the construction method above, the method further includes providing a crowdsourcing platform and knowledge services. The knowledge services encourage users to submit new data voluntarily on the crowdsourcing platform, so that the database can be extended sustainably.

The data statistical analysis offered by the knowledge services summarizes knowledge for scholars in the field more intelligently and attracts users to upload new data voluntarily, which sustains the extension of the database.

Specifically, the crowdsourcing platform is shown in Fig. 5. Under unified format constraints, users add new data by entering literature data and text statistical data and by uploading pictures.

The specific embodiments above describe the technical solution and beneficial effects of the invention in detail. They are only the preferred embodiments of the invention and are not intended to limit it; any modification, supplement, or equivalent replacement made within the principles of the invention shall fall within its scope of protection.

Claims

1. A method for constructing a formatted aerosol literature database, comprising the following steps:

(1) collecting aerosol literature from around the world, and determining the literature-related attributes, text statistical information attributes, and data chart specific-value attributes of the aerosol literature data;

(2) extracting the literature-related attributes of each aerosol document to form a literature-related attribute table;

(3) performing PDF-to-TXT conversion, text preprocessing, and regular-expression template matching on each aerosol document to extract the text statistical information and form a text statistics table;

(4) extracting data point coordinates from the data charts to obtain their specific values and form a data chart specific-value table;

(5) establishing index relationships among the literature-related attribute table, the text statistics table, and the data chart specific-value table, and storing the corresponding literature-related attributes, text statistical information, and data chart specific values to form the formatted aerosol literature database.

2. The method for constructing a formatted aerosol literature database according to claim 1, characterized in that the literature-related attributes, text statistical information attributes, and data chart specific-value attributes include:

3. The method for constructing a formatted aerosol literature database according to claim 1, characterized in that, in step (3), the PDF files are converted to TXT with the Python library pdfminer;

text preprocessing is implemented with the Python natural language processing library NLTK and consists of tokenization, stop-word removal, stemming, and named entity recognition;

regular-expression template matching comprises the following steps:

(a) keyword positioning: string matching is used to search the text of the aerosol document and locate the position of the first character of each of the eight optical parameter names, the eight optical parameters being the lidar ratio, depolarization ratio, backscatter coefficient, extinction coefficient, optical thickness, spectral depolarization ratio, color ratio, and Angstrom exponent;

(b) attribute entity recognition: addresses are recognized with NLTK-based named entity recognition, and the time, place, wavelength, aerosol type, and optical parameter values are matched with regular expressions;

(c) rule matching within a paragraph: each keyword is associated with the attribute entities by a nearest-within-paragraph rule, so that the structured information matched in the text is extracted automatically and displayed together with the paragraph content; the extraction results are then corrected and supplemented through manual reading and review.

4. The method for constructing a formatted aerosol literature database according to claim 1, characterized in that step (4) specifically includes:

(4-1) performing skew correction on the data chart and extracting the data point region;

(4-2) extracting the data points from the data point region, using a method specific to each type of chart;

(4-3) converting each data point into data point coordinates according to the preset coordinate attributes and coordinate thresholds.

5. The method for constructing a formatted aerosol literature database according to claim 4, characterized in that step (4-1) specifically includes:

converting the data chart to a grayscale image by the mean value method;

calculating the tilt angle of the longest straight line L1 and rotating the grayscale image by that angle to correct the skew;

searching the image edges for the longest line L2 perpendicular to L1, taking the intersection of L1 and L2 as the origin and L1 and L2 as the boundaries, and extracting the data point region.

6. The method for constructing a formatted aerosol literature database according to claim 5, characterized in that, when the data chart is a scatter plot, Hough circle detection is performed with the Hough gradient method to obtain the circular pixel points in the scatter plot, the circular pixel points being the data points;

when the data chart is a line chart, all pixels in the pixel list of each row are traversed from top to bottom along the height, and the median of the pixels in each row is taken as the data point;

when the data chart is a bar chart, horizontal lines are identified by edge detection, and when the area above a horizontal line is blank, the middle pixel of that line is taken as the data point.

7. The method for constructing a formatted aerosol literature database according to claim 4, characterized in that step (4-3) specifically includes:

X = X1 + (X2 - X1) * b / width    (1)

Y = Y1 + (Y2 - Y1) * (height - a + 1) / height    (2)

8. The method for constructing a formatted aerosol literature database according to claim 4, characterized in that, in step (4), when a scatter plot or line chart contains several classes of data, the color value of each class is first determined by clustering, specifically: edge analysis is performed on the image to extract the scatter points of the scatter plot or the lines of the line chart; for scatter points, the pixel set of each point is obtained by traversing the circular region defined by its detected center and radius; the pixel values of the scatter points or lines are clustered with the K-means method; the pixel values of the k cluster centers are recorded, and the pixel value of every pixel in a cluster is replaced by that of its cluster center, which yields the color value of each data class, the cluster center colors being displayed as a color bar; each class of data is then modified according to the color value of its class;

when the color value of a data class is received from the user, the data point coordinates of that class can be extracted and classified automatically.

9. The method for constructing a formatted aerosol literature database according to any one of claims 1 to 8, characterized in that the method further includes providing a crowdsourcing platform and knowledge services,

through the crowdsourcing platform, users enter literature data and text statistical data and upload images, thereby expanding the aerosol literature data;