CN111581928A - System and method for automatically constructing scientific and technological text analysis report with zero participation of user - Google Patents

System and method for automatically constructing scientific and technological text analysis report with zero participation of user

Info

Publication number
CN111581928A
Authority
CN
China
Prior art keywords
data
text
scientific
graph
report
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010362633.2A
Other languages
Chinese (zh)
Other versions
CN111581928B (en)
Inventor
汪雪锋
刘玉琴
刘佳
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Institute of Technology BIT
Original Assignee
Beijing Institute of Technology BIT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Institute of Technology BIT filed Critical Beijing Institute of Technology BIT
Priority to CN202010362633.2A priority Critical patent/CN111581928B/en
Publication of CN111581928A publication Critical patent/CN111581928A/en
Application granted granted Critical
Publication of CN111581928B publication Critical patent/CN111581928B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/166 Editing, e.g. inserting or deleting
    • G06F40/177 Editing, e.g. inserting or deleting of tables; using ruled lines
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/28 Databases characterised by their database models, e.g. relational or object models
    • G06F16/284 Relational databases

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a system and a method for automatically constructing a scientific and technological text analysis report with zero participation of a user. By using the method and the system, the user can obtain the interpreted scientific and technological text analysis report by using the data of the scientific and technological text service provider without interacting with the analysis software and carrying out the operation related to the analysis software.

Description

System and method for automatically constructing scientific and technological text analysis report with zero participation of user
Technical Field
The invention relates to the field of computer-aided scientific and technical manuscript writing, in particular to a system and a method for automatically constructing a scientific and technical text analysis report with zero participation of a user.
Background
Science and technology are developing rapidly, research is becoming more difficult, disciplines increasingly intersect, and researchers both cooperate and compete. As a result, scientific and technological text resources (such as research papers, patents and scientific and technological reports) are growing explosively. This poses a new challenge for scientific and technological data analysts: how to quickly extract valuable information from a large and constantly changing body of such texts and respond as early as possible. With the rapid development of information technology, some scientific and technological resource providers offer software tools for analyzing and using their texts. These tools usually present all or part of the analysis results to the user as an analysis report, so that the user can quickly grasp the deeper information contained in the texts.
Such analysis reports are constructed in two broad ways:
The first is user-active. The scientific and technological resource service provider supplies scientific and technological text data and accompanying software tools that offer fixed analysis methods; the user selects an analysis method and data to produce an analysis result, and then assembles several analysis results into an analysis report.
The second is user-passive. The scientific and technological resource service provider supplies scientific and technological text data and accompanying software tools; the tools build statistical tables and statistical graphs from fixed templates and, together with brief text describing the data in those tables and graphs, present everything the tools can analyze to the user at once.
The disadvantage of an analysis report constructed in the first way is that the user must interact continuously with the software tool provided by the service provider and must be familiar with that tool, which costs learning time and operating time.
The disadvantage of an analysis report constructed in the second way is that each statistical table and graph lacks in-depth text interpretation: the text only states what the table or graph contains. This is especially true of complex analysis graphs, such as a synthetic relation graph between organizations or an association graph between technical topics, which often consist of nothing more than the graph and its title.
With both ways, the user still has to interpret the results a second time, write the accompanying text, and organize and typeset the report, which is time-consuming and laborious and yields reports of insufficient depth.
Disclosure of Invention
In view of the defects of the prior art, the invention aims to provide a system and a method for automatically constructing a scientific and technological text analysis report with zero participation of a user, with which the user can obtain an interpreted scientific and technological text analysis report from the data of a scientific and technological text service provider without interacting with analysis software or performing any operation related to the analysis software.
In order to achieve the purpose, the invention adopts the following technical scheme:
the invention provides a system for automatically constructing a scientific and technological text analysis report with zero participation of a user, which comprises a field mapper, a binary data structure and a corresponding binary analysis result storage file, a data interpreter, a graph drawing device, a report structure organizer and a report writer;
the field mapper is used for structurally reorganizing the scientific and technological text to be analyzed: the field mapper reorganizes the text according to the following 14 fields: serial number, author, organization, country, province, time, category 1, category 2, publication, subsidy project, keyword, title, abstract and full text;
a binary data structure for storing the analysis results and a corresponding binary analysis result storage file; the binary data structure for storing the analysis results includes:
11 field data structures: used for storing the number of scientific and technological texts corresponding to each author, organization, country, province, time, category 1, category 2, publication, subsidy project, keyword and subject word; these 11 fields are defined as the basic dimensions, and the content stored in the 11 field data structures is defined as the one-dimensional statistical result;
11 × 11 data table structures: used for storing the number of scientific and technological texts for each pairwise combination of the 11 basic dimensions; the content stored in the 11 × 11 data table structures is defined as the two-dimensional statistical result;
8 graph data structures: each graph data structure consists of nodes and connecting lines, and the 8 structures respectively store the co-occurrence network graphs of the 8 dimensions author, organization, country, province, category 1, category 2, keyword and subject word; in the co-occurrence network graph G(V, E) of each dimension, the node set V stores the set of values of that dimension as graph nodes, and E stores, as graph connecting lines, the number of times two values of that dimension appear together in the same scientific and technological text;
a data interpreter: the data interpreter is used for reading in an analysis result, automatically interpreting the analysis result and then outputting a segment of text description;
the one-dimensional statistical result is table-type share data; the data interpreter interprets it by quantity and proportion and outputs a text description of the top N entries ranked by quantity and proportion;
the two-dimensional statistical result that includes the time dimension is table-type trend data; the data interpreter interprets it by the overall trend, the maximum value, the minimum value and the growth rate of the data and outputs a text description;
the data interpreter interprets the co-occurrence network graph according to the strength of the relationships and outputs a text description;
a graph drawing device: the graph drawing device is used for reading in an analysis result and drawing a graph;
for the table type share data, a column diagram is adopted for drawing graphs;
for the table type trend data, adopting a line graph to carry out graphic drawing;
for the co-occurrence network graph, the graph is drawn with labelled spherical nodes and connecting lines of varying thickness;
report structure organizer: the report structure organizer is used for defining and organizing the content and structure of the analysis report to be output; descriptors are defined in the report structure organizer, and the descriptors organize the content and structure of the analysis report and specify the binary data structure data to be linked into the report;
the report writer: the report writer writes the analysis report according to the descriptors of the report structure organizer; when it encounters a descriptor, it retrieves the required binary data structure data and outputs the data as the descriptor specifies.
Further, in the above system, a subject word is a phrase obtained by computer word segmentation of the title, abstract and full text of the scientific and technological text.
Further, in the above system, the text description output by the data interpreter for table-type share data is "The top N {0} are {1} {2} {3} {4} {5}, the data amounts are {6} {7} {8} {9} {10}, and the quantity ratios are {11} {12} {13} {14} {15}, respectively", where {0} is any one of the basic dimensions, {1} - {5} are the top-ranked entries of that dimension, {6} - {10} are the corresponding numbers of scientific and technological texts, and {11} - {15} are the corresponding proportions of the number of scientific and technological texts.
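The share template above can be illustrated with a minimal Python sketch. It is not the patent's implementation: the function name `interpret_share`, the input format (a mapping from entry to text count) and the choice of the sum of all counts as the denominator of the ratio are assumptions made only for this example.

```python
from collections import Counter

def interpret_share(dimension, counts, n=5):
    """Describe the top-n entries of a one-dimensional statistical result.

    counts maps each entry of the dimension to its number of texts.
    Assumption: a ratio is the entry's count divided by the sum of all counts.
    """
    total = sum(counts.values())
    # most_common sorts by count in descending order.
    top = Counter(counts).most_common(n)
    names = ", ".join(name for name, _ in top)
    amounts = ", ".join(str(c) for _, c in top)
    ratios = ", ".join(f"{c / total:.1%}" for _, c in top)
    return (f"The top {len(top)} {dimension} entries are {names}; "
            f"the data amounts are {amounts}; "
            f"the quantity ratios are {ratios}.")

print(interpret_share("country", {"China": 50, "USA": 30, "Germany": 20}, n=2))
```

The filled-in sentence follows the order of the template: entries first, then amounts, then ratios.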
Further, in the above system, the text description output by the data interpreter for table-type trend data is "The overall trend is increasing/decreasing/not obvious; the maximum was reached in XXXX, with a number of X; the minimum was in XXXX, with a number of X; the years with more significant growth include XXXX, YYYY and ZZZZ"; the overall trend is judged by calculating the slopes of the different time periods: if more slopes are positive than negative, the overall trend is increasing; if fewer are positive than negative, it is decreasing; and if the numbers are equal, the trend is not obvious; the maximum value and the minimum value are found by pairwise comparison of the values; the years with more significant growth are found by ranking the years by growth rate: the years whose growth rate is positive and which rank in the top three are the years with more significant growth.
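The trend rules above can be sketched in Python as follows. This is an illustration under stated assumptions, not the patent's code: the name `interpret_trend` is hypothetical, each "time period" is taken to be a pair of consecutive years, and the growth rate is taken to be the year-over-year relative change.

```python
def interpret_trend(series):
    """Interpret table-type trend data given as {year: number of texts}."""
    years = sorted(series)
    values = [series[y] for y in years]
    # Slope of each period between consecutive years (assumed period length).
    slopes = [b - a for a, b in zip(values, values[1:])]
    positive = sum(s > 0 for s in slopes)
    negative = sum(s < 0 for s in slopes)
    if positive > negative:
        trend = "increasing"
    elif positive < negative:
        trend = "decreasing"
    else:
        trend = "not obvious"
    max_year = max(years, key=lambda y: series[y])
    min_year = min(years, key=lambda y: series[y])
    # Year-over-year growth rate; years whose previous count is 0 are skipped.
    rates = {y: (series[y] - series[p]) / series[p]
             for p, y in zip(years, years[1:]) if series[p] > 0}
    # Keep only positive growth rates and take the three largest.
    growth_years = sorted((y for y, r in rates.items() if r > 0),
                          key=lambda y: rates[y], reverse=True)[:3]
    text = (f"The overall trend is {trend}; the maximum was reached in {max_year}, "
            f"with a number of {series[max_year]}; the minimum was in {min_year}, "
            f"with a number of {series[min_year]}; the years with more significant "
            f"growth include {', '.join(str(y) for y in growth_years)}.")
    return {"trend": trend, "max_year": max_year, "min_year": min_year,
            "growth_years": growth_years, "text": text}

result = interpret_trend({2015: 10, 2016: 20, 2017: 15, 2018: 24, 2019: 30})
print(result["text"])
```

The function expects at least one year of data; an empty series would need separate handling.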
Further, in the above system, the text description output by the data interpreter for the co-occurrence network graph is "{0} is mainly divided into the following groups: X1, X2, X3 …; Y1, Y2, Y3 …; Z1, Z2, Z3 …, among which the groups with stronger relationships are groups i, j, k …"; where {0} is one of the 8 dimensions author, organization, country, province, category 1, category 2, keyword and subject word; the groups are obtained by dividing the nodes with K-means clustering, and a group whose relationship strength is greater than the median is a group with a stronger relationship.
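The median rule for picking the stronger groups can be sketched as below. The grouping itself (K-means in the text above) is taken as given input here. The text does not define how the relationship strength of a group is measured, so this sketch assumes it is the total weight of the connecting lines inside the group; the name `strong_groups` is likewise hypothetical.

```python
from statistics import median

def strong_groups(edges, groups):
    """Return (ids of groups above the median strength, strength per group).

    edges:  {(node_a, node_b): co-occurrence count}
    groups: {group_id: set of nodes}, e.g. the result of a clustering step.
    Assumption: a group's strength is the sum of the weights of its internal edges.
    """
    node_group = {n: g for g, members in groups.items() for n in members}
    strength = {g: 0 for g in groups}
    for (a, b), weight in edges.items():
        ga = node_group.get(a)
        # Only edges whose two ends lie in the same group count.
        if ga is not None and ga == node_group.get(b):
            strength[ga] += weight
    cutoff = median(strength.values())
    return sorted(g for g, s in strength.items() if s > cutoff), strength

edges = {("A", "B"): 5, ("B", "C"): 3, ("C", "D"): 4, ("D", "E"): 1, ("F", "G"): 2}
groups = {1: {"A", "B", "C"}, 2: {"D", "E"}, 3: {"F", "G"}}
print(strong_groups(edges, groups))
```

Other definitions of strength, such as the average edge weight per group, would fit the same median comparison.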
Further, in the above system, the descriptors are classified into 7 categories, which specifically include:
1) parameter descriptors, the basic format is:
param|id=;data=;where=;type=;
indicating that the analysis report should output parameters including number, binary data structure data, field and type;
2) paragraph descriptors, the basic format is:
paragraph|level=;linesbefore=;linespace=;charactersbefore=;fontsize=;fontfamily=;italic=;bold=;align=;content=;
indicating that the analysis report should output a paragraph of text here, comprising the paragraph text given by content and its format settings: outline level, spacing before and after the paragraph, line spacing, font size, italics, bold and alignment;
3) the table descriptor, the basic format is:
tablestatic|name=;row=;column=;style=;data=;
indicating that the analysis report should output a table including table name, row number, column number, style, binary data structure data;
4) the dynamic graph descriptor, the basic format is:
figuredynamic|name=;data=;param=;save=;
indicating that the analysis report should output a non-network graph here, including the graph name, the corresponding binary data structure data, the parameters and the temporary storage path;
5) the network diagram descriptor has the following basic format:
figurememory|name=;func=;params=;save=;
indicating that the analysis report should output a co-occurrence network graph here, including its name, the data interpreter used in step S4, the parameters used for drawing and the temporary storage path;
6) the transverse and longitudinal typesetting descriptor has the basic format:
segmentpage|orientation=;
indicating the paper direction of the analysis report in the current page layout;
7) directory descriptor, basic format is:
content|type=;
indicating that the analysis report should output a directory here, the type of directory being a full-text directory, a graph directory and/or a table directory.
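All seven descriptor formats above share the shape "type|key=value;key=value;", so a single small parser can read them. The following Python sketch assumes exactly that shape; the function name `parse_descriptor` and the sample values are hypothetical, and values that themselves contain ";" are not handled.

```python
def parse_descriptor(line):
    """Split a descriptor line into (descriptor type, {key: value})."""
    kind, _, body = line.strip().partition("|")
    fields = {}
    for item in body.split(";"):
        if not item.strip():
            continue  # skip the empty piece after the trailing ';'
        # Split on the first '=' only, so values may contain '='.
        key, _, value = item.partition("=")
        fields[key.strip()] = value.strip()
    return kind.strip(), fields

kind, fields = parse_descriptor(
    "tablestatic|name=Top authors;row=5;column=3;style=grid;data=author;")
print(kind, fields)
```

All values are returned as strings; converting row and column counts to integers is left to the caller.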
The invention also provides a method for automatically constructing the scientific and technical text analysis report by using the system, which comprises the following steps:
S1, the user searches a scientific and technological text database and inputs the retrieved scientific and technological text to be analyzed into the system for automatically constructing a scientific and technological text analysis report with zero participation of a user;
S2, the field mapper structurally reorganizes the scientific and technological text to be analyzed:
the field mapper reorganizes the text according to serial number, author, organization, country, province, time, category 1, category 2, publication, subsidy project, keyword, title, abstract and full text;
S3, according to the field data structures in the binary data structure, counting, in the reorganized text, the number of scientific and technological texts corresponding to each of the 11 basic dimensions author, organization, country, province, time, category 1, category 2, publication, subsidy project, keyword and subject word, to obtain the one-dimensional statistical result;
according to the data table structures in the binary data structure, counting, in the reorganized text, the number of scientific and technological texts for each pairwise combination of the 11 basic dimensions author, organization, country, province, time, category 1, category 2, publication, subsidy project, keyword and subject word, to obtain the two-dimensional statistical result;
according to the graph data structures in the binary data structure, building, from the reorganized text, the co-occurrence network graphs of the 8 dimensions author, organization, country, province, category 1, category 2, keyword and subject word;
storing the one-dimensional statistical result, the two-dimensional statistical result and the co-occurrence network graphs in the binary analysis result storage file;
S4, the graph drawing device reads in the one-dimensional statistical result, the two-dimensional statistical result and the co-occurrence network graphs obtained in step S3 and draws the graphs, wherein:
the one-dimensional statistical result is table-type share data, which is drawn as a bar chart and stored in a temporary directory;
the two-dimensional statistical result that includes the time dimension is table-type trend data, which is drawn as a line chart and stored in a temporary directory;
the co-occurrence network graph G(V, E) is drawn with labelled spherical nodes and connecting lines of varying thickness and stored in a temporary directory;
in addition, the data interpreter reads in the one-dimensional statistical result, the two-dimensional statistical result and the co-occurrence network graphs obtained in step S3, interprets them automatically and outputs a passage of text description;
the one-dimensional statistical result is table-type share data; the data interpreter interprets it by quantity and proportion and outputs a text description of the top N entries ranked by quantity and proportion;
the two-dimensional statistical result that includes the time dimension is table-type trend data; the data interpreter interprets it by the overall trend, the maximum value, the minimum value and the growth rate of the data and outputs a text description;
the data interpreter interprets the co-occurrence network graph according to the strength of the relationships and outputs a text description;
S5, the report structure organizer defines and organizes the content and structure of the analysis report to be output, wherein the descriptors defined in the report structure organizer organize the content and structure of the analysis report;
the report writer writes the analysis report according to the descriptors of the report structure organizer; when it encounters a descriptor, it retrieves the corresponding binary data structure data and outputs the data as the descriptor specifies, finally generating the required scientific and technological text analysis report.
Further, in step S4 of the method, the data interpreter interprets the data as follows:
the one-dimensional statistical result is table-type share data, which is interpreted by quantity and proportion; the text description output for the top N entries is "The top N {0} are {1} {2} {3} {4} {5}, the data amounts are {6} {7} {8} {9} {10}, and the quantity ratios are {11} {12} {13} {14} {15}, respectively", where {0} is any one of the basic dimensions, {1} - {5} are the top-ranked entries of that dimension, {6} - {10} are the corresponding numbers of scientific and technological texts, and {11} - {15} are the corresponding proportions of the number of scientific and technological texts;
the two-dimensional statistical result that includes the time dimension is table-type trend data; the data interpreter interprets it by the overall trend, the maximum value, the minimum value and the growth rate of the data, and the output text description is "The overall trend is increasing/decreasing/not obvious; the maximum was reached in XXXX, with a number of X; the minimum was in XXXX, with a number of X; the years with more significant growth include XXXX, YYYY and ZZZZ"; the overall trend is judged by calculating the slopes of the different time periods: if more slopes are positive than negative, the overall trend is increasing; if fewer are positive than negative, it is decreasing; and if the numbers are equal, the trend is not obvious; the maximum value and the minimum value are found by pairwise comparison of the values; the years with more significant growth are found by ranking the years by growth rate: the years whose growth rate is positive and which rank in the top three are the years with more significant growth;
the text description output for the co-occurrence network graph is "{0} is mainly divided into the following groups: X1, X2, X3 …; Y1, Y2, Y3 …; Z1, Z2, Z3 …, among which the groups with stronger relationships are groups i, j, k …"; where {0} is one of the 8 dimensions author, organization, country, province, category 1, category 2, keyword and subject word; the groups are obtained by dividing the nodes with K-means clustering, and a group whose relationship strength is greater than the median is a group with a stronger relationship.
Further, in step S5 of the above method, in the report structure organizer:
the parameter descriptor indicates that the analysis report should output parameters including number, binary data structure data, field and type;
the paragraph descriptor indicates that the analysis report should output a paragraph of text here, comprising the paragraph text and its format settings: outline level, spacing before and after the paragraph, line spacing, font size, italics, bold and alignment;
the table descriptor indicates that the analysis report should output a table, including table name, row number, column number, style, and binary data structure data;
the dynamic graph descriptor indicates that the analysis report should output a non-network graph comprising a graph name, data in a binary data structure, a temporary storage path and parameters;
the network map descriptor indicates that the analysis report should output a co-occurrence network map, including name, data interpreter, parameters used for drawing and temporarily stored path;
the horizontal and vertical typesetting descriptors indicate the paper direction of the analysis report in the current page typesetting;
the directory descriptor specifies that the analysis report should output a directory therein, the type of directory being a full text directory, a graph directory and a table directory.
Drawings
Fig. 1 is a schematic flow chart of embodiment 2 of the present invention.
Detailed Description
The present invention will be further described with reference to the accompanying drawings, wherein the embodiments are based on the technical solution, and detailed embodiments and specific operation procedures are provided, but the protection scope of the present invention is not limited to the embodiments.
Example 1
The embodiment provides a scientific and technical text analysis report automatic construction system with zero participation of a user, which comprises:
the field mapper is used for structurally reorganizing the scientific and technological text to be analyzed: the field mapper reorganizes the text according to the following 14 fields: serial number, author, organization, country, province, time, category 1, category 2, publication, subsidy project, keyword, title, abstract and full text; the field mapper is as follows:
[Figures BDA0002475630740000111 and BDA0002475630740000121: field mapper definition, images not reproduced in this text]
The field mapper structurally reorganizes the scientific and technological text to be analyzed so as to standardize the text data and facilitate analysis.
A binary data structure for storing the analysis results and a corresponding binary analysis result storage file:
the binary data structure for storing the analysis results includes:
11 field data structures: used for storing the number of scientific and technological texts corresponding to each author, organization, country, province, time, category 1, category 2, publication, subsidy project, keyword and subject word; these 11 fields are defined as the basic dimensions, wherein a subject word is a phrase obtained by computer word segmentation of the title, abstract and full text; the content stored in the 11 field data structures is defined as the one-dimensional statistical result;
11 × 11 data table structures: used for storing the number of scientific and technological texts for each pairwise combination of the 11 basic dimensions; the content stored in the 11 × 11 data table structures is defined as the two-dimensional statistical result;
8 graph data structures: each graph data structure consists of nodes and connecting lines, and the 8 structures respectively store the co-occurrence network graphs of the 8 dimensions author, organization, country, province, category 1, category 2, keyword and subject word; in the co-occurrence network graph G(V, E) of each dimension, the node set V stores the set of values of that dimension as graph nodes, and E stores, as graph connecting lines, the number of times two values of that dimension appear together in the same scientific and technological text;
taking the author dimension as an example, in the co-occurrence network graph G(V, E) of that dimension, the node set V stores the set of author names as graph nodes, and E stores, as graph connecting lines, the number of times two authors appear together in the same scientific and technological text.
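The three kinds of statistical results described above can be computed with a few lines of Python. This is a minimal sketch rather than the patent's binary data structures: the record format (one dict per text, mapping a dimension name to a list of values) and the function names are assumptions made for illustration.

```python
from collections import Counter
from itertools import combinations

def one_dimensional(records, dim):
    """Number of texts per value of one basic dimension."""
    counts = Counter()
    for rec in records:
        counts.update(set(rec.get(dim, [])))  # each text counts once per value
    return counts

def two_dimensional(records, dim_a, dim_b):
    """Number of texts per (value of dim_a, value of dim_b) combination."""
    table = Counter()
    for rec in records:
        for a in set(rec.get(dim_a, [])):
            for b in set(rec.get(dim_b, [])):
                table[(a, b)] += 1
    return table

def cooccurrence_graph(records, dim):
    """G(V, E): V is the set of values, E maps a value pair to its co-occurrence count."""
    nodes, edges = set(), Counter()
    for rec in records:
        values = sorted(set(rec.get(dim, [])))
        nodes.update(values)
        # Sorting makes each unordered pair appear under a single key.
        for a, b in combinations(values, 2):
            edges[(a, b)] += 1
    return nodes, edges

records = [
    {"author": ["Wang", "Liu"], "time": [2019]},
    {"author": ["Wang", "Liu", "Zhang"], "time": [2020]},
    {"author": ["Zhang"], "time": [2020]},
]
print(one_dimensional(records, "author"))
print(two_dimensional(records, "author", "time"))
print(cooccurrence_graph(records, "author"))
```

In the author example, the edge between Liu and Wang gets weight 2 because the two names appear together in two texts.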
A data interpreter: the data interpreter reads in an analysis result, interprets it automatically and outputs a passage of text description, so that interpreting the analysis results requires zero participation of the user.
The data interpreter interprets table class share data, table class trend data, and network data:
1) Interpretation of table-type share data:
the one-dimensional statistical result is table-type share data, which is interpreted by quantity and proportion; the text description output for the top N entries is "The top N {0} are {1} {2} {3} {4} {5}, the data amounts are {6} {7} {8} {9} {10}, and the quantity ratios are {11} {12} {13} {14} {15}, respectively", where {0} is any one of the basic dimensions, {1} - {5} are the top-ranked entries of that dimension, {6} - {10} are the corresponding numbers of scientific and technological texts, and {11} - {15} are the corresponding proportions of the number of scientific and technological texts;
2) Interpretation of table-type trend data:
in this embodiment, the two-dimensional statistical result that includes the time dimension is table-type trend data, which is interpreted by the overall trend, the maximum value, the minimum value and the growth rate; the output text description is "The overall trend is increasing/decreasing/not obvious; the maximum was reached in XXXX, with a number of X; the minimum was in XXXX, with a number of X; the years with more significant growth include XXXX, YYYY and ZZZZ"; the overall trend is judged by calculating the slopes of the different time periods: if more slopes are positive than negative, the overall trend is increasing; if fewer are positive than negative, it is decreasing; and if the numbers are equal, the trend is not obvious; the maximum value and the minimum value are found by pairwise comparison of the values; the years with more significant growth are found by ranking the years by growth rate: the years whose growth rate is positive and which rank in the top three are the years with more significant growth.
3) Interpretation of the co-occurrence network graph G(V, E):
the text description output for the co-occurrence network graph is "{0} is mainly divided into the following groups: X1, X2, X3 …; Y1, Y2, Y3 …; Z1, Z2, Z3 …, among which the groups with stronger relationships are groups i, j, k …"; where {0} is one of the 8 dimensions author, organization, country, province, category 1, category 2, keyword and subject word; the groups are obtained by dividing the nodes with K-means clustering, and a group whose relationship strength is greater than the median is a group with a stronger relationship.
A graph drawing device: the graph drawing device reads in an analysis result and draws a graph, so that drawing the graphs requires zero participation of the user;
the graphs drawn by the graph drawing device are of three types, specifically:
1) Graph of table-type share data: the table-type share data is drawn as a bar chart and stored in a temporary directory for use by the report structure organizer.
2) Graph of table-type trend data: the table-type trend data is drawn as a line chart and stored in a temporary directory for use by the report structure organizer.
3) Graph of the co-occurrence network graph G(V, E): the co-occurrence network graph is drawn with labelled spherical nodes and connecting lines of varying thickness; the node layout uses the spring-embedded model algorithm [1] and the improved spring model, the Fruchterman-Reingold layout algorithm [2], with a set number of iterations for each algorithm; the resulting co-occurrence network graph is output and stored in a temporary directory for use by the report structure organizer.
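To make the node layout step more concrete, the following is a compact pure-Python sketch of the textbook Fruchterman-Reingold force-directed layout. It is not the patent's implementation: the function name, the unit-square drawing area, the starting temperature and the cooling factor are assumptions, and edge weights are ignored when computing the attraction.

```python
import math
import random

def fruchterman_reingold(nodes, edges, iterations=50, seed=0):
    """Return {node: (x, y)} inside the unit square.

    nodes: list of node names; edges: iterable of (a, b) pairs.
    """
    rng = random.Random(seed)
    pos = {n: [rng.random(), rng.random()] for n in nodes}
    k = math.sqrt(1.0 / max(len(nodes), 1))  # ideal distance for unit area
    temperature = 0.1                        # caps the move per iteration
    for _ in range(iterations):
        disp = {n: [0.0, 0.0] for n in nodes}
        # Repulsion k^2 / d between every pair of nodes.
        for i, u in enumerate(nodes):
            for v in nodes[i + 1:]:
                dx, dy = pos[u][0] - pos[v][0], pos[u][1] - pos[v][1]
                d = max(math.hypot(dx, dy), 1e-9)
                f = k * k / d
                disp[u][0] += dx / d * f
                disp[u][1] += dy / d * f
                disp[v][0] -= dx / d * f
                disp[v][1] -= dy / d * f
        # Attraction d^2 / k along every edge.
        for u, v in edges:
            dx, dy = pos[u][0] - pos[v][0], pos[u][1] - pos[v][1]
            d = max(math.hypot(dx, dy), 1e-9)
            f = d * d / k
            disp[u][0] -= dx / d * f
            disp[u][1] -= dy / d * f
            disp[v][0] += dx / d * f
            disp[v][1] += dy / d * f
        # Move each node by at most `temperature`, staying in the unit square.
        for n in nodes:
            d = max(math.hypot(disp[n][0], disp[n][1]), 1e-9)
            step = min(d, temperature)
            pos[n][0] = min(1.0, max(0.0, pos[n][0] + disp[n][0] / d * step))
            pos[n][1] = min(1.0, max(0.0, pos[n][1] + disp[n][1] / d * step))
        temperature *= 0.95  # cool down so that the layout settles
    return {n: (p[0], p[1]) for n, p in pos.items()}

layout = fruchterman_reingold(["a", "b", "c", "d"], [("a", "b"), ("b", "c")])
print(layout)
```

The `iterations` parameter plays the role of the set number of iterations mentioned above; a fixed seed makes the layout reproducible.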
Report structure organizer: the report structure organizer defines and organizes the structure of the analysis report to be output, so that organizing the report structure requires zero participation of the user.
Descriptors are defined in the report structure organizer and are used for organizing the structure of the analysis report.
The basic form of a descriptor is "description type | preconditions, format constraints on the described content, parameter settings and data settings"; descriptors are divided into 7 categories, specifically:
1) parameter descriptors, the basic format is:
param|id=;data=;where=;type=;
indicating that the analysis report should output a parameter here, specified by its number, the binary data structure data, the field and the type;
the numbers are assigned sequentially: 1, 2, 3, ….
2) Paragraph descriptors, the basic format is:
paragraph|level=;linesbefore=;linespace=;charactersbefore=;fontsize=;fontfamily=;italic=;bold=;align=;content=;
indicating that the analysis report should output a paragraph of text here, including the paragraph text given by content and its format settings: outline level, spacing before the paragraph, spacing after the paragraph, line spacing, font size, italic, bold and alignment;
3) the table descriptor, the basic format is:
tablestatic|name=;row=;column=;style=;data=;
indicating that the analysis report should output a table including name, row number, column number, pattern, corresponding binary data structure data.
4) The dynamic graph descriptor has the basic format:
figuredynamic|name=;data=;param=;save=;
indicating that the analysis report should output a non-network graph here, specified by its name, the data in the corresponding binary data structure, the parameters, and the temporary storage path.
5) The network diagram descriptor has the following basic format:
figurememory|name=;func=;params=;save=;
indicating that the analysis report should output a co-occurrence network graph here, specified by its name, the data interpreter in step S4, the parameters used for drawing, and the temporary storage path.
6) The transverse and longitudinal typesetting descriptor has the basic format:
segmentpage|orientation=;
indicating the paper direction (horizontal or vertical) of the analysis report in the current page layout;
7) directory descriptor, basic format is:
content|type=;
indicating that the analysis report should output a directory here, the directory type being a full-text directory, a graph directory or a table directory;
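The seven basic formats above share the shape `type|key=value;key=value;`. A minimal parsing sketch follows, under the assumption that values contain no `;` characters; the function name and the sample values (table name, style, data key) are hypothetical and are not taken from the patent.

```python
def parse_descriptor(line):
    """Split 'type|key=value;key=value;' into (type, {key: value})."""
    kind, _, body = line.strip().partition("|")
    fields = {}
    for part in body.split(";"):
        if not part.strip():
            continue  # skip the empty piece after the trailing ';'
        key, _, value = part.partition("=")  # only the first '=' separates key and value
        fields[key.strip()] = value.strip()
    return kind.strip(), fields


# Hypothetical descriptor instance.
kind, fields = parse_descriptor("tablestatic|name=Top authors;row=5;column=3;style=grid;data=author;")
```

A descriptor whose values may contain `;` (for example free text in `content=`) would need an escaping rule that the basic formats above do not specify.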
the report writer: the report writer is used for writing the analysis report according to the descriptor of the report structure organizer, and when meeting the corresponding descriptor, the report writer calls binary data structure data and outputs the binary data structure data according to the description of the descriptor.
Example 2
The embodiment provides a method for automatically constructing a scientific and technical text analysis report by using the system described in embodiment 1, as shown in fig. 1, the method includes the following steps:
S1. The user searches a scientific and technological text database and inputs the retrieved scientific and technological text to be analyzed into the zero-user-participation system for automatically constructing scientific and technological text analysis reports;
S2. The field mapper structurally reorganizes the scientific and technological text to be analyzed:
the field mapper reorganizes the text to be analyzed by serial number, author, organization, country, province, time, category 1, category 2, publication, subsidy item, keyword, title, abstract and full text;
S3. According to the field data structure in the binary data structure, the number of scientific and technological texts corresponding to each of the 11 basic dimensions (author, organization, country, province, time, category 1, category 2, publication, subsidy item, keyword and subject term) is counted in the reorganized text to be analyzed, giving the one-dimensional statistical result;
according to the field data structure in the binary data structure, the number of scientific and technological texts for each pairwise combination of the 11 basic dimensions (author, organization, country, province, time, category 1, category 2, publication, subsidy item, keyword and subject term) is counted in the reorganized text to be analyzed, giving the two-dimensional statistical result;
according to the graph data structure in the binary data structure, the co-occurrence network graphs of the 8 dimensions of author, organization, country, province, category 1, category 2, keyword and subject term are computed from the reorganized text to be analyzed;
storing the one-dimensional statistical result, the two-dimensional statistical result and the co-occurrence network diagram into a binary analysis result storage file;
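The three kinds of statistics in step S3 can be sketched as follows. The record layout (one dict per text, mapping a dimension name to a list of values) and all function names are assumptions made for illustration; each text is counted once per distinct value.

```python
from collections import Counter
from itertools import combinations


def one_dimensional(records, dim):
    """Count texts per value of a single dimension."""
    counts = Counter()
    for rec in records:
        counts.update(set(rec.get(dim, [])))
    return counts


def two_dimensional(records, dim_a, dim_b):
    """Count texts per (value of dim_a, value of dim_b) pair."""
    counts = Counter()
    for rec in records:
        for a in set(rec.get(dim_a, [])):
            for b in set(rec.get(dim_b, [])):
                counts[(a, b)] += 1
    return counts


def cooccurrence(records, dim):
    """Build G(V, E): V is the value set, E maps value pairs to co-occurrence counts."""
    nodes, edges = set(), Counter()
    for rec in records:
        values = sorted(set(rec.get(dim, [])))
        nodes.update(values)
        edges.update(combinations(values, 2))  # each unordered pair once per text
    return nodes, edges


# Hypothetical records: three texts with authors A, B, C.
records = [
    {"author": ["A", "B"], "time": ["2019"]},
    {"author": ["A", "C"], "time": ["2020"]},
    {"author": ["A", "B"], "time": ["2020"]},
]
author_counts = one_dimensional(records, "author")
author_by_year = two_dimensional(records, "author", "time")
author_nodes, author_edges = cooccurrence(records, "author")
```

With these records, author A appears in three texts, and the pair (A, B) co-occurs in two of them.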
s4, the graph drawing device reads the one-dimensional statistical result, the two-dimensional statistical result and the co-occurrence network graph obtained in the step S3 and draws the graph, wherein:
the one-dimensional statistical result is table-type share data; its graph is drawn as a bar chart and stored in a temporary directory;
the two-dimensional statistical result with a time dimension is table-type trend data; its graph is drawn as a line chart and stored in a temporary directory;
the graph of the co-occurrence network G(V, E) is drawn with labeled spherical nodes and connecting lines of varying thickness; node layout uses the spring-embedded model algorithm[1] and the improved Fruchterman-Reingold layout algorithm[2], each run for a set number of iterations, and the resulting co-occurrence network graph is stored in a temporary directory;
in addition, the data interpreter reads in the one-dimensional statistical result, the two-dimensional statistical result and the data of the co-occurrence network diagram obtained in the step S3, automatically interprets the data, and outputs a segment of text description; wherein:
the one-dimensional statistical result is table-type share data, which is interpreted by quantity and proportion; for the top N items after sorting, the output text description is "the top N {0} are {1} {2} {3} {4} {5}, respectively; the data quantities are {6} {7} {8} {9} {10}, respectively; and the quantity proportions are {11} {12} {13} {14} {15}, respectively", wherein {0} is any one of basic dimensions, {1} - {5} is a corresponding scientific text quantity, and {6} - {10} is a corresponding scientific text quantity proportion;
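A possible rendering of the share-data description is sketched below. The sentence wording, the function name, and the choice to compute each proportion against the sum of all counts in the dimension are assumptions; the text above fixes only that the top-N items are reported with their quantities and proportions.

```python
def describe_share(dimension, counts, n=5):
    """Render a one-sentence summary of the top-n values by count and share."""
    total = sum(counts.values())
    if total == 0:
        return f"No data for {dimension}."
    # Sort by count (descending), then by name so that ties are deterministic.
    top = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:n]
    names = ", ".join(name for name, _ in top)
    numbers = ", ".join(str(count) for _, count in top)
    shares = ", ".join(f"{count / total:.1%}" for _, count in top)
    return (f"The top {len(top)} {dimension} values are {names}; "
            f"their counts are {numbers}; their shares are {shares}.")


# Hypothetical counts for the "author" dimension.
sentence = describe_share("author", {"A": 3, "B": 2, "C": 1}, n=2)
```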
the two-dimensional statistical result with a time dimension is table-type trend data; the data interpreter interprets it according to the overall trend, the maximum value, the minimum value and the growth rate, and the output text description is "the overall trend is increasing/decreasing/not obvious; the maximum is reached in year XXXX, with a number of X; the minimum is in year XXXX, with a number of X; the years with more prominent growth include XXXX, YYYY and ZZZZ"; the overall trend is judged by calculating the slopes of the different time periods, using the formula:
slope = (Y(t) - Y(t-1)) / (X(t) - X(t-1));
where t is the year, Y(t) is the data value in year t, and X(t) - X(t-1) is the time span, which can be set to 1 year.
If positive slopes outnumber negative slopes, the overall trend is increasing; if negative slopes outnumber positive slopes, it is decreasing; if the two counts are equal, the trend is not obvious. The maximum and minimum values are determined by pairwise comparison of the values. The years with more prominent growth are ranked by growth rate: the three highest-ranked years with a positive growth rate are the years with more prominent growth.
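The trend rules can be sketched as follows for a non-empty series of yearly counts. The growth rate is not defined in the text; this sketch assumes it is (Y(t) - Y(t-1)) / Y(t-1) and skips years whose previous value is zero. The function name and the return format are likewise assumptions.

```python
def describe_trend(series):
    """series: {year: count}. Returns (trend, max_year, min_year, growth_years)."""
    years = sorted(series)
    slopes, growth = [], {}
    for prev, cur in zip(years, years[1:]):
        slopes.append((series[cur] - series[prev]) / (cur - prev))
        if series[prev] > 0:  # growth rate is undefined for a zero base (assumption)
            growth[cur] = (series[cur] - series[prev]) / series[prev]
    positive = sum(1 for s in slopes if s > 0)
    negative = sum(1 for s in slopes if s < 0)
    if positive > negative:
        trend = "increasing"
    elif negative > positive:
        trend = "decreasing"
    else:
        trend = "not obvious"
    max_year = max(years, key=lambda y: series[y])
    min_year = min(years, key=lambda y: series[y])
    # Keep only years with positive growth, highest growth rate first, top three.
    ranked = sorted((y for y, g in growth.items() if g > 0), key=lambda y: -growth[y])
    return trend, max_year, min_year, ranked[:3]


# Hypothetical yearly counts.
result = describe_trend({2015: 10, 2016: 15, 2017: 12, 2018: 24, 2019: 30})
# result == ("increasing", 2019, 2015, [2018, 2016, 2019])
```

In this example three of the four slopes are positive, so the trend is reported as increasing, and the growth rates of 2018, 2016 and 2019 are 100%, 50% and 25%.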
The output text description for the co-occurrence network graph is "{0} is mainly divided into the following groups: X1, X2, X3 …; Y1, Y2, Y3 …; Z1, Z2, Z3 …, among which the groups with stronger relationships are groups i, j, k …"; wherein {0} is one of the 8 dimensions of author, organization, country, province, category 1, category 2, keyword and subject term; the nodes are divided into groups by K-means clustering, and the groups whose relationship is greater than the median are the groups with stronger relationships;
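The median rule for picking the stronger groups can be sketched as follows. The grouping itself is taken as given (it would come from the K-means step, which is not shown), and measuring a group's relationship as the sum of the co-occurrence weights between its own members is an assumption, since the text does not define the measure. At least one group is assumed.

```python
from statistics import median


def stronger_groups(groups, edges):
    """groups: {label: [nodes]}; edges: {(u, v): weight}. Returns labels above the median."""
    member = {node: label for label, nodes in groups.items() for node in nodes}
    strength = {label: 0 for label in groups}
    for (u, v), weight in edges.items():
        # Only edges whose two endpoints fall in the same group count (assumption).
        if u in member and member[u] == member.get(v):
            strength[member[u]] += weight
    cutoff = median(strength.values())
    return sorted(label for label, s in strength.items() if s > cutoff)


# Hypothetical grouping and co-occurrence weights.
groups = {1: ["A", "B"], 2: ["C", "D"], 3: ["E", "F"]}
edges = {("A", "B"): 5, ("C", "D"): 2, ("E", "F"): 1, ("B", "C"): 4}
strong = stronger_groups(groups, edges)  # group strengths are 5, 2, 1; the median is 2
```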
S5. The report structure organizer defines and organizes the content and structure of the output analysis report, wherein the descriptors defined in the report structure organizer organize the structure of the analysis report and the data structures from step S2 that need to be linked into the report.
The parameter descriptor indicates the parameter that the analysis report should output here, including the parameter's number and the data, field and type of the binary data structure involved;
the paragraph descriptor indicates that the analysis report should output a paragraph of text, including the content-determined paragraph text content and its formatting on outline level, paragraph front, paragraph back, line spacing, font size, italics, bold, and alignment;
the table descriptor indicates that the analysis report should output a table including name, row number, column number, style, and corresponding binary data structure data.
The dynamic graph descriptor indicates that the analysis report should output a non-network graph including a name, data in a corresponding binary data structure, temporary storage path, parameters, and so on.
The network map descriptor indicates that the analysis report should output a co-occurrence network map, including name, data interpreter, parameters used for drawing and temporarily stored path;
the horizontal and vertical layout descriptor indicates the paper direction (horizontal or vertical) of the analysis report in the current page layout;
the directory descriptor indicates that the analysis report should output a directory, and the directory is a full text directory, a graph directory and a table directory;
the report writer writes the analysis report according to the descriptor of the report structure organizer, calls binary data structure data when encountering the corresponding descriptor, outputs the binary data structure data according to the descriptor description, and finally generates the required scientific and technical text analysis report.
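The report writer's descriptor-driven behavior can be sketched as a dispatch over already parsed descriptors. The handler names, the plain-text output and the shape of the `data` mapping are hypothetical; a real writer would emit formatted document elements rather than placeholder strings.

```python
def write_report(descriptors, data):
    """Walk (kind, fields) descriptors in order and emit one output line per rendered item."""
    def paragraph(fields):
        return fields.get("content", "")

    def table(fields):
        rows = data.get(fields.get("data"), [])
        return f"[table {fields.get('name', '')}: {len(rows)} rows]"

    def figure(fields):
        return f"[figure {fields.get('name', '')} from {fields.get('save', '')}]"

    handlers = {
        "paragraph": paragraph,
        "tablestatic": table,
        "figuredynamic": figure,
        "figurememory": figure,
    }
    lines = []
    for kind, fields in descriptors:
        handler = handlers.get(kind)
        if handler is None:
            continue  # param, segmentpage and content are not rendered in this sketch
        lines.append(handler(fields))
    return "\n".join(lines)


# Hypothetical descriptors and analysis data.
report = write_report(
    [
        ("paragraph", {"content": "1. Author analysis"}),
        ("tablestatic", {"name": "Top authors", "data": "author"}),
        ("figuredynamic", {"name": "Author share", "save": "tmp/author.png"}),
        ("segmentpage", {"orientation": "landscape"}),
    ],
    {"author": [("A", 3), ("B", 2)]},
)
```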
Specifically, once the scientific and technical text analysis report has been written, it can be sent by email to a preset mailbox.
Various corresponding changes and modifications can be made by those skilled in the art according to the above technical solutions and concepts, and all such changes and modifications should be included in the scope of the present invention as claimed.

Claims (9)

1. The automatic construction system of scientific and technological text analysis reports with zero participation of users is characterized by comprising a field mapper for carrying out structural reorganization on scientific and technological texts to be analyzed, a binary data structure for storing analysis results, a corresponding binary analysis result storage file, a data interpreter, a graph renderer, a report structure organizer and a report writer;
the field mapper is used for carrying out structural reorganization on the scientific and technical text to be analyzed: the field mapper recombines the scientific and technological text to be analyzed according to 11 dimensions of serial number, author, organization, country, province, time, category 1, category 2, publication, subsidy project, keyword, title, abstract and full text;
a binary data structure for storing the analysis results and a corresponding binary analysis result storage file; the binary data structure for storing the analysis results includes:
11 field data structure: the system is used for storing the technical text quantity corresponding to authors, organizations, countries, provinces, time, category 1, category 2, publications, subsidized items, keywords and subject words, and the 11 fields are defined as basic dimensions; the content stored in the 11 field data structure is defined as a one-dimensional statistical result;
11 × 11 data table structures: the system is used for storing the number of the scientific and technological texts after the 11 basic dimensions are combined pairwise, and the contents stored by the 11 × 11 data table structures are defined as two-dimensional statistical results;
8 graphic data structures: each graph data structure is composed of nodes and connecting lines and is respectively used for storing 8 dimensionality co-occurrence network graphs of an author, an organization, a country, a province, a category 1, a category 2, a keyword and a subject word; in a co-occurrence network graph G (V, E) of each dimension, a node set V stores a content set of the corresponding dimension, and E stores the times of the content of each dimension appearing in the same scientific and technical text as a graph connecting line;
a data interpreter: the data interpreter is used for reading in an analysis result, automatically interpreting the analysis result and then outputting a segment of text description;
the one-dimensional statistical result is table type share data, the data interpreter interprets the table type share data according to the quantity and the proportion, and outputs literal description for the data with N bits before the quantity and the proportion are sorted;
the two-dimensional statistical result with the time dimension is table trend data, and the data interpreter interprets the table trend data according to the overall trend, the maximum value, the minimum value and the increment rate of the data and outputs a text description;
the data interpreter interprets the co-occurrence network diagram according to the strength of the relation and outputs a text description;
a graph drawing device: the graph drawing device is used for reading in an analysis result and drawing a graph;
for the table type share data, a column diagram is adopted for drawing graphs;
for the table type trend data, adopting a line graph to carry out graphic drawing;
for the co-occurrence network diagram, drawing by adopting spherical nodes with characters and thick and thin connecting lines;
report structure organizer: the report structure organizer is used for defining and organizing the content and the structure of the output analysis report; the report structure organizer defines a descriptor, wherein the descriptor is used for organizing the content and the structure of the analysis report and organizing a binary data structure needing to be connected in the report;
the report writer: the report writer is used for writing the analysis report according to the descriptor of the report structure organizer, and when the corresponding descriptor is encountered, the required binary data structure data is called and output according to the descriptor.
2. The system of claim 1, wherein the subject word is a phrase segmented by computer from the title, abstract, and text of scientific text.
3. The system according to claim 1, wherein the text description output by the data interpreter for the table-type share data is "the top N {0} are {1} {2} {3} {4} {5}, respectively; the data quantities are {6} {7} {8} {9} {10}, respectively; and the quantity proportions are {11} {12} {13} {14} {15}, respectively", where {0} is any one of the basic dimensions, {1} - {5} is the corresponding scientific text number, and {6} - {10} is the corresponding scientific text number.
4. The system of claim 1, wherein the text description output by the data interpreter for the table-type trend data is "the overall trend is increasing/decreasing/not obvious; the maximum is reached in year XXXX, with a number of X; the minimum is in year XXXX, with a number of X; the years with more significant growth include XXXX, YYYY and ZZZZ"; the overall trend is judged by calculating the slopes of the different time periods: if positive slopes outnumber negative slopes, the overall trend is increasing; if negative slopes outnumber positive slopes, it is decreasing; if the two counts are equal, the trend is not obvious; the maximum and minimum values are determined by pairwise comparison of the values; the years with more significant growth are ranked by growth rate, the three highest-ranked years with a positive growth rate being the years with more significant growth.
5. The system of claim 1, wherein the text description output by the data interpreter for the co-occurrence network graph is "{0} is mainly divided into the following groups: X1, X2, X3 …; Y1, Y2, Y3 …; Z1, Z2, Z3 …, among which the groups with stronger relationships are groups i, j, k …"; wherein {0} is one of the 8 dimensions of author, organization, country, province, category 1, category 2, keyword and subject term; the nodes are divided into groups by K-means clustering, and the groups whose relationship is greater than the median are the groups with stronger relationships.
6. The system of claim 1, wherein the descriptors are classified into 7 categories, including:
1) parameter descriptors, the basic format is:
param|id=;data=;where=;type=;
indicating that the analysis report should output parameters including number, binary data structure data, field and type;
2) paragraph descriptors, the basic format is:
paragraph|level=;linesbefore=;linespace=;charactersbefore=;fontsize=;fontfamily=;italic=;bold=;align=;content=;
indicating that the analysis report should output a text segment including the text content of the text segment determined by the content and its format setting on outline level, segment front, segment back, line spacing, font size, italics, bold, and alignment;
3) the table descriptor, the basic format is:
tablestatic|name=;row=;column=;style=;data=;
indicating that the analysis report should output a table including table name, row number, column number, style, binary data structure data;
4) the dynamic graph descriptor has the basic format:
figuredynamic|name=;data=;param=;
indicating that the analysis report should output a non-network graph comprising a graph name, corresponding binary data structure data, a temporary storage path and parameters;
5) the network diagram descriptor has the following basic format:
figurememory|name=;func=;params=;save=;
indicating that the analysis report should output a co-occurrence network map including the name, data interpreter in step S4, parameters used for mapping, and temporarily stored path;
6) the transverse and longitudinal typesetting descriptor has the basic format:
segmentpage|orientation=;
indicating the paper direction of the analysis report in the current page layout;
7) directory descriptor, basic format is:
content|type=;
indicating that the analysis report should output a directory here, the type of directory being a full-text directory, a graph directory and/or a table directory.
7. A method for automatic construction of scientific text analysis reports using the system of any one of claim 1, comprising the steps of:
s1, searching in a scientific and technological text database by the user, and inputting the scientific and technological text to be analyzed obtained by searching into a scientific and technological text analysis report automatic construction system with zero participation of the user;
s2, carrying out structural reorganization on the scientific and technical text to be analyzed by the field mapper:
the technical text to be analyzed by the field mapper is recombined according to the serial number, the author, the organization, the country, the province, the time, the category 1, the category 2, the publication, the subsidy project, the key word, the title, the abstract and the full text;
s3, counting the number of scientific and technological texts corresponding to 11 basic dimensions of an author, an organization, a country, a province, time, a category 1, a category 2, a publication, a subsidy project, a keyword and a subject term in the scientific and technological text to be analyzed after the structure is recombined according to a field data structure in a binary data structure to obtain a one-dimensional statistical result;
counting the number of scientific and technological texts which are subjected to two-two combination of 11 basic dimensions of authors, organizations, countries, provinces, time, category 1, category 2, publications, subsidy items, keywords and subject words in the scientific and technological texts to be analyzed after the structure is recombined according to a field data structure in a binary data structure to obtain a two-dimensional statistical result;
according to a graph data structure in a binary data structure, counting 8 dimensionality co-occurrence network diagrams of authors, organizations, countries, provinces, category 1, category 2, keywords and subject words in the scientific and technical text to be analyzed after the structure is recombined;
storing the one-dimensional statistical result, the two-dimensional statistical result and the co-occurrence network diagram into a binary analysis result storage file;
s4, the graph drawing device reads the one-dimensional statistical result, the two-dimensional statistical result and the co-occurrence network graph obtained in the step S3 and draws the graph, wherein:
the one-dimensional statistical result is table type share data, and a graph of the table type share data is drawn by adopting a bar chart and stored in a temporary directory;
the two-dimensional statistical result with the time dimension is table type trend data, and a graph of the table type trend data is drawn by adopting a line graph and is stored in a temporary directory;
drawing the graph of the co-occurrence network graph G (V, E) by adopting spherical nodes with characters and thick and thin connecting lines and storing the graph into a temporary directory;
in addition, the data interpreter reads in the one-dimensional statistical result, the two-dimensional statistical result and the co-occurrence network diagram obtained in the step S3, automatically interprets the one-dimensional statistical result, the two-dimensional statistical result and the co-occurrence network diagram, and outputs a segment of text description;
the one-dimensional statistical result is table type share data, the data interpreter interprets the table type share data according to the quantity and the proportion, and outputs literal description for the data with N bits before the quantity and the proportion are sorted;
the two-dimensional statistical result with the time dimension is table trend data, and the data interpreter interprets the table trend data according to the overall trend, the maximum value, the minimum value and the increment rate of the data and outputs a text description;
the data interpreter interprets the co-occurrence network diagram according to the strength of the relation and outputs a text description;
s5, the report structure organizer defines and organizes the content and structure of the output analysis report, wherein the descriptor defined in the report structure organizer organizes the content and structure of the analysis report;
the report writer writes the analysis report according to the descriptor of the report structure organizer, calls corresponding binary data structure data when encountering the corresponding descriptor, outputs the binary data structure data according to the descriptor description, and finally generates the required scientific and technical text analysis report.
8. The method as claimed in claim 7, wherein in step S4, the interpretation process of the data interpreter is as follows:
the one-dimensional statistical result is table-type share data, which is interpreted by quantity and proportion; for the top N items after sorting, the output text description is "the top N {0} are {1} {2} {3} {4} {5}, respectively; the data quantities are {6} {7} {8} {9} {10}, respectively; and the quantity proportions are {11} {12} {13} {14} {15}, respectively", wherein {0} is any one of basic dimensions, {1} - {5} is the corresponding quantity of scientific text, and {6} - {10} is the corresponding quantity of scientific text;
the two-dimensional statistical result with a time dimension is table-type trend data; the data interpreter interprets it according to the overall trend, the maximum value, the minimum value and the growth rate of the data, and the output text description is "the overall trend is increasing/decreasing/not obvious; the maximum is reached in year XXXX, with a number of X; the minimum is in year XXXX, with a number of X; the years with more significant growth include XXXX, YYYY and ZZZZ"; the overall trend is judged by calculating the slopes of the different time periods: if positive slopes outnumber negative slopes, the overall trend is increasing; if negative slopes outnumber positive slopes, it is decreasing; if the two counts are equal, the trend is not obvious; the maximum and minimum values are determined by pairwise comparison of the values; the years with more significant growth are ranked by growth rate, the three highest-ranked years with a positive growth rate being the years with more significant growth;
the output text description for the co-occurrence network graph is "{0} is mainly divided into the following groups: X1, X2, X3 …; Y1, Y2, Y3 …; Z1, Z2, Z3 …, among which the groups with stronger relationships are groups i, j, k …"; wherein {0} is one of the 8 dimensions of author, organization, country, province, category 1, category 2, keyword and subject term; the nodes are divided into groups by K-means clustering, and the groups whose relationship is greater than the median are the groups with stronger relationships.
9. The method according to claim 7, wherein in step S5, in the report structure organizer:
the parameter descriptor indicates that the analysis report should output parameters including number, binary data structure data, field and type;
the paragraph descriptor indicates that the analysis report should output a paragraph of text, including the paragraph text content and its formatting on outline level, paragraph front, paragraph back, line spacing, font size, italics, bold, and alignment;
the table descriptor indicates that the analysis report should output a table, including table name, row number, column number, style, and binary data structure data;
the dynamic graph descriptor indicates that the analysis report should output a non-network graph comprising a graph name, data in a binary data structure, a temporary storage path and parameters;
the network map descriptor indicates that the analysis report should output a co-occurrence network map, including name, data interpreter, parameters used for drawing, and temporarily stored path;
the horizontal and vertical typesetting descriptors indicate the paper direction of the analysis report in the current page typesetting;
the directory descriptor indicates that the analysis report should output a directory therein, and the types of the directory are a full-text directory, a graph directory and a table directory.
CN202010362633.2A 2020-04-30 2020-04-30 System and method for automatically constructing scientific and technological text analysis report with zero participation of user Active CN111581928B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010362633.2A CN111581928B (en) 2020-04-30 2020-04-30 System and method for automatically constructing scientific and technological text analysis report with zero participation of user

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010362633.2A CN111581928B (en) 2020-04-30 2020-04-30 System and method for automatically constructing scientific and technological text analysis report with zero participation of user

Publications (2)

Publication Number Publication Date
CN111581928A true CN111581928A (en) 2020-08-25
CN111581928B CN111581928B (en) 2022-03-01

Family

ID=72120686

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010362633.2A Active CN111581928B (en) 2020-04-30 2020-04-30 System and method for automatically constructing scientific and technological text analysis report with zero participation of user

Country Status (1)

Country Link
CN (1) CN111581928B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115796405A (en) * 2023-02-03 2023-03-14 阿里巴巴达摩院(杭州)科技有限公司 Solution report generation method for optimization model and computing equipment

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20040158578A1 (en) * 2002-12-31 2004-08-12 Chung-I Lee System and method for generating structured information reports
CN109800397A (en) * 2017-11-16 2019-05-24 北大方正集团有限公司 Data analysis report automatic generation method, device, computer equipment and medium
CN110400101A (en) * 2019-08-21 2019-11-01 苏州经贸职业技术学院 Industry reports analysis system and method

Also Published As

Publication number Publication date
CN111581928B (en) 2022-03-01

Similar Documents

Publication Publication Date Title
Günther et al. Word counts and topic models: Automated text analysis methods for digital journalism research
US7930322B2 (en) Text based schema discovery and information extraction
US7603351B2 (en) Semantic reconstruction
Hurst The interpretation of tables in texts
CN101079024B (en) Special word list dynamic generation system and method
CN100447779C (en) Document information processing apparatus, document information processing method, and document information processing program
JP4343213B2 (en) Document processing apparatus and document processing method
CN105808526A (en) Commodity short text core word extracting method and device
CN100444591C (en) Method for acquiring front-page keyword and its application system
CN108664574A (en) Input method, terminal device and the medium of information
CN102831131B (en) Method and device for establishing labeling webpage linguistic corpus
CN101673306A (en) Website information query method and system thereof
CN117112806B (en) Knowledge graph-based information structuring method and device
CN108959204B (en) Internet financial project information extraction method and system
CN114064851A (en) Multi-machine retrieval method and system for government office documents
CN111753536A (en) Automatic patent application text writing method and device
CN111581928B (en) System and method for automatically constructing scientific and technological text analysis report with zero participation of user
CN115827862A (en) Associated acquisition method for multivariate expense voucher data
Long An agent-based approach to table recognition and interpretation
CN112148735B (en) Construction method for structured form data knowledge graph
CN112199960A (en) Standard knowledge element granularity analysis system
CN102479072B (en) Multi-header report generating method, device and terminal
CN108829698A (en) Government system dispatch method, apparatus, computer equipment and storage medium
JP2013016036A (en) Document component generation method and computer system
CN116644740A (en) Dictionary automatic extraction method and system based on single text term solidification degree

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant