CN108959626B - Efficient automatic generation method for cross-platform heterogeneous data profile - Google Patents


Info

Publication number
CN108959626B
Authority
CN
China
Prior art keywords
class
token
data
database
content
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201810811216.4A
Other languages
Chinese (zh)
Other versions
CN108959626A (en)
Inventor
尹健康
张卫东
宋红文
刘宁
洪海舟
贺红梅
Current Assignee
Chengdu Co Of Sichuan Tobacco Co
Original Assignee
Chengdu Co Of Sichuan Tobacco Co
Priority date
Filing date
Publication date
Application filed by Chengdu Co Of Sichuan Tobacco Co filed Critical Chengdu Co Of Sichuan Tobacco Co
Priority to CN201810811216.4A priority Critical patent/CN108959626B/en
Publication of CN108959626A publication Critical patent/CN108959626A/en
Application granted granted Critical
Publication of CN108959626B publication Critical patent/CN108959626B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/10Text processing
    • G06F40/103Formatting, i.e. changing of presentation of documents
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/10Text processing
    • G06F40/166Editing, e.g. inserting or deleting
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The invention discloses an efficient automatic generation method for cross-platform heterogeneous data briefings, which comprises the following steps: (1) processing massive heterogeneous data, with an SX404DB key-value database adopted to centrally manage the data; the SX404DB database is a key-value NoSQL database based on inverted-index technology; (2) automatically generating the briefing content, which is dynamically generated through a DocumentScript script control system; (3) automatically typesetting the briefing format, completed by injecting content into a format template based on Office OpenXML and compressing the content into a DOCX-format document. Compared with the prior art, the briefing generation method supports massive heterogeneous data, offers a flexible and extensible content generation mode, and produces a stable briefing format with high compatibility; it has good stability, operability and extensibility.

Description

Efficient automatic generation method for cross-platform heterogeneous data profile
Technical Field
The invention relates to a computer data processing method, in particular to a method for efficiently and automatically generating briefings from cross-platform heterogeneous data.
Background
The tobacco industry has a huge market and organizational structure. Tobacco operation involves market supervision, production, sales and other links, and many tobacco sales sites exist on the market, so managing the tobacco industry has always involved a huge workload.
In tobacco management, computerized data management has already been realized, but the management work concerns not only how to standardize the workflow of each link, but also how to provide the management unit with a briefing that faithfully reflects the current state of the tobacco industry and provides references for management decision makers.
Office automation has long been an important area of computer application, and electronic manuscript technology is an important branch of this area. In a narrow sense, an electronic manuscript is a digital manuscript that can be read, edited, or published on an electronic device, while in a broad sense it refers to any multimedia digital document. The electronic briefing is a special electronic manuscript used for reporting and describing, and is widely applied in fields such as telecommunications, transportation, finance, education and geology. Such manuscripts are functional documents: electronic manuscripts of the same kind show great similarity in content and form, and sometimes are almost identical except for the data. Writing briefings often consumes a great deal of manual labor, so research on automatic briefing generation is very significant.
Automatic briefing generation can be divided into three layers: data organization, briefing content generation, and format typesetting. In conventional applications, the data source is often a relational database, whose high transactional consistency and ease of use are beyond doubt. Modern electronic briefings, however, are mostly not statistical write-ups of small-scale data but are based on massive heterogeneous data, so conventional relational databases are evidently unsuitable as data sources for modern automatic briefing generation systems. The content generation modes of most conventional briefing systems fall roughly into two types: one generates the briefing content from beginning to end with a programming language; the other simply uses templates and generates content by placeholder substitution. Because most popular document formats are proprietary, traditional briefing systems typeset electronic documents by calling third-party APIs, but the typesetting results are often unsatisfactory due to incomplete compatibility or incomplete help documentation.
Disclosure of Invention
In order to solve these problems, the invention provides a method for efficiently and automatically generating cross-platform heterogeneous data briefings, which generates briefing content using a NoSQL database as the data source and uses a custom briefing format following the Office OpenXML standard.
The efficient automatic generation method for cross-platform heterogeneous data briefings of the invention comprises the following steps:
(1) Processing massive heterogeneous data: an SX404DB key-value database is adopted for centralized data management. The SX404DB database is a key-value NoSQL database based on inverted-index technology.
(2) Automatically generating the briefing content: the content is dynamically generated through the DocumentScript script control system.
(3) Automatically typesetting the briefing format: completed by injecting content into a format template based on Office OpenXML and compressing the content into a DOCX-format document.
The above method is further described: adopting the SX404DB key-value database to centrally manage data means creating a database Session object to initiate a database session with SX404DB. Specifically, an entity object is first created and its coding, type, region and time attributes are set; the query method of the session object is then called with the combined condition as a parameter to query all entity objects meeting the corresponding condition.
The above method is further described: the SX404DB key-value database centrally manages data through the following packages contained in SX404DB. The convertor package converts formats between data objects. The directory package implements index directory management. The index package implements query and modification of the index. The properties package implements database configuration. The session package manages database access sessions. The condition package implements condition functions in data operations. The sort package implements the sorting function in data queries.
The user accesses resources in the database through the save, delete, update and query methods contained in the Session class. The Session class depends on the Searcher class, the Processer class and the DocumentConverter class: it queries data through the Searcher class, modifies data through the Processer class, and converts data object formats through the DocumentConverter class.
The ConcurrentDirectory class is in an aggregation relationship with the Searcher and Processer classes: a ConcurrentDirectory object appears as an attribute in each of them.
The Processer class provides the following methods for invocation: the delete method logically deletes data, and when it is called the affected data enters a recycle area, which is cleaned by the clearTrash method. The forceDelete method is a physical deletion method; physically deleted data cannot be restored. The insert method adds data and the update method modifies data.
The whole SX404DB database provides only one ConcurrentDirectory instance per file path; storage, query, modification and deletion are all realized by operating this instance. The ConcurrentDirectory thread lock is read-write separated, and all write operations are executed synchronously in a queued manner.
The above method is further described: when the SX404DB key-value database centrally manages data, all complete data objects are stored in units of the Document class, and each Document object contains several Field members.
The above method is further described: dynamically generating briefing content through the DocumentScript script control system specifically comprises:
(1) Lexical analysis: segmenting the source code into words.
(2) Syntax analysis: clarifying the hierarchical logical relations among the words and generating several abstract syntax trees.
(3) Execution: using an interpreted-language processor, the interpreter interprets and executes the abstract syntax trees one by one and feeds back the execution results.
The above method is further described: lexical analysis is realized by a Lexer tokenizer that splits the source code string into Tokens, with the tokenizing function completed through regular-expression matching.
Tokens are divided into a string Token class, a numeric Token class, an identifier Token class and an end-of-file Token class; the end-of-file Token is implemented in singleton fashion as a static member of the Token class.
In the Lexer class, string-type fields comPat, numPat, strPat and idPat are set; these are the regular expressions matching comments, numeric Tokens, string Tokens and identifier Tokens respectively. The Lexer class also sets a regexPat string field whose expression matches all legal strings in DocumentScript. During Lexer parsing, the tokenizer reads the source code line by line, checks each line's content from left to right for matches against regexPat, and extracts all matching strings.
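A minimal Java sketch of the regex-driven splitting described above. The patent does not give the actual expressions, so the patterns assigned to comPat, numPat, strPat and idPat here are illustrative assumptions, as is the line-oriented `tokenizeLine` helper:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical reconstruction of the Lexer splitting step: regexPat is the
// union of the comment, number, string and identifier patterns; each line is
// scanned left to right and every match becomes one raw token string.
public class LexerSketch {
    static final String comPat = "//.*";                      // comments (assumed syntax)
    static final String numPat = "[0-9]+";                    // numeric literals
    static final String strPat = "\"(?:\\\\.|[^\"\\\\])*\"";  // string literals
    static final String idPat  = "[A-Za-z_][A-Za-z0-9_]*|[=+\\-*/<>!(){};]"; // identifiers & operators
    static final Pattern regexPat =
        Pattern.compile("\\s*(" + comPat + "|" + numPat + "|" + strPat + "|" + idPat + ")");

    // Split one source line into raw token strings, discarding comments.
    public static List<String> tokenizeLine(String line) {
        List<String> tokens = new ArrayList<>();
        Matcher m = regexPat.matcher(line);
        int pos = 0;
        while (pos < line.length() && m.find(pos)) {
            String t = m.group(1);
            if (!t.startsWith("//")) tokens.add(t);  // comments are matched but not kept
            pos = m.end();
        }
        return tokens;
    }
}
```

The real Lexer would wrap each string in a typed Token object; this sketch stops at the raw strings to keep the matching step visible.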
The Lexer object obtains source code by accepting the Reader object.
For the syntax analysis that follows tokenization, the Lexer class provides a peek method. Constructing the abstract syntax tree is a depth-first backtracking process: when a construction error is found midway, several Tokens must be returned for reconstruction. To support this, a peek method and a buffer queue for storing temporary Tokens are provided. When constructing the abstract syntax tree, the Tokens to be read later are first obtained through the peek method and stored in the buffer queue, the buffered content is then examined, and finally the Tokens are obtained through the read method to build the abstract syntax tree.
In the concrete Lexer implementation, each call of the read method checks whether the buffer queue is empty and, if so, adds a Token to it; the method then returns the Token at the head of the buffer queue and deletes it from the queue. The peek method stores several pre-read Tokens in the buffer queue and, on each execution, returns the requested Token as the function's return value without deleting any element from the queue.
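The read/peek buffering just described can be sketched as follows, with the token supply abstracted as an iterator (in the real Lexer it would be the regex splitter) and class and constant names chosen for illustration:

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Sketch of the Lexer's read/peek buffer queue: peek(i) pre-reads without
// consuming, read() returns and removes the head token.
public class TokenBuffer {
    public static final String EOF = "<EOF>";   // stand-in for the singleton end-of-file Token
    private final Iterator<String> source;
    private final List<String> queue = new ArrayList<>();  // buffer of pre-read tokens

    public TokenBuffer(Iterator<String> source) { this.source = source; }

    // Ensure at least i+1 tokens are buffered; false once the source is exhausted.
    private boolean fillQueue(int i) {
        while (queue.size() <= i) {
            if (!source.hasNext()) return false;
            queue.add(source.next());
        }
        return true;
    }

    // peek(i): return the i-th token ahead without deleting anything.
    public String peek(int i) {
        return fillQueue(i) ? queue.get(i) : EOF;
    }

    // read(): return the head token and delete it from the buffer queue.
    public String read() {
        return fillQueue(0) ? queue.remove(0) : EOF;
    }
}
```

Once the source is exhausted, both methods keep returning the end-of-file marker, matching the behavior the text describes.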
Generating the abstract syntax trees means assembling the Token sequence from a simple linear structure into a tree structure according to the grammar rules of the language. The abstract syntax tree is defined as an interface named ASTree, and several node classes implement the ASTree interface.
The above method further comprises interpreting and executing the abstract syntax trees one by one through the interpreter and feeding back the execution results. Specifically, the interpreter evaluates each abstract syntax tree by traversing it recursively from the root node to the leaf nodes; every visited node has an evaluation return value, and the return value of every node other than a leaf depends on the return values of its child nodes.
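The recursive evaluation can be sketched as below. Node names and the integer-only value domain are illustrative, not the patent's actual ASTree classes:

```java
// Sketch of recursive AST evaluation: every node answers eval(), and a
// non-leaf node's value is computed from its children's values.
public class AstSketch {
    interface ASTree { int eval(); }

    // Leaf node: carries its own value and has no children.
    static class NumberLiteral implements ASTree {
        final int value;
        NumberLiteral(int v) { value = v; }
        public int eval() { return value; }
    }

    // Non-leaf node: its return value depends on its child nodes.
    static class BinaryExpr implements ASTree {
        final String op; final ASTree left, right;
        BinaryExpr(String op, ASTree l, ASTree r) { this.op = op; left = l; right = r; }
        public int eval() {
            int l = left.eval(), r = right.eval();   // depth-first descent to the leaves
            switch (op) {
                case "+": return l + r;
                case "*": return l * r;
                default: throw new IllegalArgumentException("unknown op " + op);
            }
        }
    }

    // Evaluate the example tree (1 + 2) * 3.
    public static int demo() {
        ASTree tree = new BinaryExpr("*",
            new BinaryExpr("+", new NumberLiteral(1), new NumberLiteral(2)),
            new NumberLiteral(3));
        return tree.eval();
    }
}
```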
The above method further comprises the tokenizer reading the source code line by line. Specifically, when a matched string (excluding leading whitespace) matches comPat, it is a comment; when it is a numeric value, string or identifier, it matches numPat, strPat or idPat respectively. Tokens of determined type are stored in a Token queue to be returned, and the remainder is processed repeatedly by the same method until the source code ends, so that the Lexer splits the source code into a Token queue.
The above method is further described: the node classes implementing the ASTree interface are divided into leaf nodes and non-leaf nodes. Leaf nodes contain no child nodes, while non-leaf nodes may contain child nodes. There are four leaf-node classes, the Name class, the NumberLiteral class, the StringLiteral class and the NullStmnt class, all inheriting from the leaf-node class. Non-leaf nodes inherit from the non-leaf-node class and fall into three categories: sequential-structure classes used for program flow control, branch-structure classes used for expression processing, and loop-structure classes used for function control.
The above method is further described: the Lexer object acquires source code by receiving a Reader object, and specifically provides a read method and a peek method. The read method acquires Tokens one by one from the head of the source code, returning a new Token at each call. The peek method pre-reads Tokens: peek(i) returns the i-th Token after the Token that the read method is about to return. After the source code has been fully read, both the read method and the peek method return the end-of-file Token.
Compared with prior briefing generation methods, this method supports massive heterogeneous data, has a flexible and extensible content generation mode, and produces a stable briefing format with high compatibility. The system has good stability, operability and extensibility.
The inverted-index-based NoSQL database SX404DB is a key-value NoSQL database that uses a star model to describe managed data on top of a Lucene index. With its help, developers need not labor over complicated table construction or database compatibility problems: there are no tables at all in SX404DB, all data is managed flat, and there are neither hierarchical nor constraint relationships between data items. The database easily manages structured, semi-structured and unstructured data, and even supports direct access to Java objects.
The invention realizes a content injection method of DOCX documents with fixed formats.
In briefing writing, format typesetting is time-consuming and laborious, yet this work can be handled entirely automatically by a computer. The DOCX format, as the standard format of Microsoft Office documents, has an extremely high market share in the office-document field, and the document content injection method realized here targets this format. In the processing, the document format can easily be unified simply by injecting the corresponding content into a template according to the Office OpenXML standard. The method can save briefing writers a great deal of typesetting time.
The invention designs a scripting language, DocumentScript, for automatically generating document content, and completes the development of a processing platform for this language in a Java environment.
Automatic generation of briefing content often uses label replacement; this work instead adopts script control. The document content generated this way is more flexible: it supports not only sequential document writing but also conditional or cyclic generation of content. DocumentScript is a scripting language, i.e., an interpreted language that can be executed without being compiled for a specific platform, so controlling content generation with it offers better cross-platform capability and extensibility.
Description of the embodiments
The SX404DB of the invention is a key-value NoSQL database developed in Java and based on inverted-index technology. It builds on an inverted index created with Lucene, realizes an ORM mechanism through dynamic proxy technology, and finally realizes data storage and access in units of Java objects.
All database operations in SX404DB are based on a database session, so a database Session object is created before accessing the database. The system can easily access and manage the whole SX404DB database through one Session object.
Example of data saving: first an entity object of a tobacco-sales-total class is created (the class can be any user-defined class conforming to the JavaBean specification); then the id, type, province and time attributes of the object are set (to 1, Chengdu tobacco and 2016-03-01 in this example); finally the save method of the session object is called to store the object.
Example of a data query: first a Multi condition object is created; then two simple conditions are created (province equals "Chengdu" and time equals "2016-03-01") and added to the previously created combined condition as must conditions (conditions that must hold in the combined query); finally the query method of the session object is called with the combined condition as a parameter to query all tobacco-sales-total objects meeting the corresponding conditions.
Example of data modification: first a simple condition with id equal to 1 is created; then a modification item setting province to "Hunan" is created; finally the update method of the session object is called with the previously created query condition and modification item as parameters, changing the province value of all tobacco-sales-total objects with id 1 in the database to "Hunan".
Example of data deletion: first a simple condition with id equal to 1 is created; then the delete method of the session object is called with this simple condition as a parameter to delete all tobacco-sales-total objects with id 1 from the database.
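The four examples above can be imitated with a toy in-memory stand-in. SX404DB itself is not publicly available, so the class below is a hypothetical mock: records are maps, conditions are predicates, and save/query/update/delete mirror the described Session semantics rather than the real API:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

// Toy in-memory stand-in for the Session API described in the text.
public class SessionSketch {
    private final List<Map<String, Object>> store = new ArrayList<>();

    // save: store a copy of the entity.
    public void save(Map<String, Object> entity) { store.add(new HashMap<>(entity)); }

    // query: return all entities satisfying the (possibly combined) condition.
    public List<Map<String, Object>> query(Predicate<Map<String, Object>> cond) {
        List<Map<String, Object>> out = new ArrayList<>();
        for (Map<String, Object> e : store) if (cond.test(e)) out.add(e);
        return out;
    }

    // update: apply one modification item to every matching entity.
    public int update(Predicate<Map<String, Object>> cond, String key, Object value) {
        int n = 0;
        for (Map<String, Object> e : store) if (cond.test(e)) { e.put(key, value); n++; }
        return n;
    }

    // delete: remove every matching entity, returning how many were removed.
    public int delete(Predicate<Map<String, Object>> cond) {
        int before = store.size();
        store.removeIf(cond);
        return before - store.size();
    }
}
```

A combined condition such as "province is Chengdu and time is 2016-03-01" becomes a conjunction of two predicates, which is the same logical shape as the Multi condition with two must conditions in the query example.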
SX404DB contains 7 packages, each responsible for different functional tasks, as shown in the following table. The convertor, directory, index and properties packages are the 4 bottom-layer packages, responsible respectively for format conversion, index directory management, index query and modification, and basic database configuration. The session, condition and sort packages are the 3 application-layer packages, responsible respectively for database access, condition functions in data operations, and sorting in data queries. The classes in these three packages are mainly provided for third-party applications to access the database.
Program package Description
cn.edu.cug.sx404.database.condition Implementing management of query, modification, and deletion conditions
cn.edu.cug.sx404.database.convertor Implementing conversion of formats between data objects
cn.edu.cug.sx404.database.directory Implementing index directory management
cn.edu.cug.sx404.database.index Implementing queries and modifications to an index
cn.edu.cug.sx404.database.properties Implementing configuration of databases
cn.edu.cug.sx404.database.session Implementing management of database access sessions
cn.edu.cug.sx404.database.sort Implementing ordering functions in data queries
Model relationships of SX404DB's major functional classes: the Session class has several overloaded save, delete, update and query methods, through which a user can access resources in the database. The Session class depends on the Searcher class, the Processer class and the DocumentConverter class; that is, the Session class's query and modification operations on data are implemented by the Searcher and Processer classes respectively, and the data-object format conversion function is implemented by the DocumentConverter class. The ConcurrentDirectory class is in an aggregation relationship with the Searcher and Processer classes; in other words, a ConcurrentDirectory object appears as an attribute in each of them.
Adding, deleting, modifying and querying data are the most basic functions of a database. The basic operations on SX404DB's underlying data are developed on the inverted-index technology of the Lucene framework, in which all complete data objects are stored in units of the Document class. The basic storage unit of the database is a JavaBean (simply put, a Java entity class conforming to object-oriented design principles), but what the database index really stores is not the JavaBean object itself but a Document object that reflects the JavaBean object's characteristics. Each Document object contains several Field members, and it is through these Field members that Document objects reflect the characteristics of JavaBean objects.
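The JavaBean-to-Document mapping can be sketched with reflection. The Document here is a plain map stand-in (not Lucene's `org.apache.lucene.document.Document`), and the entity class and the reflection-over-fields approach are assumptions about the implementation:

```java
import java.lang.reflect.Field;
import java.util.HashMap;
import java.util.Map;

// Sketch: each bean property becomes one named field of a "Document".
public class BeanToDocument {
    public static Map<String, String> convert2Document(Object bean) {
        Map<String, String> doc = new HashMap<>();
        try {
            for (Field f : bean.getClass().getDeclaredFields()) {
                f.setAccessible(true);               // read private bean fields
                Object v = f.get(bean);
                doc.put(f.getName(), v == null ? null : v.toString());
            }
        } catch (IllegalAccessException e) {
            throw new RuntimeException(e);
        }
        return doc;
    }

    // Hypothetical entity class following the JavaBean convention.
    public static class TobaccoSales {
        private int id = 1;
        private String province = "Chengdu";
        private String time = "2016-03-01";
    }
}
```

The reverse direction (Document back to JavaBean) would set the bean's fields from the map in the same reflective loop.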
From a logical-structure point of view, the index created by Lucene consists of Segments, Documents, Fields and Terms. A Lucene index is made up of a series of files with different filename prefixes or suffixes; their specific functions are described in the index-file subfile table.
In SX404DB, access to the underlying Documents is achieved through the Searcher class, which provides data query functions, and the Processer class, which provides data adding, modification and deletion functions.
The Searcher class provides 3 methods that can be invoked, outlined in the table below. The getInstance method is a static method for creating a Searcher instance. The search method queries data: when executed, it returns a Document sequence according to the specified query condition and ordering rule. The getDocumentByDocID method queries a Document object by its ID. Searcher class method summary:
Qualifier and type / Method and description
Document / getDocumentByDocID(int docID): returns the Document object with the specified docID
static Searcher / getInstance(ConcurrentDirectory concurrentDirectory): returns a Searcher instance
List<Document> / search(org.apache.lucene.search.Query query, int begin, int max, Sort sort): searches for Document objects satisfying the condition
The Processer class provides 8 methods for invocation, outlined in the table below. The getInstance method is a static method for creating a Processer instance. The delete method logically deletes data: when it is called, the affected data enters the recycle area, which is cleaned by the clearTrash method. The forceDelete method is a physical deletion method; once data is physically deleted it cannot be restored. The insert method and the update method add and modify data respectively. Processer class method summary:
Qualifier and type / Method and description
boolean / clearTrash(): clears the recycle area
boolean / close(): closes the current Processer object
boolean / delete(Query query): logically deletes data
boolean / forceDelete(Query query): physically deletes data
static Processer / getInstance(ConcurrentDirectory concurrentDirectory): returns a Processer instance
boolean / insert(Document doc): adds a Document object
boolean / insert(List<Document> list): adds Document objects
<T> boolean / update(Query query, Class<T> c, Term[] terms): modifies Document objects
The Searcher and Processer accesses to the underlying data in fact invoke Lucene's three native index-access classes: IndexWriter, IndexReader and IndexSearcher. These three classes can be divided into read-operation and write-operation classes according to the nature of the index operation. IndexReader and IndexSearcher belong to the read-operation classes: IndexReader can extract the corresponding document from the index by its unique ID, and IndexSearcher can query the ID set of matching documents according to specified query conditions. IndexWriter belongs to the write-operation class; it provides the caller with the ability to write the index, and is used to reorganize the index and write it to, for example, disk or memory when a user needs to add, delete or modify data.
Encapsulation and conversion of data objects
The SX404DB bottom layer stores a large number of isolated key-value pairs, and the Lucene framework only provides access to Document objects; therefore, to implement direct access to JavaBean objects, the conversion between Document objects and JavaBean objects must be implemented first.
The conversion of SX404DB data objects is mainly realized by the DocumentConverter class, whose implementation uses the ORM design idea (ORM here abbreviates Object Relational Mapping). DocumentConverter provides several conversion methods, including convert2Document, convert2Obj, convert2Query, castToQuery, castToField and cast. The cast method converts between basic data types (e.g., int, long, double, float) and the String type. castToQuery and castToField convert key-value pairs into Lucene Query and Field objects. convert2Document and convert2Obj convert between JavaBean and Document, and convert2Query converts from JavaBean to Query.
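A cast-style conversion between basic types and String, as the text attributes to the cast method, could look like the sketch below. The dispatch on Class tokens is an assumption about the implementation, not the patent's actual code:

```java
// Sketch of a cast helper converting String values to boxed basic types.
public class CastSketch {
    @SuppressWarnings("unchecked")
    public static <T> T cast(String value, Class<T> type) {
        if (type == Integer.class) return (T) Integer.valueOf(value);
        if (type == Long.class)    return (T) Long.valueOf(value);
        if (type == Double.class)  return (T) Double.valueOf(value);
        if (type == Float.class)   return (T) Float.valueOf(value);
        if (type == String.class)  return (T) value;
        throw new IllegalArgumentException("unsupported type " + type);
    }
}
```

Since the bottom layer stores key-value pairs as strings, such a helper is what lets a Document field be turned back into a typed JavaBean attribute.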
To improve overall database efficiency, SX404DB adopts a lazy data-loading mode: the data in an entity object is loaded at the instant it is accessed. Such a loading mode may also be called dynamic data loading, and is often implemented with dynamic proxy technology. The Java language itself provides a dynamic proxy mechanism to facilitate the proxy pattern, but this mechanism requires the proxy class and the implementation class to implement the same interface, which adds considerable extra work to the application-layer design and makes later extension difficult once the interface is fixed. A more flexible dynamic data-loading mode is therefore realized for SX404DB through the third-party class library CGLIB. In this mode, all entity objects obtained by database queries are processed by the database: their attributes are null before being accessed, the data-loading mechanism is invoked dynamically only upon access, and the objects need not implement any predefined interface. CGLIB is a high-performance bytecode-generation class library that can modify an already-defined class at the bytecode level, realizing a dynamic proxy mechanism independent of any interface.
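The lazy-loading idea can be illustrated with the JDK's own interface-bound dynamic proxies, i.e., the mechanism the text contrasts with CGLIB (CGLIB would do the same without requiring the interface). The Entity interface, loader and counter below are illustrative:

```java
import java.lang.reflect.InvocationHandler;
import java.lang.reflect.Method;
import java.lang.reflect.Proxy;
import java.util.function.Supplier;

// Sketch of lazy attribute loading: the value is fetched from the "database"
// loader only on first access, then cached.
public class LazyProxySketch {
    public interface Entity { String getProvince(); }

    public static int loadCount = 0;  // counts how often the loader actually runs

    public static Entity lazyEntity(Supplier<String> loader) {
        return (Entity) Proxy.newProxyInstance(
            Entity.class.getClassLoader(),
            new Class<?>[] { Entity.class },
            new InvocationHandler() {
                private String cached;          // null until first access
                public Object invoke(Object proxy, Method m, Object[] args) {
                    if (cached == null) { cached = loader.get(); loadCount++; }
                    return cached;
                }
            });
    }
}
```

The attribute stays unloaded (the loader never runs) until the getter is first invoked, which is exactly the access-time loading the text describes; CGLIB achieves the same by subclassing at the bytecode level instead of demanding the Entity interface.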
Processing of thread safety problems:
the framework design of Lucene took thread-safety issues in a distributed environment into account from the start, so a robust lock mechanism exists inside Lucene itself. However, the locks in Lucene are fine-grained and can only guarantee the correctness of individual data read and write operations. In SX404DB, the basic logical unit of storage is not a key-value pair but a JavaBean object, and for such storage of large-granularity objects the basic Lucene lock mechanism cannot guarantee that dirty reads will not occur.
To avoid conflicts between threads, Java provides two thread synchronization mechanisms. The first is to wrap the code blocks that need synchronization with the synchronized keyword, and the second is to achieve thread synchronization through thread locks. SX404DB uses the thread lock mechanism provided by the Java language itself to design a set of thread-safe directory management mechanisms. The whole database provides only one ConcurrentDirectory instance for each file path, and all storage, query, modification, and deletion functions are realized by operating on this ConcurrentDirectory instance. To improve the performance of the database, the thread lock of ConcurrentDirectory adopts a read-write separation mode. In this read-write separated thread lock mode, all read operations can be performed asynchronously, but all write operations must be performed synchronously in a queued manner. This not only ensures thread safety in a multithreaded environment, but also avoids unnecessary queuing of data read operations, thereby greatly improving the performance of the database.
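The read-write separated lock pattern described above can be sketched with the JDK's ReentrantReadWriteLock. The class below is an illustration only: a HashMap stands in for the per-path index directory, and the names are not SX404DB's.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Minimal sketch of read-write separation: reads proceed concurrently under
// the shared read lock, while writes queue behind the exclusive write lock.
public class SafeDirectory {
    private final Map<String, String> data = new HashMap<>();
    private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();

    public void put(String key, String value) {
        lock.writeLock().lock();            // writers are serialized
        try { data.put(key, value); } finally { lock.writeLock().unlock(); }
    }

    public String get(String key) {
        lock.readLock().lock();             // readers run in parallel
        try { return data.get(key); } finally { lock.readLock().unlock(); }
    }
}
```

Multiple threads may hold the read lock simultaneously, but the write lock is granted only when no reader or other writer holds the lock, which matches the queued-write behavior described in the text.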
The most time-consuming tasks in the process of writing many briefings are data statistics and format typesetting. To simplify these tasks, a method for embedding content into a DOCX-format document is proposed, which mainly comprises three steps.
1. Formulate an OOXML template.
2. Extract the information from the Java entity classes carrying it and inject it into the template.
3. Compress the template containing the information into a document that can be viewed and published.
The templates in the first step are written according to the OOXML specification. One of the simplest OOXML documents consists of three parts: a relationship mapping part, a content type definition part, and a main body content part. The most important body content is recorded in the "/word/document.xml" file in the WordprocessingML language. WordprocessingML is a markup language that complies with the XML specification. All tags defined by the language begin with the w: prefix (e.g., <w:document>); most tags appear in pairs, while a few appear as single tags. The whole document content is filled in <w:document></w:document>, which contains a pair of <w:body></w:body> tags, and <w:body></w:body> contains a number of pairs of <w:p></w:p> tags. Each pair of <w:p></w:p> represents a paragraph; it may contain several pairs of <w:r></w:r> tags, and a <w:pPr/> tag describing the paragraph style may be added inside it. Each pair of <w:r></w:r> represents a run of consecutive characters, and may include a <w:rPr/> tag describing the run style and a pair of <w:t></w:t> tags storing the text.
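A minimal WordprocessingML body following the tag nesting just described might look like this; the ${title} placeholder anticipates the template tags injected in the next step, and the namespace is the standard WordprocessingML one.

```xml
<w:document xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">
  <w:body>
    <w:p>                       <!-- one paragraph -->
      <w:pPr/>                  <!-- optional paragraph style -->
      <w:r>                     <!-- one run of consecutive characters -->
        <w:rPr/>                <!-- optional run style -->
        <w:t>${title}</w:t>     <!-- text content; "${title}" is a template tag -->
      </w:r>
    </w:p>
  </w:body>
</w:document>
```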
After the template design is completed, the next job is to inject document content into the template; the class used for template content injection is defined herein as TemplateUtil. The template content injection function is realized by calling the insert method of TemplateUtil, and its implementation has three steps. The first step is to read all the information in the template into memory in the form of a character string. The second step is to search for tags in the loaded template string. This tag lookup can be accomplished by regular expression matching; an expression such as "\$\{[^}]+\}" can match all strings of the form "${...}". The third step is to replace every tag in the loaded template string with the document content mapped to that tag, and write the replaced string back into the template file.
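The three-step insert process can be sketched as follows, assuming the template has already been read into a string. The Map-based signature is an assumption, since the text does not give TemplateUtil's exact parameters.

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Sketch of tag injection: tags of the form "${name}" are located by regular
// expression matching and replaced by the content mapped to that name.
public class TagInjector {
    private static final Pattern TAG = Pattern.compile("\\$\\{([^}]+)\\}");

    public static String inject(String template, Map<String, String> content) {
        Matcher m = TAG.matcher(template);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            // Unmapped tags are replaced by the empty string in this sketch.
            String replacement = content.getOrDefault(m.group(1), "");
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }
}
```

quoteReplacement is used so that content containing "$" or "\" is inserted literally rather than interpreted as a back-reference.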
The last step of the whole DOCX-format document content embedding work is the packaging of the OOXML files. In the OOXML format standard, a DOCX document is encoded in Unicode and compressed in ZIP format according to the OPC (Open Packaging Conventions). The ZIP compression function is implemented herein in a ZIP compression class. Logically, a DOCX document is an OPC package, which is a complete set of parts. Each part is identified by a case-insensitive path name, a string such as "/ppt/slides/slide1.xml" whose segments are separated by the slash "/"; and each part has its particular content type. In terms of physical structure, a ZIP file encapsulated according to the OPC convention is an OPC package: each ZIP file item corresponds to a part of the package, and its path corresponds to the path name of that part. In this OPC package, "/[Content_Types].xml" is used to define the content type of each part. There are also explicit mappings between the parts in the package; this series of mappings is stored in the relationship parts. All mapping relationship parts are named in the form ".../_rels/...rels"; if the path name of a part is "/a/b/c.xml", then the path name of its mapping relationship part is "/a/b/_rels/c.xml.rels". The most important document content in the whole package is recorded in the document part, where the main content of the document is recorded in "/word/document.xml".
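The ZIP packaging step can be sketched with java.util.zip. This in-memory illustration writes each part path as a ZIP entry and deliberately omits the [Content_Types].xml and _rels parts a real OPC package requires; the class and method names are illustrative.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

// Sketch of OPC packaging mechanics: each part path becomes one ZIP entry.
public class OpcZipper {

    public static byte[] pack(Map<String, String> parts) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(bytes)) {
            for (Map.Entry<String, String> part : parts.entrySet()) {
                // OPC part name "/word/document.xml" -> entry "word/document.xml"
                zip.putNextEntry(new ZipEntry(part.getKey().substring(1)));
                zip.write(part.getValue().getBytes(StandardCharsets.UTF_8));
                zip.closeEntry();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    public static List<String> entryNames(byte[] packed) {
        List<String> names = new ArrayList<>();
        try (ZipInputStream zip = new ZipInputStream(new ByteArrayInputStream(packed))) {
            for (ZipEntry e; (e = zip.getNextEntry()) != null; ) names.add(e.getName());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return names;
    }
}
```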
This section gives the design principles and implementation procedure of the scripting language DocumentScript, designed and developed for the automatic document content generation function.
Processing procedure of language processor
The processing procedure of a language processor goes through three basic steps: lexical analysis, syntax analysis, and execution (or generation of machine language). Lexical analysis may also be called word segmentation, i.e., the segmentation of the source code into a number of words (Token). Next, syntax analysis sorts out the hierarchical logical relationships between the words and generates a number of abstract syntax trees (AST, Abstract Syntax Tree). In execution (or generation of machine language), the interpreter interprets and executes the abstract syntax trees one by one, and finally feeds back the execution result.
Word segmentation device design
The first step in implementing a language processor is to implement the word segmenter (Lexer). An unprocessed program source code can be seen as one long character string. The word segmenter splits this long source-code string into Token.
The Token of the DocumentScript language can be divided into four classes: string words, numeric words, identifier words, and the end-of-file word. String words and numeric words are easy to understand: they are character sequences representing strings and numeric values. Note, however, that a sequence wrapped in quotation marks, such as "123", is a string word rather than a numeric word, even though the part between the quotation marks looks like a number. Identifier words include the keywords, braces "{}", brackets "[]", parentheses "()", the semicolon ";", variable names, and so on used in a program. In addition to these three Token types with practical significance, a special end-of-file (EOF) word is defined herein in DocumentScript to identify the end of a code file.
The word is defined herein as an abstract class Token, whose fields and methods are summarized in the following tables. EOF represents the Token at the end of a file, and EOL is the line feed character defined in DocumentScript. isNumber, isIdentifier, and isString are methods used to determine whether a word is of numeric, identifier, or string type; through them the particular type of a Token can be determined. The getNumber and getText methods return the numeric value and the string in the Token object, respectively. In the Token class, all methods other than the getLineNumber method are abstract methods, which are implemented one by one in its subclasses.
Token type field summary table:
Qualifier and type | Field and description
static Token | EOF, the end-of-file Token
static java.lang.String | EOL, the line feed symbol
Token class method summary table:
based on the parent Token, three subclasses are defined herein: strToken, numToken, idToken. Which represent string words, numeric words, and identifier words, respectively. An end of File (EOF) Token is not implemented as a subclass of Token on an external file, but is implemented as a static member embedded in a Token class in a single instance mode, because the function is simply to identify the end of a code file and the structure is simple.
A complete piece of program code can be split into a sequence of these four kinds of Token, and this splitting work is completed by the word segmenter (lexical analyzer). The segmenter is defined herein as the Lexer class, whose word segmentation function is implemented through regular expression matching.
In the Lexer class there are five String-type fields. comPat, numPat, strPat, and idPat are regular expressions for matching comments, numeric Token, string Token, and identifier Token, respectively, while the regexPat expression can match all legal strings in DocumentScript. When the parsing process of the Lexer class is executed, the segmenter reads the source code line by line, checks each line of content from left to right for matches against regexPat, and extracts all matching strings. If a matching string (excluding leading blanks) matches comPat, the string is interpreted as a comment. If the string is a numeric value, a string, or an identifier, it will match numPat, strPat, or idPat. A Token of the determined type is stored in the Token queue to be returned. The remaining part is then processed by the same method, repeatedly, until the source code ends, at which point the Lexer has split the source code into a Token queue. In the Lexer class herein, this process is mainly implemented by the readLine method and the addToken method.
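The matching strategy can be sketched with a single alternation pattern standing in for regexPat. The exact patterns of the real Lexer are not given in the text, so the expression below (comments, integers, quoted strings, identifiers and punctuation) is an assumption.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Sketch of line-by-line matching: one alternation with groups for comments
// (group 2), numbers (3), strings (4) and identifiers/punctuation (5).
public class MiniLexer {
    private static final Pattern REGEX_PAT = Pattern.compile(
        "\\s*((//.*)|(\\d+)|(\"[^\"]*\")|(\\w+|[{}()\\[\\];=+\\-*/<>!]))");

    public static List<String> tokenize(String line) {
        List<String> tokens = new ArrayList<>();
        Matcher m = REGEX_PAT.matcher(line);
        int pos = 0;
        while (pos < line.length()) {
            m.region(pos, line.length());
            if (!m.lookingAt()) break;      // unmatchable input or trailing blanks
            if (m.group(2) == null)         // group 2 is a comment: drop it
                tokens.add(m.group(1));
            pos = m.end();                  // continue with the remaining part
        }
        return tokens;
    }
}
```

The real Lexer would wrap each matched string in the appropriate Token subclass instead of returning plain strings.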
Lexer class field summary table
The methods of the Lexer class are outlined in the following table. The Lexer class has a constructor with a Reader-type parameter, and a Lexer object obtains the source code by accepting the Reader object. Two methods, read and peek, are also defined in the Lexer class, and the process of lexical analysis is driven by these two methods. The read method obtains Token one by one from the head of the source code and returns a new Token each time it is called. The peek method is used to pre-read Token: peek(i) returns the i-th Token after the Token that the read method is about to return. Once the source code has been fully read, both the read method and the peek method return Token.EOF.
Lexer class method summary table:
If word segmentation were all that is needed, the read method provided by the Lexer class would be fully sufficient. However, for the syntax analysis that follows word segmentation, the Lexer class must additionally provide a peek method. Syntax analysis is the process of constructing an abstract syntax tree while retrieving Token. Constructing the syntax tree is a depth-first backtracking process: when a construction error is found midway, several words must be returned for reconstruction. To support this backtracking, a peek method and a buffer queue for storing temporary Token are provided. When the abstract syntax tree is constructed, the Token to be read later is first obtained through the peek method and stored in the buffer queue; the content of the buffer is then examined; finally the Token is obtained through the read method to construct the abstract syntax tree.
Constructing an abstract syntax tree method performance comparison table:
in the concrete implementation of the Lexer, each time it reads, the read method judges whether the buffer queue is empty; if it is, a Token is added to the buffer queue, and finally the Token at the head of the buffer queue is returned and deleted from the queue. The peek method, by contrast, stores the pre-read Token in the buffer queue and returns it as the function return value each time it is executed, without deleting any element of the buffer queue. The operations by which the read and peek methods add Token to the buffer queue rely on the fillQueue method. The fillQueue method has an int-type parameter and a boolean return value: the parameter indicates the number of Token to read in, and the return value indicates whether the buffer queue was successfully filled. In general the return value of the method is true, and false is returned only when the program code has been completely read.
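The read/peek protocol over a buffer queue can be sketched as follows; a plain iterator of strings stands in for the token source, and the EOF marker is illustrative.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Sketch of read/peek buffering: peek(i) pre-reads without consuming,
// read() consumes from the head, fillQueue tops the buffer up on demand.
public class TokenReader {
    public static final String EOF = "<EOF>";
    private final Iterator<String> source;
    private final List<String> queue = new ArrayList<>();   // buffer queue

    public TokenReader(Iterable<String> tokens) { this.source = tokens.iterator(); }

    public String read() {
        if (fillQueue(0)) return queue.remove(0);   // return and delete the head
        return EOF;
    }

    public String peek(int i) {
        if (fillQueue(i)) return queue.get(i);      // pre-read, nothing deleted
        return EOF;
    }

    // Ensure at least i+1 buffered tokens; false once the source is exhausted.
    private boolean fillQueue(int i) {
        while (i >= queue.size()) {
            if (!source.hasNext()) return false;
            queue.add(source.next());
        }
        return true;
    }
}
```

A parser can thus look ahead with peek to decide which grammar rule applies before committing to read.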
Design of parser
After lexical analysis is completed, the next work is to construct the abstract syntax tree: in the process of syntax analysis, the Token sequence is assembled from a simple linear structure into a tree structure according to the grammar rules of the language. The abstract syntax tree is defined as an interface named ASTree, whose methods are outlined in the table below.
ASTree interface method summary table:
ASTree defines only an interface for the abstract syntax tree rather than a concrete class, so to fully describe an abstract syntax tree, a large number of node classes implementing the ASTree interface are also designed; their functions are described in the following table.
Abstract syntax tree node type specification:
Abstract syntax tree node type | Description
ASTLeaf | Leaf node of the abstract syntax tree
ASTList | Non-leaf node of the abstract syntax tree
NumberLiteral | Numeric literal
StringLiteral | String literal
Name | User-defined variable
BinaryExpr | Binary operation expression
PrimaryExpr | Basic expression
NegativeExpr | Negative-value expression
NegateExpr | Negation expression
BlockStmnt | Code block
IfStmnt | If statement
WhileStmnt | While statement
NullStmnt | Empty statement
PostFix | Postfix expression
DefStmnt | Function definition statement
Fun | Function
ParameterList | Formal parameter sequence
Arguments | Actual argument sequence
The node classes implementing the abstract syntax tree may be divided into leaf nodes and non-leaf nodes, defined herein as the ASTLeaf class and the ASTList class, respectively. As the names suggest, a leaf node no longer contains children, while a non-leaf node may contain children. The leaf nodes comprise four classes: the Name class, the NumberLiteral class, the StringLiteral class, and the NullStmnt class, all of which inherit from the ASTLeaf class. The non-leaf nodes inherit from the ASTList class and can be divided into three categories: the first is used for program flow control, and includes the BlockStmnt class, the IfStmnt class, and the WhileStmnt class, which represent the sequential, branching, and loop structures, respectively; the second is used for the processing of expressions; the third is used for the control of functions.
Design of interpreter:
the work after the completion of parsing is the program execution work of the interpreter. Once the abstract syntax tree has been constructed, executing the program is relatively simple: the interpreter only needs to evaluate each abstract syntax tree. This evaluation method recursively traverses the entire abstract syntax tree from the root node to the leaf nodes. Each visited node yields an evaluation return value, and the return value of every node other than a leaf node depends on the return values of its child nodes.
If evaluation is to be performed along the abstract syntax tree, the class corresponding to each node object of the tree must provide an evaluation method. This evaluation method, eval, is defined herein in ASTree in the form public abstract Object eval(Environment env), and it must be implemented by all subclasses that inherit from ASTree. Therefore, by calling only the eval method of the root node object of an abstract syntax tree, the program corresponding to that syntax tree can be executed in full.
The DocumentScript language herein is a scripting language that supports variable definition, so the scope of variables is involved, and an environment object is therefore passed to the eval methods during execution. Briefly, an environment object is a data structure recording the correspondence between variable names and variable values, defined herein as the Environment interface. When a program adds a new variable, a key-value pair consisting of the variable's name and initial value is added to the current environment object; afterwards, if the variable is used again, the program fetches its value from the environment object. If a new value is to be assigned to the same variable, the definition scope of the variable is found first, and the variable value is updated in the environment object of that scope.
The implementation of the Environment interface is completed by the BasicEnv class. In the BasicEnv class, the values object is a HashMap used to store the key-value pairs; the outer object is the parent environment of the current environment. The child environment and the parent environment are in an inheritance relation: the child environment can access variables defined in the parent environment, but the parent environment cannot access variables defined in the child environment. The design of the variable assignment and value-taking functions is very simple; they are realized by the put and get methods. It should be noted that the put method and the putNew method are distinguished. The putNew method directly adds or modifies a variable in the current environment object. The put method first determines the definition scope of the variable to be operated on: if the variable is defined in a parent environment, the variable value in that parent environment is modified; if the variable is defined in the current environment, or is not defined in any environment, the putNew method is called in the current environment to add or modify the variable. The function of finding a variable's definition scope is completed by the where method: it first judges whether the variable is defined in the current environment; if not, it recursively calls the where method of the parent environment to judge whether the variable is defined there; if the variable is not defined in the current environment and the current environment has no parent environment, null is returned.
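The BasicEnv design just described can be sketched as follows; Object values stand in for the script's value types, and the method names follow the text.

```java
import java.util.HashMap;

// Sketch of BasicEnv: values in a HashMap, outer as the parent scope;
// put walks the scope chain via where, putNew always writes locally.
public class BasicEnv {
    private final HashMap<String, Object> values = new HashMap<>();
    private final BasicEnv outer;   // parent environment, null at top level

    public BasicEnv(BasicEnv outer) { this.outer = outer; }

    public Object get(String name) {
        Object v = values.get(name);
        if (v == null && outer != null) return outer.get(name);
        return v;
    }

    public void putNew(String name, Object value) { values.put(name, value); }

    public void put(String name, Object value) {
        BasicEnv e = where(name);          // find the defining scope
        if (e == null) e = this;           // undefined anywhere: define here
        e.putNew(name, value);
    }

    // Recursively locate the environment in which name is defined, else null.
    public BasicEnv where(String name) {
        if (values.containsKey(name)) return this;
        if (outer == null) return null;
        return outer.where(name);
    }
}
```

With this design, assigning to a variable inside a nested scope updates the outer definition rather than shadowing it, which is the behavior the text specifies for put.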
As mentioned above, the eval method declared in ASTree is not implemented there, so the concrete evaluation methods are implemented in the subclasses.
Among the numerous subclasses of ASTree, the Name class, the NumberLiteral class, the StringLiteral class, and the NullStmnt class belong to the leaf node classes, i.e., the subclasses of ASTLeaf. Because Name represents a user-defined variable, its eval method performs the process of taking a value from the environment object, throwing an exception if no definition of the variable exists in the environment object. The NumberLiteral class and the StringLiteral class belong to the literal node classes, so their eval methods simply return the basic value in the Token; the implementation is relatively simple and is not detailed here. The NullStmnt class represents an empty statement and returns no value, so it has no specific eval implementation; instead it directly inherits the eval method of ASTLeaf, which throws an exception once called.
In order to realize the sequence, branch and cyclic flow control functions of the program, all the flow control classes in the non-leaf node class also realize the eval method.
During the execution of the eval method of the IfStmnt class, the eval method of the current object's judging condition (condition) is executed first; if the return value is true, the eval method of the positive code block (thenBlock) is executed, and if the return value is false and the current IfStmnt object contains a negative code block (elseBlock), the eval method of the negative code block is executed. The judging condition, positive code block, and negative code block are obtained by the condition method, the thenBlock method, and the elseBlock method, respectively, whose return values are obtained by calling child(0), child(1), and child(2).
The eval method of the WhileStmnt class is implemented by a loop: each pass first evaluates the loop condition (condition); if the return value is true, the eval method of the loop body (body) is executed, otherwise the loop is exited. The loop condition and the loop body are obtained by calling the condition method and the body method, respectively.
The BlockStmnt class only represents a code block, so its evaluation process simply executes the eval methods of all the nodes it contains in sequence; there is no other complex logic, so the specific implementation of its eval method is not described here.
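The flow-control evaluation described above can be sketched with a small functional Node interface. A Map stands in for the Environment, and the factory methods below are illustrative stand-ins for the actual IfStmnt, WhileStmnt, and BlockStmnt classes.

```java
import java.util.List;
import java.util.Map;

// Sketch of flow-control eval: If picks thenBlock/elseBlock by the
// condition's value, While re-evaluates its condition before each pass,
// Block executes its children in order and returns the last value.
public class FlowEval {
    public interface Node { Object eval(Map<String, Object> env); }

    public static Node ifStmnt(Node cond, Node thenBlock, Node elseBlock) {
        return env -> Boolean.TRUE.equals(cond.eval(env))
                ? thenBlock.eval(env)
                : (elseBlock != null ? elseBlock.eval(env) : null);
    }

    public static Node whileStmnt(Node cond, Node body) {
        return env -> {
            Object last = null;
            while (Boolean.TRUE.equals(cond.eval(env))) last = body.eval(env);
            return last;
        };
    }

    public static Node block(List<Node> children) {
        return env -> {
            Object last = null;
            for (Node c : children) last = c.eval(env);   // sequential execution
            return last;
        };
    }
}
```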
As is well known, an expression must have a return value, so all expression-processing classes among the non-leaf node classes implement the eval method. The definition of an expression is relatively broad: it can be an ordinary literal or an operation expression. The evaluation methods for numeric and string literals have been described above and are not repeated here; the eval implementation of operator expressions is described below.
Operation expressions can be classified into unary operation expressions and binary operation expressions. For unary operation expressions, DocumentScript implements only the evaluation methods of the negative-value expression and the negation expression. Their implementation is simple: the original value is negated arithmetically or logically and then returned, so it is not detailed here. Note, however, that before performing a unary operation the type of the operand must be verified: the arithmetic negative operation can only be applied to numeric objects, and the logical negation can only be applied to integer objects.
The binary operation expression is defined herein as the BinaryExpr class. In the BinaryExpr class, the left, right, and operator methods return the left operand, the right operand, and the operator, respectively. When the eval method is executed, the program first judges whether the current binary operation expression is of the assignment type or the calculation type; if it is of the assignment type, the computeAssign method is called, and if it is of the calculation type, the computeOp method is called. The computeOp method first judges the types of the left and right operands: if they are of string type, the operation is handled directly by the current method, and if they are of numeric type, the operation is handed to the computeNumber method.
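The computeOp dispatch can be sketched as follows, assuming integer arithmetic and string concatenation with "+"; the text does not enumerate the supported operators, so the set below is an assumption.

```java
// Sketch of the computeOp dispatch: string operands are handled directly
// (only "+" as concatenation here), numeric operands go to computeNumber.
public class BinaryEval {

    public static Object computeOp(Object left, String op, Object right) {
        if (left instanceof String || right instanceof String) {
            if (op.equals("+")) return String.valueOf(left) + right;
            throw new IllegalArgumentException("bad string operator: " + op);
        }
        return computeNumber((Integer) left, op, (Integer) right);
    }

    private static Object computeNumber(int l, String op, int r) {
        switch (op) {
            case "+":  return l + r;
            case "-":  return l - r;
            case "*":  return l * r;
            case "/":  return l / r;
            case "==": return l == r;
            case "<":  return l < r;
            case ">":  return l > r;
            default:   throw new IllegalArgumentException("bad operator: " + op);
        }
    }
}
```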
The above-described embodiments are intended to illustrate the present invention, not to limit it, and any modifications and variations made thereto fall within the spirit of the invention and the scope of the claims.

Claims (9)

1. A high-efficiency automatic generation method for cross-platform heterogeneous data profiles, characterized by comprising the following steps,
(1) Processing massive heterogeneous data, wherein the data is centrally managed by adopting an SX404DB key-value type database; the SX404DB key-value type database is a key-value type NoSQL database based on an inverted index technology;
(2) The automatic generation of the briefing content, the dynamic generation of the briefing content controlled by the DocumentScript script system, specifically comprises the following steps:
lexical analysis, namely segmenting a source code into a plurality of words;
syntax analysis, namely sorting out the hierarchical logical relations between the words, and generating a plurality of abstract syntax trees;
generating machine language, namely generating the machine language by an interpretation-type language processor, interpreting and executing the abstract syntax trees one by one with an interpreter, and feeding back the execution result;
(3) The automatic typesetting of the presentation format is completed by injecting content into a format template based on Office OpenXML and compressing the content into a DOCX format document.
2. The method for efficiently and automatically generating cross-platform heterogeneous data profiles according to claim 1, wherein the SX404DB key-value type database is adopted to centrally manage data, specifically, a database Session object is created to initiate a database session to the SX404DB database: an entity object is created first, and then several attributes of the entity object, namely the coding, the type, the region and the time, are set; the query method of the Session object is then called with the combined condition as a parameter to query all entity objects meeting the corresponding condition.
3. The efficient automatic generation method of cross-platform heterogeneous data profiles according to claim 1, wherein the centralized management of data by the SX404DB key-value type database is specifically performed through the following packages contained in SX404DB:
adopting a program package connector to realize format conversion between data objects;
adopting a program package Directory to realize an index Directory management function;
adopting a program package Index to realize the inquiry and modification of the Index;
adopting program packages Properties to realize configuration of a database;
adopting a program package Session to realize the management of a database access Session;
adopting a program package Condition to realize the conditional function in data operations;
adopting a program package Sort to realize the ordering function in data query;
the method comprises the steps that a user accesses resources in the database through a Session class comprising overloaded save, delete, update and query methods, wherein the Session class is in a dependency relationship with the Searcher class, the Processor class and the DocumentConverter class;
the Session class queries data through the Searcher class and modifies data through the Processor class, and realizes data object format conversion through the DocumentConverter class;
the ConcurrentDirectory class is in an aggregation relationship with the Searcher class and the Processor class; an object of the ConcurrentDirectory class appears as an attribute in the Searcher class and the Processor class respectively;
the Processor class provides the following methods for invocation: the delete method logically deletes data; when the delete method is called, the operated data enters a recycle area, and the recycle area is cleaned by the clearTrash method; the forceDelete method is a physical deletion method, and physically deleted data cannot be recovered; the insert method adds data; the update method modifies data;
the whole SX404DB database provides only one ConcurrentDirectory instance for each file path, and realizes storage, query, modification and deletion by operating on the ConcurrentDirectory instance; the thread lock of ConcurrentDirectory adopts read-write separation, and all write operations are executed synchronously in a queued manner.
4. The method for efficiently and automatically generating cross-platform heterogeneous data profiles according to claim 1, wherein the SX404DB key-value type database is adopted to centrally manage data, specifically, all complete data objects are stored in units of the Document class, and each Document object comprises a plurality of Field members.
5. The efficient automatic generation method of cross-platform heterogeneous data profiles according to claim 1, wherein the lexical analysis splits the source code character string into Token by a Lexer class word segmenter, and the word segmentation function is completed by regular expression matching,
the Token are divided into a string word class, a numeric word class, an identifier word class and an end-of-file word class, and the end-of-file Token is embedded in the Token class in singleton mode and realized as a static member;
comPat, numPat, strPat and idPat string-type fields are arranged in the Lexer class as regular expressions for matching comments, numeric Token, string Token and identifier Token respectively, and a regexPat string-type field is also arranged in the Lexer class as an expression matching all legal strings in DocumentScript; when the parsing process of the Lexer class is executed, the word segmenter reads the source code line by line, checks from left to right whether each line of content matches regexPat, and extracts all matched character strings;
the Lexer object obtains source code by accepting the Reader object;
in the process of syntax analysis after word segmentation, the Lexer class provides a peek method; the process of constructing the abstract syntax tree is a depth-first backtracking process, and when a construction error is found midway, several words need to be returned for reconstruction; specifically, a peek method and a buffer queue for storing temporary Token are provided; when the abstract syntax tree is constructed, the Token to be read later is first acquired through the peek method and stored in the buffer queue, the content of the buffer is then judged, and finally the Token is acquired through the read method to construct the abstract syntax tree;
in the concrete implementation of the Lexer class, the read method judges at every read whether the buffer queue is empty; if so, a Token is added to the buffer queue, then the Token at the head of the buffer queue is returned and deleted from the buffer queue; each time the peek method is executed, the pre-read Token are stored in the buffer queue and returned as the function return value, without deleting any element of the buffer queue;
the generation of a plurality of abstract syntax trees specifically comprises assembling the Token sequence from a simple linear structure into a tree structure according to the grammar rules of the language, wherein the abstract syntax tree is defined as an interface named ASTree, and a plurality of node classes realizing the ASTree interface are provided.
6. The method for efficiently and automatically generating the cross-platform heterogeneous data profile according to claim 1, wherein the abstract syntax trees are interpreted and executed one by an interpreter, and the execution result is fed back, specifically, each abstract syntax tree is evaluated by the interpreter, the method of evaluating recursively traverses the whole abstract syntax tree from a root node to a leaf node, each accessed node has an evaluation return value, and the return values of other nodes except the leaf node depend on the return values of child nodes.
7. The method for efficiently and automatically generating cross-platform heterogeneous data profiles according to claim 5, wherein said word segmentation reads the source code line by line; specifically, when the matched character string, excluding leading blanks, matches comPat, the character string is a comment; when the character string is of numeric type, string type, or an identifier, it matches numPat, strPat, or idPat respectively; a Token of the determined type is stored in the Token queue to be returned, and the remaining part is processed repeatedly by the same method until the source code ends, whereby the Lexer splits the source code into a Token queue.
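A minimal Python sketch of the line-by-line, pattern-driven word segmentation in claim 7. The claim names the patterns comPat, numPat, strPat, and idPat but does not give their bodies, so the regular expressions below are assumptions (operators are folded into idPat here purely for illustration):

```python
import re

# Illustrative pattern bodies; only the names come from the claims.
comPat = re.compile(r'//.*')                              # comment
numPat = re.compile(r'\d+')                               # numeric literal
strPat = re.compile(r'"(?:\\.|[^"])*"')                   # string literal
idPat  = re.compile(r'[A-Za-z_]\w*|[+\-*/=<>!]=?|[(){};,]')  # identifier (operators assumed here)

def tokenize_line(line):
    """Split one source line into (kind, text) Tokens, skipping comments."""
    tokens = []
    rest = line.strip()  # exclude the leading blank before matching
    while rest:
        for kind, pat in (("comment", comPat), ("number", numPat),
                          ("string", strPat), ("id", idPat)):
            m = pat.match(rest)
            if m:
                if kind != "comment":          # comments produce no Token
                    tokens.append((kind, m.group()))
                rest = rest[m.end():].lstrip()  # repeat on the remaining part
                break
        else:
            raise SyntaxError("unrecognized input near: " + rest)
    return tokens
```

Matching the comment pattern before idPat ensures `//` is consumed as a comment rather than as two division operators; the loop then repeats on the remainder until the line is exhausted, mirroring the claim's "processed repeatedly until the source code ends".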
8. The efficient automatic generation method of the cross-platform heterogeneous data profile according to claim 5, wherein the plurality of node classes implementing the ASTRee interface are divided into leaf nodes and non-leaf nodes; a leaf node contains no child nodes, while a non-leaf node may contain child nodes; the leaf nodes comprise four classes, the Name class, the NumberLiteral class, the StringLiteral class, and the NullStmnt class, all of which inherit from the leaf node class; the non-leaf nodes inherit from the non-leaf node class and are divided into three classes: the first class serves as the flow control of the program and includes a sequential structure class, a branch structure class, and a loop structure class; the second class serves as the processing of expressions; and the third class serves as functions.
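The node hierarchy of claim 8 can be sketched in Python as follows. Only the interface name ASTRee (spelled as in the claims), the four leaf classes, and the leaf/non-leaf split come from the patent; the base-class names `ASTLeaf`/`ASTList` and the example non-leaf classes are assumptions:

```python
class ASTRee:
    """Sketch of the ASTRee interface: every node can enumerate its children."""
    def children(self):
        raise NotImplementedError

class ASTLeaf(ASTRee):
    """Leaf node class: contains no child nodes, only its Token."""
    def __init__(self, token):
        self.token = token

    def children(self):
        return []

class ASTList(ASTRee):
    """Non-leaf node class: may contain child nodes."""
    def __init__(self, kids):
        self.kids = kids

    def children(self):
        return self.kids

# The four leaf classes named in the claim inherit from the leaf node class.
class Name(ASTLeaf): pass
class NumberLiteral(ASTLeaf): pass
class StringLiteral(ASTLeaf): pass
class NullStmnt(ASTLeaf): pass

# Hypothetical non-leaf classes illustrating the flow-control group.
class IfStmnt(ASTList): pass     # branch structure
class WhileStmnt(ASTList): pass  # loop structure
```

A branch node such as `IfStmnt([Name("x"), NumberLiteral("1"), NullStmnt(None)])` holds its condition and body as children, whereas any leaf reports an empty child list.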
9. The method for efficiently and automatically generating the cross-platform heterogeneous data profile according to claim 5, wherein the Lexer object acquires the source code by accepting a Reader object, and specifically comprises a read method and a peek method; the read method acquires Tokens one by one from the head of the source code, returning a new Token each time it is called; the peek method is used for pre-reading Tokens, and peek(i) returns the i-th Token after the Token to be returned by the read method; after the source code has been read completely, both the read method and the peek method return a Token indicating the end of the source.
CN201810811216.4A 2018-07-23 2018-07-23 Efficient automatic generation method for cross-platform heterogeneous data profile Active CN108959626B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810811216.4A CN108959626B (en) 2018-07-23 2018-07-23 Efficient automatic generation method for cross-platform heterogeneous data profile


Publications (2)

Publication Number Publication Date
CN108959626A CN108959626A (en) 2018-12-07
CN108959626B true CN108959626B (en) 2023-06-13

Family

ID=64464317

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810811216.4A Active CN108959626B (en) 2018-07-23 2018-07-23 Efficient automatic generation method for cross-platform heterogeneous data profile

Country Status (1)

Country Link
CN (1) CN108959626B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109977547A (en) * 2019-03-27 2019-07-05 北京金和网络股份有限公司 Big data bulletin generation method based on dynamic modeling
CN110610068B (en) * 2019-09-16 2021-11-23 郑州昂视信息科技有限公司 Method and device for application isomerization
CN111143403B (en) * 2019-12-10 2021-05-14 跬云(上海)信息科技有限公司 SQL conversion method and device and storage medium
CN111539200B (en) * 2020-04-22 2023-08-18 北京字节跳动网络技术有限公司 Method, device, medium and electronic equipment for generating rich text
CN116450747B (en) * 2023-06-16 2023-08-29 长沙数智科技集团有限公司 Heterogeneous system collection processing system for office data

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101477568A (en) * 2009-02-12 2009-07-08 清华大学 Integrated retrieval method for structured data and non-structured data
CN104615526A (en) * 2014-12-05 2015-05-13 北京航空航天大学 Monitoring system of large data platform
CN105468571A (en) * 2015-11-19 2016-04-06 中国地质大学(武汉) Method and device used for automatically generating report
CN105912633A (en) * 2016-04-11 2016-08-31 上海大学 Sparse sample-oriented focus type Web information extraction system and method
EP3107014A1 (en) * 2015-06-15 2016-12-21 Palantir Technologies, Inc. Data aggregation and analysis system
CN106484767A (en) * 2016-09-08 2017-03-08 中国科学院信息工程研究所 A kind of event extraction method across media
CN106649455A (en) * 2016-09-24 2017-05-10 孙燕群 Big data development standardized systematic classification and command set system


Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Automatic Generation of Synthetic XML Documents; R. Betik et al.; Diplomová práce, Univerzita Karlova; 2015-09-09; pp. 1-113 *
Development of a Hadoop-based RSS content crawling and typesetting system; Li Lei; China Masters' Theses Full-text Database, Information Science and Technology; 2017-03-15 (No. 03); I138-2481 *

Also Published As

Publication number Publication date
CN108959626A (en) 2018-12-07

Similar Documents

Publication Publication Date Title
CN108959626B (en) Efficient automatic generation method for cross-platform heterogeneous data profile
US5659727A (en) Computer program product and program storage device for encoding, storing, and retrieving hierarchical data processing information for a computer system
US7031956B1 (en) System and method for synchronizing and/or updating an existing relational database with supplemental XML data
Nentwich et al. Flexible consistency checking
US5778223A (en) Dictionary for encoding and retrieving hierarchical data processing information for a computer system
US5664181A (en) Computer program product and program storage device for a data transmission dictionary for encoding, storing, and retrieving hierarchical data processing information for a computer system
US6704747B1 (en) Method and system for providing internet-based database interoperability using a frame model for universal database
US6636845B2 (en) Generating one or more XML documents from a single SQL query
US6611844B1 (en) Method and system for java program storing database object entries in an intermediate form between textual form and an object-oriented form
US5295256A (en) Automatic storage of persistent objects in a relational schema
US6606632B1 (en) Transforming transient contents of object-oriented database into persistent textual form according to grammar that includes keywords and syntax
US7634515B2 (en) Data model and schema evolution
US6611843B1 (en) Specification of sub-elements and attributes in an XML sub-tree and method for extracting data values therefrom
US6609130B1 (en) Method for serializing, compiling persistent textual form of an object-oriented database into intermediate object-oriented form using plug-in module translating entries according to grammar
US6785685B2 (en) Approach for transforming XML document to and from data objects in an object oriented framework for content management applications
US6298354B1 (en) Mechanism and process to transform a grammar-derived intermediate form to an object-oriented configuration database
US6598052B1 (en) Method and system for transforming a textual form of object-oriented database entries into an intermediate form configurable to populate an object-oriented database for sending to java program
US6542899B1 (en) Method and system for expressing information from an object-oriented database in a grammatical form
US8397157B2 (en) Context-free grammar
CN107766526B (en) Database access method, device and system
US7792851B2 (en) Mechanism for defining queries in terms of data objects
WO2001057721A2 (en) Dynamic web page generation
US20170255662A1 (en) Database engine for mobile devices
US5687365A (en) System and method for creating a data dictionary for encoding, storing, and retrieving hierarchical data processing information for a computer system
Schwabe et al. Hypertext development using a model‐based approach

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant