WO2024019617A2

WO2024019617A2 - Flowsheet digitization with computer vision, automatic simulation, and flowsheet (auto)completion with machine learning

Info

Publication number: WO2024019617A2
Application number: PCT/NL2023/050385
Authority: WO
Inventors: Artur Maria SCHWEIDTMANN
Original assignee: Technische Universiteit Delft
Priority date: 2022-07-18
Filing date: 2023-07-17
Publication date: 2024-01-25
Also published as: WO2024019617A3; NL2032523B1

Abstract

The present invention is in the field of physical processes, chemical processes, biological processes, and microbiological processes in general, apparatuses for such processes, such as for boiling, for separation, for mixing, for dissolving, for reacting, for controlling, and in particular a process comprising a plurality of such apparatuses and processes or process steps, as well as the interaction between said apparatuses and processes or process steps, such as in terms of flows of chemicals between apparatuses. To indicate such general flow aspects a process flow diagram may be used. The process flow diagram displays the relationship between major equipment of a plant facility and does not show minor details.

Description

Flowsheet digitization with computer vision, automatic simulation, and flowsheet (auto)completion with machine learning

FIELD OF THE INVENTION

The present invention is in the field of physical processes, chemical processes, biological processes, and microbiological processes in general, apparatuses for such processes, such as for boiling, for separation, for mixing, for dissolving, for reacting, for controlling, and in particular a process comprising a plurality of such apparatuses and processes or process steps, as well as the interaction between said apparatuses and processes or process steps, such as in terms of flows of chemicals between apparatuses. To indicate such general flow aspects a process flow diagram may be used. The process flow diagram is aimed to visually display the relationship between major equipment of a plant facility and does not show minor details.

BACKGROUND OF THE INVENTION

In the representation of physical processes, chemical processes, biological processes, and microbiological processes in general, apparatuses for such processes, and the interaction between said apparatuses and processes or process steps, process flow diagrams may be used. A first step leading to a construction of a process plant and its use in the manufacture of a product is typically the conception of a process, typically involving process steps. The process concept may then be visualized by a process flow diagram, representing the process steps, and main details thereof, or likewise, a method of producing. Process design can then proceed on the basis of the process flow diagram chosen. Therein also physical properties of the apparatuses are incorporated. Fig. 1 shows some typical elements and symbols used. The elements of such flow diagrams, as well as aspects thereof, such as implementation, typically comply with one or more of the following standards: ISO 15519-1:2010(en): Specification for diagrams for process industry - Part 1: General rules; ISO 15519-2:2015(en): Specifications for diagrams for process industry - Part 2: Measurement and control; ISO 10628-1:2014(en): Diagrams for the chemical and petrochemical industry - Part 1: Specification of diagrams; ISO 10628-2:2012(en): Diagrams for the chemical and petrochemical industry - Part 2: Graphical symbols; ANSI Y32.11: Graphical Symbols For Process Flow Diagrams (withdrawn 2003); and SAA AS 1109: Graphical Symbols For Process Flow Diagrams For The Food Industry. These process flow diagrams may be used to perform steady-state and non-steady-state heat and mass balancing, sizing and costing calculations, such as for a chemical process. This is considered an essential and core part of process design. Therein nowadays a computer or the like is used, in particular for supporting the calculations, and hence process design.
Typical steps in process design are an initial step, which may be referred to as synthesis, a step for optimizing the process design, which may involve heat and material balance, sizing of process equipment, and cost calculations, a control step for assessing topics such as safety and operability, and a final step, wherein the process design or parts thereof are further optimized in view of a previous step. In optimization structural [physical] elements of the process design can be optimized, as well as particular settings in the process, such as parameters, e.g. temperature, pressure, flow rate, density, etc., in particular in view of interaction between process steps and apparatuses involved. Initially one could change a selection of the apparatus(es) involved, and then one could change the values of parameters, such as temperature and pressure. Parameter optimization is considered to be a more advanced stage. As mentioned, process flow diagrams play an important role in process design.

Typically, process flow diagrams of a process may include various elements, such as operational parameter data (see above), references to a mass balance, major equipment items, connections with other systems, identifications, such as process stream names, process piping, and major bypass and recirculation (recycle) streams. They typically do not include minor elements, such as minor bypass lines, instrumentation and details thereof, controllers like level or flow controllers, pipe classes or piping line numbers, isolation and shutoff valves, maintenance vents and drains, relief and safety valves, and flanges, though this is not a general rule. Process flow diagrams of multiple process units, within a large industrial plant, may as a consequence of the size and complexity usually contain less detail.

Nowadays a process flow diagram can be computer generated, such as from process simulators, using CAD packages, or using flow chart software using a library of chemical engineering symbols. Rules and symbols are available from standardization organizations such as DIN, ISO or ANSI as mentioned above. In view of complexity of a typical process, process flow diagrams may be produced on large sheets of paper. However, many non-digitized versions of process flow diagrams still exist, and often these are used in valuable and critical processes. Process flow diagrams of many commercial processes can be found in literature, specifically in encyclopedias of chemical technology, although some might be outdated. More recent ones can be found on-line. Typically these process flow diagrams relate to a pixel-oriented diagram, that is, wherein the diagram is present as an image as such, without the details of the image being incorporated as separate items or the like. In other words, the meaning of or information relating to various elements in the image in the real world does not form part of the image; as mentioned, often the diagrams are not even digitized at all. Also digitization of small elements in such diagrams may form a problem. Although promising results have been reported from previous studies, some shortcomings of prior research also become apparent. Firstly, machine learning (ML) models in literature are typically trained on data sets from a single source, mostly a company cooperating with researchers, or even on synthesized data sets. Unsurprisingly, the accuracy of such models is near perfection, as the data exhibits little variation. It needs to be acknowledged that retrieving piping and instrumentation diagrams (P&IDs) is not trivial, as companies naturally rarely publish their documentation.
It is however doubtful that such models would generalize well to other data distributions, for instance diagrams generated with other CAD editors, making developed digitization approaches very isolated niche solutions. Secondly, most symbol data sets only consist of a few categories, not reflecting the variety of equipment used in process industries. As a consequence of single source data sets, few different symbols are categorized, leading to a lack of a complete symbol categorization. Thirdly, the amount of data used for training does not reflect the data driven nature of deep learning (DL) models. DL models are commonly trained on big data. Many DL approaches for P&IDs however rely on very little data, with fewer than a hundred diagrams. Again, a possible explanation for this issue is the lack of publicly available data, combined with the time consuming nature of labeling such diagrams. Lastly, while considerable effort has been made towards the task of digitizing P&IDs, to the best of our knowledge DL powered digitization approaches have not been applied to process flow diagrams (PFDs).

WO 2021/145138 A1 recites a display device which acquires data on a plurality of devices installed in a facility, and stores, in an associated data memory unit, corresponding relationships between the devices installed in the facility and components in drawing data in which the devices installed in the facility are drawn as the components. Further, the display device, upon receiving a designation of a specific component among the plurality of components in the drawing data, selects a specific device corresponding to the specific component using the corresponding relationships stored in the associated data memory unit, and displays the data on the specific device and data on a device group having a causal relationship with the specific device in association with each other. The document may be considered as an example of the prior art identified above, showing some of the basic concepts for digitization in a rather basic form.

So analyzing process flow diagrams in terms of e.g. functionality, digitally communicating process flow diagrams, making flow diagrams, appear to be in a stage wherein room for improvement is present.

The present invention relates to an improved system and method for analyzing a (chemical) process and providing a digitized set-up which overcomes one or more of the above disadvantages, without jeopardizing functionality and advantages.

SUMMARY OF THE INVENTION

The present invention relates in a first aspect to a system for analyzing a chemical process, which system in principle can be used for any process, comprising a computer memory provided with a digital representation of a directed graph representation of the chemical process, the graph representation comprising elements selected from apparatuses, flow modifiers, devices, process steps, flows, pipelines, signal lines, pressure regulators, temperature regulators, concentration regulators, chemical species regulators, controllers, and combinations thereof, and interactions between these elements, and a data processor provided with a computer program which, when running on the data processor, provides trained machine learning, which is trained using a selection of a training dataset comprising directed graph representations of chemical processes and/or string representations of the directed graphs (such as SFILES) and resulting directed graphs and nodes representing elements, and annotated versions thereof, and as this typically is the training data set of the object detection algorithm, it typically includes the location of the objects on the image, e.g., through a bounding box, or a pixel-based mask, and the type of equipment, provides the digital representation, which may be regarded as an image, of the directed graph representation of the chemical process in the computer memory as input to the trained machine learning, and the trained machine learning providing in the computer memory the chemical process as directed graph with nodes and edges, which may be considered interconnections between nodes, defining the elements. Basically, a bounding box may be considered a box, whereas the mask may be considered a flexible form based on pixels, so that objects can be cut out accurately. In particular object detection architecture, object detection performance metrics, and skeletonization, are used. Therewith a system is provided which solves one or more of the above disadvantages.
The present system, and likewise method, provide a system that detects unit operations and their connectivity in process flowsheets, such as chemical process flows. A directed graph is made therefrom. Therewith a full digitization is provided. The graph can be read automatically into a process simulation, such as process simulation software. A model of the graph can be created automatically. The graph may be considered as a knowledge graph. In the process of making the graph certain elements may be cut out, such as by using a mask, in particular for cutting out unit operations. A neural network or the like may be used, in particular for learning. In addition auto-completion of to be made graphs, such as of chemical flowsheets, is provided. Therein reinforcement learning and graph representation may be used. A suitable programming environment is Python. No graphical user interface is required. The graph results are found to be more accurate compared to prior art methods, and also more meaningful, that is representing the real environment better. It is also found to scale better.
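
By way of a non-limiting illustration, the directed graph mentioned above may be sketched in Python as follows. The class `FlowsheetGraph`, the node identifiers, and the equipment labels are hypothetical and chosen for illustration only; they are not taken from the present system. Nodes stand for unit operations and directed edges stand for streams, including one recycle stream.

```python
from collections import defaultdict


class FlowsheetGraph:
    """Toy directed graph: nodes are unit operations, edges are streams."""

    def __init__(self):
        self.node_type = {}                   # node id -> equipment type
        self.successors = defaultdict(list)   # node id -> downstream node ids

    def add_unit(self, node_id, equipment_type):
        self.node_type[node_id] = equipment_type

    def add_stream(self, source, target):
        # A stream is a directed edge from an upstream to a downstream unit.
        if source not in self.node_type or target not in self.node_type:
            raise KeyError("both ends of a stream must be known units")
        self.successors[source].append(target)

    def downstream(self, start):
        """Return all units reachable from `start`, in depth-first order."""
        seen, stack, order = set(), [start], []
        while stack:
            node = stack.pop()
            if node in seen:
                continue
            seen.add(node)
            order.append(node)
            stack.extend(reversed(self.successors[node]))
        return order


graph = FlowsheetGraph()
for node_id, kind in [("feed", "raw material"), ("hex-1", "heat exchanger"),
                      ("r-1", "reactor"), ("col-1", "column"),
                      ("prod", "product")]:
    graph.add_unit(node_id, kind)
for src, dst in [("feed", "hex-1"), ("hex-1", "r-1"), ("r-1", "col-1"),
                 ("col-1", "prod"), ("col-1", "hex-1")]:  # last edge: recycle
    graph.add_stream(src, dst)

reachable = graph.downstream("feed")
```

Because the traversal keeps a set of visited nodes, the recycle edge from the column back to the heat exchanger does not cause an endless loop.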

The contribution of this invention is considered manifold. Firstly, inventors developed an extensive catalogue of unit operations in PFDs. As PFDs are only loosely based on a common illustration convention, inventors categorized symbols for unit operations based on their functionality as well as their appearance. Secondly, inventors collected and annotated a large PFD dataset. Inventors mined over 1,000 flowsheets from various sources including scientific literature. Thirdly, inventors developed object detection models that can identify unit operations in PFDs. The present system may be based on a state-of-the-art Faster R-CNN architecture, or a Mask R-CNN architecture. The present results show that the proposed system has competitive performance on the diverse data set. Lastly, inventors adapted a pixel-based search algorithm to the specifics of PFD illustrations, such as different stream intersection illustrations and text in unit operations.

In a second aspect the present invention relates to a method of providing a digitized process set-up, the digitized process set-up with a sequence of at least two process steps, which sequence may be a linear sequence or a circular sequence or multiple cycles or a combination thereof, wherein the at least two process steps are selected from a chemical process step, a physical process step, a biological process step, and a micro-biological process step, in particular wherein process steps are selected from heating, cooling, flowing, reacting, mixing, contacting, depositing, annealing, separating, adding, removing, filtering, crystallizing, phase-separating, distilling, oxidizing, reducing, hydrogenating, de-hydrogenating, polymerizing, poly-condensing, esterifying, alkylating, de-alkylating, aminating, halogenating, sulfonating, nitrifying, de-hydrating, hydrolysing, and melting, comprising optically reading an image of a process set-up, digitizing said optically read process set-up forming a digitized image, which typically comprises pixels, using artificial intelligence, making a directed graph of the digitized image of the process set-up, the directed graph comprising a plurality of unique nodes and at least one [biological-]physical-chemical interaction between each first node and each second node of the plurality of nodes, and optionally at least one direction of said interaction, such as shown in figs. 2a-2d, wherein each node individually is selected from an end node, an intermediate node, and an intersection node, using artificial intelligence, identifying at least one physical object to each node in the directed graph, using artificial intelligence, identifying at least one process path, which may be referred to as interaction, or edge, or connection, between each first node and each second node of the plurality of nodes, and using rule-based ontology, in particular rule-based ontology obtained from a data model, such as ONTOCAPE, supplementing (also referred to as enriching) the directed graph of the digitized process set-up with the at least one process path and identified objects, or vice versa, in particular wherein the process is a chemical process.

In a third aspect the present invention relates to a use of the digitized process set-up for optimizing the process set-up, for forming a digital twin of the process set-up, for linking the process set-up to operational data, or for building a model of the process set-ups.

In a further aspect the present system may comprise instructions for carrying out the present method.

Thereby the present invention provides a solution to one or more of the above mentioned problems.

The present invention is also a topic of to be published scientific papers, entitled “Digitization of chemical process flowsheets using computer vision on big data” and “LEARNING FROM FLOWSHEETS: A GENERATIVE TRANSFORMER MODEL FOR FLOWSHEET COMPLETION”, which references and their contents are incorporated by reference.

Advantages of the present description are detailed throughout the description. References to the figures are not limiting, and are only intended to guide the person skilled in the art through details of the present invention.

DETAILED DESCRIPTION OF THE INVENTION

The present invention relates in a first aspect to a system for analyzing a chemical process.

In an exemplary embodiment of the present method in the process set-up objects are localized and classified, such as a unit operation, an arrow, an intersection, a control unit, and text. An example of such a process flow is given in fig. 3.

In an exemplary embodiment of the present method artificial intelligence is based on a convolutional neural network or a neural network with a transformer architecture. An example of such a process flow is given in fig. 3.

In an exemplary embodiment of the present method artificial intelligence is trained on labelled data. An example of such a process flow is given in fig. 3.

In an exemplary embodiment of the present method each node is provided with supplementary actors, wherein the supplementary actors are selected from chemical species, pressure, temperature, flow, concentration, controls, reactant, catalyst, product, pH values, composition, physical or chemical states, enzyme, biological species, and nucleic acid sequence or part thereof. An example is given in fig. 4.

In an exemplary embodiment of the present method based on the obtained directed graph or supplemented directed graph a layout of the process set-up is made comprising physical objects, the physical objects selected from apparatuses, in particular wherein apparatuses are selected from a tank, a column, a reflux, a reboiler, a boiler, a controller, a valve, a cooler, a mixer, a heater, a heat exchanger, a furnace, a filter, a mixer, a splitter, a phase separator, an absorber, a flash unit, a reactor, a pump, a flow controller, a compressor, a filter, a splitter, and a vessel. An example is given in fig. 5.

In an exemplary embodiment of the present method based on the obtained directed graph or supplemented directed graph a layout of the process set-up is made comprising chemical objects, the chemical objects selected from chemical species, catalysts, solvents, inert species, reactants, carriers, stabilizers, buffers, intermediate products, non-reactants, oxidants, and reductants. An example is given in fig. 6.

In an exemplary embodiment of the present method the directed graph is supplemented with a standard process model. An example is given in fig. 7.

In an exemplary embodiment of the present method nodes or actors are auto-completed. An example is given in fig. 8.

A novel method is provided to learn from (chemical) process flowsheets and provide flowsheet structure recommendations, such as for engineers performing process synthesis. For example, the method may recommend one or multiple process nodes, their connectivity, and attributes or alternative process topologies (referred to as “auto-completion” or “auto-correction”). In this respect inventors created two data sets, the first one consisting of synthetically generated flowsheets and the second one consisting of real flowsheets in graph format. Using the conversion algorithm for the automated conversion between flowsheet graphs and SFILES 2.0 strings, inventors automatically generated the corresponding text-based SFILES 2.0 data sets. The present inventors pre-trained a generative Transformer language model on the data set of synthetically generated flowsheets and fine-tuned it on the data set of real flowsheets. The trained generative Transformer model is capable of learning the grammatical structure of the SFILES 2.0 language and the patterns contained in the flowsheet topologies. Consequently, the results demonstrate that using the trained model for causal language modelling is a strategy to auto-complete flowsheet topologies. The input of the machine learning model may be a graph or string representation of a flowsheet. The output of the machine learning model may be a graph or string representation of a flowsheet or a part of a flowsheet. Using beam search as the decoding strategy yields the highest probability flowsheet completion. On the other hand, if more diverse flowsheet recommendations are preferred, the top-p sampling decoding strategy is a promising addition to beam search.
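
By way of a non-limiting illustration, the difference between highest-probability decoding and top-p sampling may be sketched in Python for a single decoding step. The token names and probabilities below are invented for illustration and are not SFILES 2.0 tokens or model outputs; the helper names `top_p_filter` and `sample_top_p` are likewise hypothetical.

```python
import random


def top_p_filter(probs, p):
    """Keep the smallest set of most-probable tokens whose mass reaches p."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept, mass = [], 0.0
    for token, prob in ranked:
        kept.append((token, prob))
        mass += prob
        if mass >= p:
            break
    # Renormalize so the kept probabilities sum to one.
    return {token: prob / mass for token, prob in kept}


def sample_top_p(probs, p, rng):
    """Draw one token at random from the renormalized nucleus."""
    nucleus = top_p_filter(probs, p)
    tokens = list(nucleus)
    weights = [nucleus[t] for t in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0]


# Hypothetical next-unit distribution after a reactor (values are made up).
next_unit = {"column": 0.5, "flash": 0.3, "heat exchanger": 0.15, "pump": 0.05}

# Greedy choice: the single-step analogue of highest-probability decoding.
greedy = max(next_unit, key=next_unit.get)

# Top-p with p = 0.75 keeps "column" and "flash" and discards the tail.
nucleus = top_p_filter(next_unit, p=0.75)
sampled = sample_top_p(next_unit, p=0.75, rng=random.Random(0))
```

The greedy choice always returns the same unit, whereas sampling from the nucleus can return any of the retained units, which is how more diverse recommendations arise.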

The invention is further detailed by the accompanying figures and examples, which are exemplary and explanatory of nature and are not limiting the scope of the invention. To the person skilled in the art, it may be clear that many variants, being obvious or not, may be conceivable falling within the scope of protection, defined by the present claims.

SUMMARY OF FIGURES

Figures 1, 2a-d, and 3-22 show aspects of the present invention.

DETAILED DESCRIPTION OF FIGURES

Figure 1 shows examples of symbols typically used in process flow diagrams.

Fig. 2a shows a non-limitative example of a process flow diagram, having to a certain extent arbitrary elements shown therein. Figure 2b shows a graph representing the process flow diagram of fig. 2a. Figure 2c shows a fully digitized process flow diagram, according to the graph of fig. 2b, and the process flow diagram of fig. 2a. Figure 2d shows schematically the method of providing a digitized process set-up, the digitized process set-up with a sequence of at least two events, wherein the at least two events are selected from a chemical event, a physical event, a biological event, and a micro-biological event. The process starts with the process flow diagram of fig. 2a, which is digitized. In the process of digitization objects represented in the process flow diagram of fig. 2a are detected. Further a flow path is explored, such that a graph can be made, in particular the graph of fig. 2b. Then the graph of fig. 2b is supplemented or enriched with the elements of fig. 2a, and optional further elements, wherein the elements are selected from physical-chemical interaction between each first node and each second node of the plurality of nodes, from objects within the figure 2a, parameters, etc. as is explained throughout the description and claims. Figure 3 shows an exemplary further process flow diagram. It is an objective of the present invention to localize and classify objects in flow diagrams, such as unit operations, arrows, intersections, and text, to use a deep learning model, which may be based on convolutional neural networks, to further supplement the process flow diagram, and to use a supervised learning approach, such as wherein the model is trained on labeled data, such as that of figure 3.

Figure 4 shows a complex process flow diagram which is digitized through computer vision, according to the invention.

Figures 5-6 show a use of an advanced model with a mask, in addition to the present method or system. The advanced model can identify a pixel-based mask for each object detected. The advanced model may be based on a Mask R-CNN architecture. Therewith basically unit operations are cut out more accurately than in a bounding box approach.

Also, typically it learns better, such as with less data.

Figure 7 shows automatic generation of UniSim models from process flow diagrams.

Figure 8 shows auto-completion of an exemplary process flow diagram. Starting with chemical species H2 and CO2, which are in a first step mixed (dashed oval) the present system provides suggestions for addition of a next step, apparatus, parameters, etc. (dashed-dotted oval). In addition thereto, or as an alternative, typically used elements are provided as optional selections at the right hand side of the screen (dashed dark oval). A user may select items from the pictograms on the right.

The invention although described in detailed explanatory context may be best understood in conjunction with the accompanying figures.

Experiment

The below is an example of how the invention could be implemented in practice. Fig. 2a shows an example flowsheet of a proposed Cumene production plant. The illustration was slightly altered, the flow structure however is kept. Via a procedure known as information extraction inventors automatically retrieved information of the chemical process representation in structured formats from unstructured data through several different methods.

Specifically, an introduction is given to object detection architectures, object detection performance measurement and skeletonization.

For object detection a distinction can be made between one-stage and two-stage detectors. Two-stage detectors contain a model that determines regions of interest with high probabilities of containing objects and a second model that classifies found regions of interest. On the other hand, one-stage detectors consist of a single network model that simultaneously predicts bounding boxes and classifications. Transfer learning refers to the improvement of model learning in one task by transferring knowledge from a related, previously learned task. With transfer learning, a model can initiate the training process on new data distributions with pre-trained weights, shortening training time and possibly leading to superior performance due to convergence to better optima. Backbone models in detection models are usually pre-trained on large datasets such as the ImageNet classification challenge dataset or the Common Objects in Context (COCO) dataset, and during transfer training parts of the network are frozen, meaning their parameters are not updated during training. Data augmentation methods are techniques used to increase the size of a limited dataset by adding modified copies of the data. Many augmentation techniques have been applied to image datasets in the literature, such as geometric transformations (e.g., stretching, skewing), flipping, color changes, cropping, rotation, translation, noise injection, random erasing, blurring, and more. Not all data augmentation techniques may apply to every dataset in every domain. Augmentations could reflect real varieties found in a data distribution. Feature pyramid networks (FPN) are a set of deep CNNs which construct features at different scales while keeping computation feasible. Feature pyramids are an important component in detection systems that facilitate the recognition of objects at different scales.
The main objective of feature pyramids in a model is to allow a neural network to learn high to low-level features and independently make predictions at each level.
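
By way of a non-limiting illustration of data augmentation for object detection, the sketch below flips a small image horizontally and mirrors its bounding-box annotation accordingly. The image, the box, and the helper names `hflip_image` and `hflip_box` are hypothetical; boxes are assumed to be given as (x_min, y_min, x_max, y_max) with exclusive maximum bounds.

```python
def hflip_image(image):
    """Mirror a row-major image (list of rows) left to right."""
    return [list(reversed(row)) for row in image]


def hflip_box(box, image_width):
    """Mirror an (x_min, y_min, x_max, y_max) box; max bounds are exclusive."""
    x_min, y_min, x_max, y_max = box
    return (image_width - x_max, y_min, image_width - x_min, y_max)


image = [
    [0, 0, 0, 0, 0, 0],
    [0, 1, 1, 0, 0, 0],
    [0, 1, 1, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
]
box = (1, 1, 3, 3)  # covers the 2x2 block of ones

flipped_image = hflip_image(image)
flipped_box = hflip_box(box, image_width=len(image[0]))
```

The annotation has to be transformed together with the pixels; otherwise the modified copy would teach the detector a wrong object location.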

The objective of the object detection model is to localize and classify objects within images. Thus, two performances are typically evaluated, the placement of the bounding box around the object, and the classification accuracy of said bounding box. The most common performance evaluation metrics used herein are the Average Precision (AP) and Mean Average Precision (mAP), both of which consider correct, missed and false predictions in their respective calculation. The mAP is the primary metric used to measure a detector’s accuracy over all the object categories in a dataset. The mAP is found dependent on the Intersection over Union (IoU) threshold chosen since it determines when a prediction is considered correct. The Pascal VOC AP metric, also known as AP50, is the mAP calculated at an IoU threshold of 0.5. The COCO mAP metric, known simply as mAP, is the average of mAPs with IoU thresholds in the range of [0.5:0.05:0.95]. Comparing the AP50 to the COCO mAP provides valuable insights into the performances of the classification and bounding box placement tasks individually, as a high AP50 and a low mAP suggest that objects are correctly but imprecisely detected.
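
By way of a non-limiting illustration, the IoU of two boxes and its comparison against the COCO threshold range may be sketched in Python as follows. The two boxes are invented for illustration, and boxes are assumed to be given as (x_min, y_min, x_max, y_max); the full AP calculation over ranked predictions is omitted.

```python
def iou(box_a, box_b):
    """Intersection over Union of two (x_min, y_min, x_max, y_max) boxes."""
    ax0, ay0, ax1, ay1 = box_a
    bx0, by0, bx1, by1 = box_b
    inter_w = max(0.0, min(ax1, bx1) - max(ax0, bx0))
    inter_h = max(0.0, min(ay1, by1) - max(ay0, by0))
    inter = inter_w * inter_h
    union = (ax1 - ax0) * (ay1 - ay0) + (bx1 - bx0) * (by1 - by0) - inter
    return inter / union if union > 0 else 0.0


# COCO-style thresholds 0.50, 0.55, ..., 0.95 (ten values).
COCO_THRESHOLDS = [0.5 + 0.05 * i for i in range(10)]

ground_truth = (0, 0, 10, 10)
prediction = (2, 0, 12, 10)   # box shifted sideways by two pixels

overlap = iou(ground_truth, prediction)           # 80 / 120
hits = [overlap >= t for t in COCO_THRESHOLDS]    # correct at which thresholds
```

In this example the prediction counts as correct at an IoU threshold of 0.5 but is rejected at the stricter thresholds, which corresponds to the case of a high AP50 combined with a lower COCO mAP.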

Skeletonization produces a compact representation of objects in images by reducing them to their medial axis, effectively transforming shapes to curves of a 1-pixel thickness while preserving their connectivities. Figure 9 presents an example of distillation column skeletonizations. Imperfections in skeletonization can be observed when applying it to unit operations. In the digitization of PFDs, skeletonization facilitates the application of a graph search algorithm through a rule-based approach. In the development of efficient ML algorithms through supervised learning methods large amounts of valuable and diverse data for training, testing, and validation were used. As flowsheet digitization represents a gap in current literature, inventors further introduce a novel categorization based on visual and functional features with examples. Process flow diagrams were retrieved by applying the flowsheet recognition algorithm. The algorithm downloads all full text papers from a given source and extracts all images from said source. Then, a CNN classifier decides whether each figure is a flowsheet, or not. Inventors applied the algorithm to diverse sources, such as a number of journals and process engineering education books, and retrieved about one thousand flowsheets. Very few figures were wrongly classified as flowsheets, which is in accordance with the high accuracy of the algorithm. The diversity in data is found imperative, as ML models regularly fail to extrapolate outside their trained data distribution, meaning the object detection algorithm would fail to properly detect unseen ways of illustrating unit operations.

Inventors defined main unit operations in chemical processes, and extended further on to incorporate equipment types and different illustrations. Additionally, class decomposition within unit operation types was utilized to increase model performance and to create a more consistent dataset. Class decomposition describes the method of splitting classes into different, more homogeneous sub-classes, decomposing the detection problem into a larger group of separate classes with similar topological characteristics. Such a technique can benefit supervised learning models by improving the class-to-instance association. Each sub-class exhibits more similar patterns within itself and more distinguishable patterns to other classes. In the context of PFD digitization, the class decomposition reasoning was based on two observations. Firstly, many classes contain clearly identifiable sub-classes of very different illustrations for the same equipment. As an example, the category pump was sub-divided into different categories. Another observation made on the flowsheets was that sub-classes could allow for more detailed information to be extracted from the data. For example, the unit operation categorization proposed in literature was a single valve, while inventors found a large variety of valves with different functionalities, such as control valves or check valves. Thus, further decomposing provided more information about used equipment. The mined flowsheets, comprising actors, objects, nodes, and interactions, were labeled using domain expertise and contextual information. The open-source graphical annotation tool LabelImg was utilized. The quality of data provided to the object detection model is found to directly impact the predicting performance of the model. Thus, correct and consistent annotation of objects in the data is found important. In order to accelerate the annotation process, a semi-automation was employed.
With a first batch of data, a preliminary model was trained and used for interference on unannotated data to create annotations. These were then corrected and used for further training of the model. Inventors found that this approach greatly accelerates the process of annotation, as the model quickly learns to detect the most common unit operations and human correction is only rarely necessary for more uncommon objects.

The digitization approach used may involve several distinct steps from an image to a graph representation. First, an object detection model is used to detect unit operations, such as those of Figure 1. Text as well as arrowheads indicating stream directions may be detected by a second object detection model. The detected bounding boxes of arrowheads and unit operations are filled before skeletonization is applied, which facilitates the skeletonization. With the skeletonized image and the locations of unit operations known, connectivity among unit operations is explored. In the following, inventors discuss the steps of unit operation detection and stream recognition in more detail.

Various kinds of information are encoded in flowsheets. Apart from unit operations, there may be important information contained in text and arrows as well. In total, inventors trained two separate object detection models for different tasks: (1) detection of unit operations and unknown units, and (2) detection of arrows, path intersections, and text. For object detection, the Faster R-CNN architecture was used. The choice of a backbone model is one of the most crucial decisions for performance. Inventors used three different backbone models, which mostly differ in the depth of their architecture. Pretraining the backbone model, even on an unrelated dataset, typically increases model performance, as the backbone model learns to extract distinct features. This helps convergence on a flowsheet dataset even with a limited number of flowsheets. To account for imbalance among categories in the dataset, repeat factor sampling is applied. Repeat factor sampling allows images containing underrepresented categories to be used for training more often, to account for slower learning effects. Repeat factor sampling is especially important for our dataset, as some unit operations are seldom found in literature, while others, such as heat exchangers or pumps, are naturally often present. Hence, without repeat factor sampling, an imbalance in performance can occur. Furthermore, to increase generalization, several augmentation techniques are applied during training. Thus, a set of applicable augmentation methods was identified, and the effect of data augmentation on the object detection model performance was investigated. Specifically, the techniques of flipping, adding noise, blurring, and repetition of rare objects were applied and studied.
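By way of non-limiting illustration, the repeat factor sampling described above may be sketched as follows. The description does not state the exact formula used; the sketch assumes the commonly used formulation in which a category c occurring in a fraction f(c) of the training images receives the factor r(c) = max(1, sqrt(t / f(c))), and an image receives the largest factor among its categories. The threshold t, the function name `repeat_factors`, and the category labels are assumptions made for this example only.

```python
import math
from collections import Counter


def repeat_factors(image_categories, threshold):
    """Return one repeat factor per image.

    image_categories: list of sets, the categories present in each image.
    threshold: frequency t below which a category is treated as rare
               (hypothetical value, not taken from the description).
    """
    n_images = len(image_categories)
    # f(c): fraction of images that contain category c at least once.
    counts = Counter(c for cats in image_categories for c in cats)
    freq = {c: n / n_images for c, n in counts.items()}
    # Category-level factor r(c) = max(1, sqrt(t / f(c))).
    cat_factor = {c: max(1.0, math.sqrt(threshold / f)) for c, f in freq.items()}
    # Image-level factor: the largest factor among the categories in the image.
    return [max((cat_factor[c] for c in cats), default=1.0)
            for cats in image_categories]


# Illustrative labels: pumps and heat exchangers are common, centrifuges rare.
images = [{"pump", "hex"}, {"pump"}, {"hex", "pump"}, {"centrifuge", "pump"}]
factors = repeat_factors(images, threshold=0.5)
```

In this example only the image containing the rare category receives a factor above one, so it would be drawn more often during training than the other three images.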

The detection of unit operations is the first step in the digitization scheme. After unit operations have been successfully detected, their bounding boxes are processed. Bounding boxes with significant overlap, measured as intersection over union, are compared and the one with the lower confidence score is removed. This is necessary because, in rare cases, the object detection algorithm detects the same object twice with different categorizations. Afterwards, detected unit operations with a confidence score lower than a threshold are converted to a category X, indicating a low confidence of the model. The flowsheet image is binarized and then reduced to one-pixel-thin layers of objects, allowing stream recognition. Once the PFD has gone through the first stage, the skeletonized flowsheet is prepared for the graph search algorithm. First, the skeletonized image is represented as a graph in which each pixel is a node. In this graph, each node has a maximum of 8 edges corresponding to the 8 neighboring pixels. Additionally, each node in the graph contains information on its color and on whether or not it is inside an object bounding box. Starting from a unit operation, the program checks for white pixel neighbors along the bounding box border, identifying possible paths. For each path, the algorithm traverses the graph along neighboring white pixels and continues the search. A graphical representation of this procedure is shown in Figure 10. A connection between two objects is established when the algorithm reaches a pixel belonging to a new unit operation. If the exploration reaches a dead end, it creates an "In/Out" stream object, indicating an incoming or outgoing stream of the process. Once all the outgoing paths from a unit operation are explored, the algorithm moves to the next unit and repeats the search, storing information about all detected connections. After the graph search, information is saved on the connections between unit operations. Finally, the graph representation of the flowsheet is constructed using the NetworkX open-source Python package. A graph is created with each unit operation as a node and the streams between them as directed edges. Each edge and node in the graph allows for adding attributes, such as associated text and operating conditions, and can be handled in further processing.
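By way of non-limiting illustration, the graph search over the skeletonized image may be sketched as follows. The sketch is a simplification under stated assumptions: the skeleton is a nested list in which 1 denotes a white pixel, bounding boxes are given as inclusive (x0, y0, x1, y1) tuples, a dead end is taken to be a pixel with at most one onward white neighbor that does not touch any bounding box, and stream direction (which the description derives from detected arrowheads) is not handled. The function names and the plain Python containers are choices made for this example; the description itself constructs the final graph with NetworkX.

```python
from collections import deque

# Offsets of the 8 neighboring pixels.
OFFSETS = [(dx, dy) for dx in (-1, 0, 1) for dy in (-1, 0, 1) if (dx, dy) != (0, 0)]


def owner(x, y, boxes):
    """Return the unit whose bounding box contains pixel (x, y), else None."""
    for name, (x0, y0, x1, y1) in boxes.items():
        if x0 <= x <= x1 and y0 <= y <= y1:
            return name
    return None


def trace_connections(skeleton, boxes):
    """Follow white pixels between bounding boxes and collect connections."""
    height, width = len(skeleton), len(skeleton[0])

    def free_white_neighbors(x, y):
        # White 8-neighbors of (x, y) lying outside every bounding box.
        result = []
        for dx, dy in OFFSETS:
            px, py = x + dx, y + dy
            if (0 <= px < width and 0 <= py < height
                    and skeleton[py][px] == 1 and owner(px, py, boxes) is None):
                result.append((px, py))
        return result

    connections = set()
    for unit, (x0, y0, x1, y1) in boxes.items():
        # Candidate paths start at white pixels next to the box border.
        border = [(x, y) for x in range(x0, x1 + 1) for y in range(y0, y1 + 1)
                  if x in (x0, x1) or y in (y0, y1)]
        starts = {p for b in border for p in free_white_neighbors(*b)}
        visited, queue = set(starts), deque(starts)
        while queue:
            x, y = queue.popleft()
            touching = {owner(x + dx, y + dy, boxes) for dx, dy in OFFSETS} - {None}
            for other in touching - {unit}:
                # Reached a pixel adjacent to a different unit operation.
                connections.add(tuple(sorted((unit, other))))
            onward = free_white_neighbors(x, y)
            if not touching and len(onward) <= 1:
                # Dead end: treated as an incoming or outgoing process stream.
                connections.add((unit, "In/Out"))
            for p in onward:
                if p not in visited:
                    visited.add(p)
                    queue.append(p)
    return connections


# Two hypothetical units A and B joined by a line, plus a stream leaving B.
width, height = 13, 5
skeleton = [[0] * width for _ in range(height)]
for x in list(range(3, 8)) + [11, 12]:
    skeleton[2][x] = 1
boxes = {"A": (0, 1, 2, 3), "B": (8, 1, 10, 3)}
connections = trace_connections(skeleton, boxes)
```

The resulting pairs could then be added as edges to a graph object, with direction assigned from the arrowhead detections.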

For auto-completion the following example is given. It is noted that the subject matter of the present system and method and the auto-completion may overlap, and therefore that elements of these embodiments may be combined.

The present inventors make use of a transformer-model architecture and of decoding strategies used for text generation in natural language processing (NLP). Furthermore, the flowsheet representations used are recapped, namely flowsheet graphs and the SFILES 2.0 notation. The latter is used to represent the flowsheet data in a text-based manner in order to enable the use of NLP models. Transformer-based models have increased performance in several benchmark tasks and have also been applied successfully beyond human language. Text may be processed as a sequence of tokens, whereby the tokens are either words or other chunks of the input sequence. Tokenization is typically the first text processing step in NLP and follows a tokenization strategy. After tokenizing the input sequence, each token is converted to a vector by using a learned numerical embedding. Putting together the vectors of all inputs yields a matrix, called input embedding in the following, which can be processed by the NLP model. In a further example, the original Transformer architecture is a neural sequence translation model consisting of an encoder stack of N = 6 identical layers and a decoder stack of N = 6 identical layers in sequence. The decoder uses the encoder's output and the previously generated outputs to compute the output probabilities for the next token. Each encoder layer contains two sub-layers with subsequent layer normalization. Each decoder layer contains three sub-layers with subsequent layer normalization. Since recurrent components are completely removed in the Transformer architecture, positional encoding is applied before input and output embeddings are passed to the encoder and decoder, respectively. Positional encoding ensures that the information on the order of tokens in the sequence is taken into account. The core components of the Transformer architecture are the attention sub-layers. The calculation of attention takes a query vector q, key vector k, and value vector v for each input token and compares all queries against all keys, resulting in scores for query-key compatibility. The compatibility scores are then used as weights to calculate the attention output as a weighted sum of the values. In practice, the attention is computed for all inputs of an input sequence in parallel, putting together all query, key, and value vectors in the query matrix Q, key matrix K, and value matrix V. This finally yields a matrix as attention output. In the original architecture, multi-head attention is used as self-attention layers in the encoder, as masked self-attention in the decoder, and as encoder-decoder attention to combine the vector embedding of the encoder with the previous decoder outputs. Here, self-attention means that query, key, and value matrices are calculated from the same input sequence. Therefore, the computed attention represents each token and its meaning in the sequence. Self-attention in the encoder considers both the left and right context of each token (bidirectional). In contrast, in the case of masked self-attention in the decoder, only the left context is used, meaning that subsequent positions of each token are masked out (unidirectional). For causal language modeling with a decoder-only architecture, a GPT-2-like model architecture containing only a decoder stack is used. Each decoder layer consists of a masked multi-head self-attention sub-layer and a feed-forward sub-layer. Since the encoder is left out, the encoder-decoder attention sub-layer is left out, too. Several decoding strategies may be used.
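By way of non-limiting illustration, the attention calculation described above may be sketched for a single head as follows. The sketch assumes the scaled dot-product formulation of the original Transformer architecture, in which query-key scores are divided by the square root of the key dimension and normalized with a softmax; the learned projections that produce Q, K, and V and the multi-head combination are omitted. The function names and the toy vectors are assumptions made for this example.

```python
import math


def softmax(scores):
    """Normalize scores into weights that sum to one."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values, causal=False):
    """Single-head attention; each argument is a list of equal-length vectors.

    With causal=True, position i only attends to positions 0..i, which
    corresponds to the masked (unidirectional) self-attention of the decoder.
    """
    d_k = len(keys[0])
    outputs = []
    for i, q in enumerate(queries):
        limit = i + 1 if causal else len(keys)
        # Compatibility of this query with every visible key.
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k)
                  for k in keys[:limit]]
        weights = softmax(scores)
        # Attention output: weighted sum of the visible value vectors.
        outputs.append([sum(w * v[j] for w, v in zip(weights, values[:limit]))
                        for j in range(len(values[0]))])
    return outputs


# Self-attention: queries, keys, and values come from the same sequence.
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
bidirectional = attention(x, x, x)            # encoder-style
unidirectional = attention(x, x, x, causal=True)  # decoder-style
```

With the causal mask, the first token can only attend to itself, so its output equals its own value vector; without the mask, every output mixes information from the whole sequence.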

For auto-completion, the following example is given in Figure 11, relating to a simple chemical process flowsheet with branchings, a recycle stream, and different mass trains. With the above method, Figure 12 is obtained, being a graph representation of the flowsheet in Figure 11. In the SFILES 2.0 string, two consecutive unit operations imply a normal stream connection. In the case of a branching, such as after a distillation column, all but the last branch are noted in brackets. Recycles are noted by using numbers # to reference the recycle start node and <# to reference the recycle end node. Furthermore, tags in braces are used to indicate whether the branch is a top or bottom product. In the case of converging branches, the second branch is inserted in the string, surrounded by <&| and &|. Multi-stream heat exchangers are separated into one node per stream compartment and marked with a number in braces, capturing which streams are heat integrated. In an example, inventors subdivided flowsheets into the following sub-process categories: Initialization: Feed(s); Reaction; Thermal separation (distillation, rectification); Countercurrent separation (absorption, extraction); Filtration (gas, liquid); Centrifugation; and End: Purification.
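By way of non-limiting illustration, a string in the notation described above may be split into tokens as sketched below. The sketch only covers the elements named in this description (unit operations in parentheses, tags in braces, branch brackets, recycle references # and <#, and the <&| and &| markers); it is not a complete SFILES 2.0 parser. The regular expression, the function name `tokenize`, the example strings, and the unit abbreviations in them are hypothetical and are not taken from Figures 11 or 12.

```python
import re

# One alternative per notation element named in the description.
TOKEN_PATTERN = re.compile(
    r"\([^()]*\)"   # unit operation, e.g. (dist)
    r"|\{[^{}]*\}"  # tag in braces, e.g. top or bottom product
    r"|<&\|"        # start of an inserted converging branch
    r"|&\|"         # end of an inserted converging branch
    r"|<\d+"        # recycle end reference
    r"|\d+"         # recycle start reference
    r"|\[|\]"       # brackets around all but the last branch
)


def tokenize(sfiles):
    """Split a string into tokens; raise ValueError on unknown characters."""
    tokens, position = [], 0
    for match in TOKEN_PATTERN.finditer(sfiles):
        if match.start() != position:
            raise ValueError(f"unexpected text at index {position}")
        tokens.append(match.group())
        position = match.end()
    if position != len(sfiles):
        raise ValueError(f"unexpected text at index {position}")
    return tokens


# Hypothetical strings: a branching with a recycle, and a converging branch.
branching = tokenize("(raw)(r)<1(dist)[{tout}(prod)]{bout}(splt)1(prod)")
converging = tokenize("(raw)(mix)<&|(raw)(pp)&|(r)(prod)")
```

A token list of this kind is what a learned numerical embedding would subsequently map to vectors.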

As illustrated in Figure 13, the last three blocks relate to a procedure for multiple branches. The blocks represent, from left to right: Initialize graph with feed(s); First sub-process category + pattern in category; Next sub-process category for each stream + pattern in category; and Purification of stream. Optional: random heat integration or recycle. After initializing the flowsheet graph with raw materials, including feed preprocessing, the selection of the first sub-process, excluding purification, is a Markov transition with fixed probabilities (transition probabilities do not depend on previous unit operations). Within each sub-process, we further sample from a set of patterns (not shown here) specifying how the inlet and outlet stream(s) are processed, e.g., with additional temperature or pressure change unit operations. Also, we include design heuristics such as adding recycles, performing heat integration in the reaction sub-process, or adding reactants. In general, the sub-processes lead to several outlet streams, in the following referred to as branches. For each branch, we transition to the "Next sub-process" state, followed by a Markov transition to the next sub-process. This selection differs from the first sub-process selection by the additional purification sub-process. Note that once a branch reaches the purification step, it is determined to end as a product. Once every branch has ended in the purification step, the flowsheet graph generation is complete.
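By way of non-limiting illustration, the Markov-type generation procedure described above may be sketched as follows. All numerical values are assumptions made for this example: the description states that the transition probabilities are fixed but does not disclose them, and the number of outlet branches per sub-process and the cap on the number of steps are likewise hypothetical. Patterns within sub-processes, recycles, and heat integration are omitted.

```python
import random

FIRST_STEPS = ["Reaction", "Thermal separation", "Countercurrent separation",
               "Filtration", "Centrifugation"]
# Later selections additionally allow purification, which ends a branch.
NEXT_STEPS = FIRST_STEPS + ["Purification"]
NEXT_WEIGHTS = [0.15, 0.15, 0.10, 0.10, 0.10, 0.40]  # hypothetical, fixed
# Hypothetical number of outlet streams (branches) per sub-process.
OUTLETS = {"Reaction": 1, "Thermal separation": 2, "Countercurrent separation": 2,
           "Filtration": 2, "Centrifugation": 2}


def generate_branches(rng, max_steps=4):
    """Return all branches as lists of states from feed to purification."""
    # The first sub-process is drawn without the purification option.
    first = rng.choice(FIRST_STEPS)
    open_branches = [["Feed", first]]
    finished = []
    while open_branches:
        branch = open_branches.pop()
        for _ in range(OUTLETS[branch[-1]]):
            if len(branch) - 1 >= max_steps:
                # Cap added for this sketch so that generation always stops.
                step = "Purification"
            else:
                # The weights do not depend on earlier unit operations.
                step = rng.choices(NEXT_STEPS, weights=NEXT_WEIGHTS)[0]
            extended = branch + [step]
            if step == "Purification":
                finished.append(extended)
            else:
                open_branches.append(extended)
    return finished


branches = generate_branches(random.Random(0))
```

Generation is complete once the list of open branches is empty, that is, once every branch has reached the purification state.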

Figures 14-15 show a completed flowsheet using beam search. Figure 16 schematically illustrates the auto-completion of flowsheets using the Generative Flowsheet Transformer. Inventors achieve this by specifying an input sequence in SFILES 2.0 that represents the incomplete flowsheet and passing it to the Generative Flowsheet Transformer, which auto-completes the sequence in the SFILES 2.0 language. The completed flowsheets correspond to the SFILES 2.0 sequences completed by the Generative Flowsheet Transformer. Figures 17-21 show completed flowsheets using top-p sampling.
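By way of non-limiting illustration, the top-p sampling decoding strategy mentioned above may be sketched as follows. At each step, the smallest set of most probable next tokens whose cumulative probability reaches p is kept, and the next token is drawn from that set in proportion to the retained probabilities. The next-token probabilities, the token names, and the value of p shown here are hypothetical; in practice the probabilities would come from the trained model.

```python
import random


def nucleus(probs, top_p):
    """Return the most probable (token, probability) pairs reaching top_p."""
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)
    kept, cumulative = [], 0.0
    for token, p in ranked:
        kept.append((token, p))
        cumulative += p
        if cumulative >= top_p:
            break
    return kept


def sample_top_p(probs, top_p, rng):
    """Draw one token from the nucleus, weighted by its probabilities."""
    kept = nucleus(probs, top_p)
    tokens = [token for token, _ in kept]
    weights = [p for _, p in kept]
    # random.choices rescales the weights, so no explicit renormalization.
    return rng.choices(tokens, weights=weights)[0]


# Hypothetical next-token distribution after an incomplete sequence.
next_token_probs = {"(dist)": 0.5, "(flash)": 0.3, "(hex)": 0.15, "(pp)": 0.05}
candidates = [token for token, _ in nucleus(next_token_probs, top_p=0.75)]
chosen = sample_top_p(next_token_probs, top_p=0.75, rng=random.Random(0))
```

In contrast to beam search, which keeps the highest-scoring sequences, sampling of this kind can yield different completions for the same incomplete flowsheet.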

Table 1/Fig. 22 shows exemplary unit operations and abbreviations in SFILES 2.0.

It should be appreciated that for commercial application it may be preferable to use one or more variations of the present system, which would be similar to the ones disclosed in the present application and are within the spirit of the invention.

Claims

1. A system for analyzing a chemical process, comprising: a computer memory provided with digital representation of a directed graph representation of the chemical process, the graph representation comprising elements selected from apparatuses, flow modifiers, devices, process steps, flows, pipelines, signal lines, pressure regulators, temperature regulators, concentration regulators, chemical species regulators, controllers, and elements thereof, and combinations thereof, and interactions between these elements, and a data processor provided with a computer program which, when running on the data processor,

-provides trained machine learning, which is trained using a selection of a training dataset comprising directed graph representations of chemical processes and/or string representations of the directed graphs and resulting directed graphs and nodes representing elements, and annotated versions;

-provides the digital representation of the directed graph representation of the chemical process in the computer memory as input to the trained machine learning, and

- the trained machine learning providing in the computer memory the chemical process as directed graph with nodes and edges defining the elements.

2. A method of providing a digitized process set-up, the digitized process set-up with a sequence of at least two process steps, wherein the at least two process steps are selected from a chemical process step, a physical process step, a biological process step, and a micro-biological process step, in particular wherein process steps are selected from heating, cooling, flowing, reacting, mixing, contacting, depositing, annealing, separating, adding, removing, filtering, crystallizing, phase-separating, distilling, oxidizing, reducing, hydrogenating, dehydrogenating, polymerizing, poly-condensing, esterifying, alkylating, de-alkylating, aminating, halogenating, sulfonating, nitrifying, de-hydrating, hydrolysing, and melting, comprising optically reading an image of a process set-up, digitizing said optically read process set-up forming a digitized image, using artificial intelligence, making a directed graph of the digitized image of the process set-up, the directed graph comprising a plurality of unique nodes and at least one biological-physical-chemical interaction between each first node and each second node of the plurality of nodes, and optionally at least one direction of said interaction, wherein each node individually is selected from an end node, an intermediate node, and an intersection node, using artificial intelligence, identifying at least one physical object to each node in the directed graph, using artificial intelligence, identifying at least one process path between each first node and each second node of the plurality of nodes, and using rule-based ontology, in particular rule-based ontology obtained from a data model, supplementing the directed graph of the digitized process set-up with the at least one process path and identified objects, or vice versa, in particular wherein the process is a chemical process.

3. The method of providing a digitized process set-up according to claim 2, wherein in the process set-up objects are localized and classified, such as a unit operation, an arrow, an intersection, a control unit, and text.

4. The method of providing a digitized process set-up according to any of claims 2-3, wherein artificial intelligence is based on a convolutional neural network or a neural network with a transformer architecture.

5. The method of providing a digitized process set-up according to any of claims 2-4, wherein artificial intelligence is trained on labelled data.

6. The method of providing a digitized process set-up according to any of claims 2-5, wherein each node is provided with supplementary actors, wherein the supplementary actors are selected from chemical species, pressure, temperature, flow, concentration, controls, reactant, catalyst, product, pH values, composition, physical or chemical states, enzyme, biological species, nucleic acid sequence or part thereof.

7. The method of providing a digitized process set-up according to any of claims 2-6, wherein based on the obtained directed graph or supplemented directed graph a layout of the process set-up is made comprising physical objects, the physical objects selected from apparatuses, in particular wherein apparatuses are selected from a tank, a column, a reflux, a reboiler, a boiler, a controller, a valve, a cooler, a mixer, a heater, a heat exchanger, a furnace, a filter, a mixer, a splitter, a phase separator, an absorber, a flash unit, a reactor, a pump, a flow controller, a compressor, a filter, a splitter, and a vessel.

8. The method of providing a digitized process set-up according to any of claims 2-7, wherein based on the obtained directed graph or supplemented directed graph a layout of the process set-up is made comprising chemical objects, the chemical objects selected from chemical species, catalysts, solvents, inert species, reactants, carriers, stabilizers, buffers, intermediate products, non-reactants, oxidants, and reductants.

9. The method of providing a digitized process set-up according to any of claims 2-8, wherein the directed graph is supplemented with a standard process model.

10. The method of providing a digitized process set-up according to any of claims 2-9, wherein nodes and/or actors are auto-completed.

11. Use of the digitized process set-up for optimizing the process set-up, for forming a digital twin of the process set-up, for linking the process set-up to operational data, or for building a model of the process set-ups, in particular the digitized process set-up obtained by the method according to any of claims 2-10.

12. The system according to claim 1, comprising instructions for carrying out the method of any of claims 2-10.

13. The system according to claim 1 and/or the method according to any of claims 2-10, further comprising one or more elements according to the description, in particular according to the examples, more in particular using one or more of object detection architecture, object detection performance metrics, skeletonization, processing a bounding box, processing a mask, using a diverse variety of data sources, using data categorization, using data annotation, using labeling of objects, using labeling of actors, repeating one or more steps, unit operation detection, stream recognition, factor sampling, augmentation of objects and/or actors, using pixels, using artificial intelligence-assisted process synthesis, using a transformer-model architecture, using natural language processing, using decoding, tokenization, and numerical embedding.