WO2022113142A1 - Method for word processing and for generating a text document - Google Patents

Method for word processing and for generating a text document

Info

Publication number
WO2022113142A1
Authority
WO
WIPO (PCT)
Prior art keywords
textual
document
groups
starting
word
Prior art date
Application number
PCT/IT2021/050382
Other languages
French (fr)
Inventor
Davide ROSSETTI
Original Assignee
Aixcellence S.R.L.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Aixcellence S.R.L. filed Critical Aixcellence S.R.L.
Publication of WO2022113142A1 publication Critical patent/WO2022113142A1/en

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/34Browsing; Visualisation therefor
    • G06F16/345Summarisation for human users
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis

Definitions

  • the present invention relates to a method for generating a textual document, a machine for generating a textual document and a program for word processing in natural languages. Said method and the related application tools are conveniently used for the automatic generation of documents, and particularly for the generation of a document starting from one or more starting documents.
  • a very common example is that of the search function, through which a user enters a text string and the machine processor returns any matches within the text; this information is vectorial/multiple and provides the user with a series of positions in the text where said string was found.
  • the object of the present invention is to make available a method, a machine and a computer program which overcome the drawbacks of the known art mentioned above. Said object is fully achieved by the method, by the machine and by the word-processing program that are the objects of the present invention, which are characterized by the contents of the claims reported below.
  • the present invention provides a method for generating a textual target document starting from a first and a second starting document. It should be noted that the present invention also provides a method for generating a textual target document starting from a single source document.
  • the method according to the proposed invention comprises a step of receiving the first starting document, including a first group of words. In one embodiment, the method comprises a step of receiving the second source document, including a second group of words.
  • the method comprises a step of separating the first group of words into a first plurality of textual groups. Said separation step is preferably performed on the basis of a punctuation of the first starting document.
  • the method comprises a step of separating the second group of words into a second plurality of textual groups. Said separation step is preferably performed on the basis of a punctuation of the second starting document.
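The punctuation-based separation described in the two points above can be sketched as follows; the function name and the use of a regular expression are illustrative assumptions, not part of the patent, and only period and comma are treated as delimiters, as the description suggests.

```python
import re

def separate_textual_groups(document: str) -> list[str]:
    # Split the starting document into textual groups at periods
    # and commas, as in the punctuation-based separation step.
    parts = re.split(r"[.,]", document)
    return [p.strip() for p in parts if p.strip()]

first_source = ("The cell is the morphological-functional unit of living "
                "organisms. Living beings can be animals or plants.")
print(separate_textual_groups(first_source))
# ['The cell is the morphological-functional unit of living organisms',
#  'Living beings can be animals or plants']
```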
  • the method comprises a linguistic identification step (lexical analysis, tokenization).
  • in the linguistic identification step, each word of the textual groups of said first and/or second plurality of textual groups is associated with an information datum.
  • the information data is representative of a function covered by the word within the respective textual group.
  • the information data is determined on the basis of a reference list including a plurality of pairs, each including a word and a corresponding associated information data.
  • information data is information about the role of the word in the sentence, including the role of noun, adjective, verb, adverb, article or preposition.
  • each word of the textual groups of said first and / or second plurality of textual groups is associated with relational data.
  • Relational data is representative of a relationship between the word and the other words of the corresponding textual group.
  • the relational data are defined (deduced) on the basis of proximity data (i.e. from the topology within the text), representative of a relative position of each word with respect to the other words of the corresponding textual group.
  • relational data is information about the relationship between one word and another.
  • a relational datum of an adjective is the noun (or a reference to the noun) to which said adjective refers.
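A minimal sketch of the linguistic identification step described above, assuming a tiny hand-made reference list and a proximity rule that links each adjective to the nearest following noun in the same textual group; both the list and the rule are illustrative simplifications, not the patent's actual mechanism.

```python
# Illustrative reference list: pairs of word and information datum.
REFERENCE_LIST = {
    "the": "article", "cell": "noun", "comprises": "verb",
    "a": "article", "permeable": "adjective", "membrane": "noun",
}

def identify(textual_group: str):
    words = textual_group.lower().rstrip(".").split()
    # Information datum: the function of the word in the group.
    tagged = [(w, REFERENCE_LIST.get(w, "unknown")) for w in words]
    # Relational datum, deduced from proximity data: each adjective
    # is related to the nearest following noun in the same group.
    relational = {}
    for i, (word, info) in enumerate(tagged):
        if info == "adjective":
            for later_word, later_info in tagged[i + 1:]:
                if later_info == "noun":
                    relational[word] = later_word
                    break
    return tagged, relational

tagged, relational = identify("The cell comprises a permeable membrane.")
print(relational)  # {'permeable': 'membrane'}
```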
  • the linguistic identification step is performed via a previously trained neural network.
  • the neural network used here is a computational mathematical model based on biological neural networks. This model includes a group of information interconnections made up of artificial neurons and / or processes. It is an adaptive system that changes its structure based on the information flowing through the network itself during the learning phase. In practical terms, the neural network is an organized non-linear structure of statistical data. Here it is used to simulate complex relationships between input and output texts.
  • the neural network used here receives the textual information on a layer of input nodes (processing units), each of which is connected with numerous internal nodes, organized in several levels. Each node processes the received signals and transmits the result to subsequent nodes.
  • Using a neural network has several advantages, listed below. Neural networks as they are built work in parallel and are therefore able to process a lot of data.
  • the neural network itself is not without defects: the models produced by neural networks, even if very efficient, cannot be explained in human symbolic language, and the results must be accepted "as they are" (neural networks are often considered a "black box"); unlike an algorithmic program, where each phase of the path from input to output is clearly traceable, a neural network offers no such traceability.
  • the use of the neural network allows, in the event that the reference list (word list), initially empty and progressively populated on the basis of the elaborated texts, does not include a word from the first and/or second source text, that word to be inserted in the reference list, with the respective information datum, which the neural network knows thanks to the previous training carried out on the starting dictionaries.
  • the method comprises an alphabetical ordering of the reference list, following the insertion of the new word in the reference list.
  • the method comprises a step of processing the textual groups of said first plurality. In one embodiment, the method comprises a step of processing the textual groups of said second plurality. Said processing step is performed through one or more pre-established logical and algebraic operators.
  • the processing step takes place between the textual groups of said first plurality or between the textual groups of said second plurality. In a preferred embodiment, the processing step takes place between the textual groups of the first and second plurality, so as to perform operations between the first and second document.
  • the method comprises a step of generating a third plurality of textual groups, defining the target document, on the basis of the processing step (logical/algebraic computation). Therefore, the destination document can be the result of elaborations carried out only on the first source text, or it can be the result of cross-processing between the first and the second source document.
  • cross-processing allows the generation of a target document that includes the content of both the first document and the second document, specifically combined on the basis of the predefined logical operators available to the processing itself. This allows combinations of concepts to be made that a human being, with his cognitive effort, would be limited in making, if only for the computing resources available to the processor. This could provide the researcher with a new concept derived from the specified elaborations which, albeit in a raw state, stimulates the researcher in the search for new solutions.
  • the method steps described for the first and second source documents are replicable for a number of documents greater than two.
  • the method is able to process a number of source documents greater than or equal to one (and in particular greater than or equal to two). This further increases the possibilities of identifying possible innovative ideas defined by the coupling of concepts introduced in different documents.
  • the method comprises a coding step.
  • each textual group of said first and / or second plurality of textual groups is encoded according to a predetermined information transmission protocol.
  • the predetermined information transmission protocol includes a plurality of coding rules.
  • the coding phase allows to generate a corresponding identification string.
  • the identification string of each textual group is representative of each word, of the information data of each word and of the relational data of each word of the corresponding textual group.
  • the coding step is advantageous for two fundamental reasons. On the one hand, it allows the textual groups (also post-processing) to be transmitted in an encoded way, increasing the security of data transmission: the data, not being in plain text, cannot simply be intercepted and read. Furthermore, the coding allows any algebraic (i.e. logical) operations to be performed with greater speed and repeatability than would be possible with plain words.
  • the identifying string comprises content characters.
  • Each content character identifies a corresponding word and / or its informative data.
  • the content character is an (alphanumeric) character that identifies a word and its information datum.
  • the content character is a number, which corresponds to a progressive number in the reference list (reference data), in which the plain word and its information data are reported, i.e. if the word is a noun or other type of information.
  • the content characters are separated by separator characters.
  • relational data are defined as a function of the sequential position of the content characters in the identification string.
  • relational data are determined on the basis of the reciprocal position between two content characters. For example, if the content character relating to an adjective is placed immediately after a noun content character (separated by a period), the processor derives the information that this adjective refers to the preceding noun.
  • the character of content is a number indicating a position of the corresponding word in the reference list.
  • the first textual document and/or the second textual document are encoded as an ordered sequence of the identification strings of the textual groups of said first plurality and/or of the textual groups of said second plurality, respectively.
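The coding step described above can be sketched as follows; representing each content character as the word's progressive number in the reference list and joining with a hyphen separator is an assumption made for illustration, as is the small reference list itself.

```python
# Illustrative reference list: the progressive number of each word
# is its content character.
REFERENCE_LIST = ["the", "cell", "comprises", "a", "permeable", "membrane"]

def encode_group(textual_group: str, sep: str = "-") -> str:
    # Each word becomes a content character (its progressive number
    # in the reference list); content characters are joined by a
    # separator character.
    words = textual_group.lower().rstrip(".").split()
    return sep.join(str(REFERENCE_LIST.index(w)) for w in words)

def decode_string(identification_string: str, sep: str = "-") -> str:
    # Decoding: map each content character back to its plain word.
    return " ".join(REFERENCE_LIST[int(c)]
                    for c in identification_string.split(sep))

encoded = encode_group("The cell comprises a permeable membrane.")
print(encoded)                 # 0-1-2-3-4-5
print(decode_string(encoded))  # the cell comprises a permeable membrane
```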
  • said predetermined logical operators can operate both on the coded identification strings (preferably) and on the plain words.
  • said predetermined logical operators include a recurrence operator, which determines a recurrence datum, representative of the number of times an identification string of the first starting document and/or of the second starting document is retrieved in the first starting document and/or in the second starting document.
  • the recurrence operator, in the processing step, is also applicable to a content character of an identification string. In this case, it is possible to determine the recurrence of a word in the first and/or second source text.
  • said predetermined logical operators include an allocation operator.
  • the allocation operator determines an allocation datum, representative of a position of an identification string of the first starting document or of the second starting document in the first starting document and/or in the second starting document. Again, it is possible to execute the allocation operator on a single content character (that is, on a single word), to determine the position of the content character in the first source document and/or in the second source document.
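The recurrence and allocation operators just described can be sketched over a document modeled as an ordered list of identification strings; the function names and the list representation are illustrative assumptions.

```python
def recurrence(document: list[str], target: str) -> int:
    # Recurrence datum: how many times an identification string
    # (the same idea applies to a single content character) is
    # retrieved in the document.
    return document.count(target)

def allocation(document: list[str], target: str) -> list[int]:
    # Allocation datum: the positions at which the identification
    # string occurs in the ordered sequence defining the document.
    return [i for i, s in enumerate(document) if s == target]

encoded_document = ["0-1-2", "3-4-5", "0-1-2"]
print(recurrence(encoded_document, "0-1-2"))  # 2
print(allocation(encoded_document, "0-1-2"))  # [0, 2]
```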
  • said predetermined logical operators include a replication operator.
  • the replication operator receives a predefined replication number.
  • the term predefined is intended to mean a number that the replication operator does not calculate by itself but which it receives as input.
  • the default replication number can be entered by a user or calculated by the processor based on other operators or based on specifications set at the design stage of the computer program executable in the processor.
  • the replication operator generates a replicated string, including the starting identification string replicated the predefined replication number of times.
  • said predetermined logical operators include an extension operator.
  • the extension operator receives one or more words.
  • the extension operator encodes said one or more words.
  • the extension operator generates an expanded string, including an identification string of the first or second starting document integrated with said one or more coded words. It is also foreseen that the extension operator receives said one or more words directly in already coded form.
  • the extension operator allows aspects and characteristics (words), for example present in the first starting document, to be integrated into a concept (identification string) present in the second starting document, in order to generate a potentially new and innovative concept.
  • said predetermined logical operators include a composition operator.
  • the composition operator generates a composite string, starting from a first identifying string of the first or second starting document and a second identifying string of the first or second starting document.
  • the composition operator allows a first concept (first identification string), for example present in the first starting document, to be integrated with a second concept (second identification string) present in the second starting document, in order to generate a third concept (third identification string), potentially new and innovative.
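The replication, extension and composition operators described in the points above can be sketched as simple string manipulations over the encoded form; the separator conventions (hyphen within a string, semicolon between strings) are assumptions carried over from the earlier illustrative encoding, not mandated by the patent.

```python
def replicate(ident: str, n: int, sep: str = ";") -> str:
    # Replication: the starting identification string repeated the
    # predefined replication number of times (n is received as
    # input, not calculated by the operator itself).
    return sep.join([ident] * n)

def extend(ident: str, coded_words: list[str], sep: str = "-") -> str:
    # Extension: integrate one or more coded words into an
    # identification string.
    return sep.join([ident] + coded_words)

def compose(first: str, second: str, sep: str = ";") -> str:
    # Composition: a composite string built from two identification
    # strings, combining two concepts into a third.
    return sep.join([first, second])

print(replicate("0-1-2", 3))        # 0-1-2;0-1-2;0-1-2
print(extend("0-1-2", ["4", "5"]))  # 0-1-2-4-5
print(compose("0-1-2", "3-4-5"))    # 0-1-2;3-4-5
```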
  • said predetermined logical operators include a context operator.
  • the context operator calculates the number of words whose information datum corresponds to a noun (i.e. the number of nouns) or to an adjective (i.e. the number of adjectives) present in the first source document and/or in the second source document.
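A sketch of the context operator, assuming the words have already been tagged with their information data as (word, label) pairs; the tuple representation is an illustrative choice.

```python
def context(tagged_words: list[tuple[str, str]], label: str) -> int:
    # Context: count the words whose information datum matches a
    # given label (e.g. the number of nouns or of adjectives).
    return sum(1 for _, info in tagged_words if info == label)

tagged = [("cell", "noun"), ("permeable", "adjective"),
          ("membrane", "noun"), ("comprises", "verb")]
print(context(tagged, "noun"))       # 2
print(context(tagged, "adjective"))  # 1
```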
  • said predetermined logical operators include an affinity operator.
  • the affinity operator determines, for each identification string of the first starting document, a corresponding affinity value with each identification string of the second starting document.
  • the affinity operator calculates a resulting affinity value, identifying an affinity between the first starting document and the second starting document, on the basis of the reciprocal affinities between the identification strings of the first and second starting document.
  • the affinity is determined by artificial intelligence algorithms, preferably by means of a neural network, known to those skilled in the art of artificial intelligence.
  • the method comprises a neural network training step.
  • the neural network is trained through a large number of lemmas, found through the choice of dictionaries covering the different nomenclatures used, i.e. a plurality of vocabularies including the corresponding words.
  • the neural network refers to the algorithmic category called similarity. For more information, for the purpose of sufficient description, the neural network solutions described at the following links are included by reference: spacy.io, www.matlab.com, www.nltk.org.
  • the neural network itself is trained (numerically configured according to specific numerical sequences of numbers in input and desired numbers in output) to allow it to find correlations between words and their respective similarities (lemma, meaning).
  • the training phase includes a phase of setting the output values (real numbers assigned between 0 and 1) of the neural network: a value equal to one if the compared words are identical (in lemma and meaning) and a value of zero if they are not.
  • the neural network returns values between 0 and 1 in the case in which there are partial similarities (e.g. the same meaning but a different lemma; for example, the result of the affinity function applied to the words "wood" and "rigid"). Therefore, the neural network calculates the affinity value based on a match between words. This allows affinities to be detected even between synonyms or apparently not comparable words.
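In the document the affinity value comes from a trained neural network (the text points to spaCy and NLTK similarity tooling). As a self-contained stand-in, the sketch below uses a character-level ratio from the Python standard library, which likewise returns 1.0 for identical words and intermediate values otherwise; unlike the trained network, it cannot capture purely semantic affinities such as wood/rigid.

```python
from difflib import SequenceMatcher

def affinity(word_a: str, word_b: str) -> float:
    # Stand-in for the trained neural network: 1.0 for identical
    # words, lower values for partial similarity. A real system
    # would use trained embeddings (e.g. spaCy token similarity)
    # to detect semantic affinity between different lemmas.
    return SequenceMatcher(None, word_a, word_b).ratio()

print(affinity("wood", "wood"))   # 1.0
print(affinity("wood", "rigid"))  # a value between 0 and 1
```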
  • the training phase includes an initial phase, in which the neural network, not knowing the similarity between words, assigns "incorrect" (implausible) similarity values.
  • the training phase corresponds to a phase of assigning reference values which are by their nature correct and comprises a plurality of correction functions, determined on the basis of the setting phase, and a training database.
  • after an initial transient, on the basis of the correction functions, the neural network begins to make the assignments that are proper to it at that instant of the numerical simulation and returns plausible values (e.g. for two equal words it produces 0.95, tending towards 0.9999999999, that is, a number comparable to unity, which semantically means a total affinity corresponding to a real concept of equality).
  • the neural network is able to carry out the linguistic identification phase autonomously, progressively saving the words it encounters in the reference list. In particular, if the word is not present in the reference list, the neural network recognizes it, identifies it and saves it in the reference list. Furthermore, following the training phase, the neural network is able to independently calculate the affinity value.
  • said predetermined logical operators include a grafting operator.
  • the grafting operator integrates an identification string of the second starting document with the identification strings relating to the first starting document, or vice versa.
  • the integration carried out by the grafting operator is ordered according to the sequence provided by the construction protocol of the strings, as specified in their initial decoding phase.
  • the graft operator integrates an identifying string of the second starting document in a specific position of the ordered sequence of identifying strings that defines the first starting document.
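The grafting operator described above can be sketched as an insertion into the ordered sequence of identification strings; the list representation and the function name are illustrative assumptions.

```python
def graft(document: list[str], ident: str, position: int) -> list[str]:
    # Grafting: insert an identification string taken from the second
    # starting document at a specific position in the ordered sequence
    # of identification strings defining the first starting document.
    return document[:position] + [ident] + document[position:]

first_document = ["0-1-2", "3-4-5"]
print(graft(first_document, "6-7-8", 1))  # ['0-1-2', '6-7-8', '3-4-5']
```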
  • the resulting final identification strings are thereby generated.
  • the resulting identification strings are preferably defined by the composition, grafting and extension operators.
  • the logical operators of recurrence, affinity, context and allocation are executed first.
  • the method involves performing the grafting, extension, composition and replication operators in a discriminated way, i.e. only for those words (content characters) or textual groups (identification strings) that meet certain requirements. This mode allows any combinations between concepts that are not very similar to each other to be filtered out.
  • the processor executes the grafting, extension, composition and/or replication operators only for the identification strings, words or textual groups that are, in fact, related to each other and whose combination could, with good probability (a necessary metric, not exact mathematics), produce an innovative concept, i.e. produce an innovative concept with a probability greater than 75% (good probability).
  • the method comprises a decoding step, in which said resulting final identification strings are decoded, to generate the third plurality of textual groups and define the destination document resulting from the processing of the selected and previously described operators.
  • the decoding phase allows the user, the researcher, to have a readable text to be analyzed in order to develop any new ideas and / or creations.
  • the method comprises an allocation step.
  • each textual group of said first and/or second plurality is related to the other textual groups of the corresponding text.
  • This allocation is performed on the basis of a reciprocal allocation (position) of each textual group in the first starting document and / or in the second starting document, to evaluate a relationship between the textual groups of said first and second plurality.
  • a textual group is related to the other textual groups to understand if there are relationships and/or dependencies of a textual topological type (specific positions in the text).
  • a textual group could refer to a word introduced within another textual group.
  • the processor identifies a connection between the two textual groups.
  • the method comprises a graphical modeling step, in which, for each textual group of said first and / or second plurality of textual groups, image data is generated.
  • Said image data are generated on the basis of each word, of the functional data of each word and of the relational data of each word of the corresponding textual group.
  • the image data being representative of a graphical representation of each textual group or of the entire first or second plurality of textual groups.
  • the image data is representative of a three-dimensional graphic representation of each textual group of said first and / or second plurality.
  • the information data is representative of a linguistic label afferent to the respective word, said linguistic label being an article, adverb, noun, adjective or verb.
  • the identified textual groups are separated by a period and / or a comma.
  • the method comprises an instruction step.
  • the processor saves in the reference list a new pair, including a word and the corresponding associated identification data, which is entered manually by a user or searched through a search algorithm in the network (e.g. API).
  • the processor saves the new pair in case the word found in the text does not find a match in the reference list.
  • said reference list (the reference data) is continuously updated, increasing its level of completeness. This level of completeness is directly proportional to the number of words available (in the reference list) through which the texts are processed.
  • the method includes a step of linguistic recognition.
  • the method provides for comparing one or more words of said first group of words with the reference list, in order to derive information about the language of the first source document.
  • the reference list is divided into groups, each associated with a respective language.
  • when the processor determines at least a minimum number of correspondences between the words of the first starting group and the words of a specific group of the reference list, it derives and assigns the language of the first starting document accordingly.
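The language recognition just described can be sketched as a vocabulary-overlap count; the per-language groups and the minimum-match threshold of 2 are illustrative assumptions.

```python
# Illustrative groups of the reference list, one per language.
LANGUAGE_GROUPS = {
    "en": {"the", "cell", "is", "of", "living"},
    "it": {"la", "cellula", "è", "di", "viventi"},
}

def recognise_language(words, minimum=2):
    # Assign the language whose group of the reference list reaches
    # the minimum number of correspondences with the starting words.
    for language, vocabulary in LANGUAGE_GROUPS.items():
        matches = sum(1 for w in words if w.lower() in vocabulary)
        if matches >= minimum:
            return language
    return None  # no group reached the minimum number of matches

print(recognise_language(["The", "cell", "is", "alive"]))  # en
```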
  • the present invention provides a machine for generating a textual target document starting from a first and a second starting document.
  • the machine includes a memory.
  • the machine executes the logic / algebraic operators in automatic (pre-configurable) mode. In other words, every time the processing phase is started, the phase itself is performed according to sequences of applied operators with fixed (preconfigured) characteristics.
  • the machine executes the logical / algebraic operators in manual mode, ie every time the processing step is started, the latter is performed according to sequences of operators decided by the user of the machine himself.
  • the machine includes a user interface, for loading the first source document and the second source document. This interface also has the task of offering the user the possibility of selecting the preferred processing method.
  • the selection of the method includes the definition of a sequence of logical / algebraic operators launched for the processing of specific texts.
  • the machine comprises a processor, configured to carry out the steps envisaged by the method according to any one of the aspects introduced in the present invention.
  • the user interface comprises one or more input tools, one or more output tools, and one or more execution mode selection tools (manual or automatic). Said one or more input (i.e. selection) tools allow the user to define a mode of launching the processing phase, between automatic and manual mode.
  • Said one or more input tools can be a keyboard, a mouse, a microphone, a camera, a video camera or a scanner.
  • Said one or more output tools can be a screen (touch screen) or a speaker.
  • the memory comprises the reference list, including a plurality of pairs, each including a word and a corresponding associated information datum.
  • the memory comprises a list of words which is used for the recognition of words and their information.
  • the machine comprises a connection to the network.
  • the processor can use external vocabularies, available on the network or directly entered by the designer/configurator of the machine, through suitable API (Application Programming Interface) requests. These API requests include the word specification. The API request produces a response, including at least the information corresponding to the word.
  • the present invention provides a computer program including instructions for carrying out the steps envisaged by the method according to one or more of the features described in the present invention, when performed by the processor of the machine according to one or more of the features described herein.
  • Figure 1 schematically illustrates the steps of a method for generating a textual document according to the present invention.
  • Figure 2 schematically illustrates a phase of elaboration and generation of the method of Figure 1.
  • Figure 3 schematically illustrates a phase of elaboration of the method of Figure 1.
  • a method for generating a destination document 400 is illustrated, starting from at least one starting document 100, preferably starting from a first starting document 100 and a second starting document 200; to facilitate the understanding of the method described, a practical example is given alongside the broader and more general definition of the method.
  • the term document refers to any multimedia content, preferably of a textual type. If the starting document is not textual, it must first be converted in order to be suitably processed by the steps of the method described in the present invention. Therefore, in one embodiment, the first and second source documents are a first source text 100 and a second source text 200.
  • first and second source texts 100, 200 include a respective first and second plurality of words.
  • first source text 100 is the following: "The cell is the morphological-functional unit of living organisms. Living beings can be animals or plants."
  • second source text 200 is the following: "The cell comprises a permeable membrane. The permeable membrane is permeable to oxygen".
  • the method comprises a receiving step F11 of the first source text 100.
  • the method comprises a receiving step F12 of the second source text 200.
  • the receiving steps F11, F12 of the first and second source texts 100 and 200 can also be an active phase of a processor, which picks up said first and/or second source text 100, 200 from the network.
  • the method provides for the reception F11 of the first source text 100, which is manually entered by a user, and a phase of fetching the second source text 200 via the network, also on the basis of the first source text 100.
  • the processor has the first and second source text 100, 200 available.
  • the method provides for a separation step F21 of the first source text 100 into a first plurality of textual groups 101 (in at least one textual group, in the case where the first source text 100 has only one period).
  • the separation step F21 is performed on the basis of the punctuation of the first source text 100.
  • the processor recognizes the period or comma in the text and separates the textual groups that are delimited by the period or comma.
  • the first starting document 100 will be divided into two textual groups 101, which are:
  • First textual group of the first source text 100: "The cell is the morphological-functional unit of living organisms."
  • Second textual group of the first source text 100: "Living beings can be animals or plants."
  • the method provides for a separation step F22 of the second source text 200 into a second plurality of textual groups 201 (in at least one textual group, if the second source text 200 has only one period).
  • the separation step F22 is performed on the basis of the punctuation of the second source text 200.
  • the processor recognizes the period or comma in the text and separates the textual groups that are delimited by the period or comma.
  • the second starting document 200 will be divided into two textual groups 201, which are:
  • First textual group of the second source text 200: "The cell comprises a permeable membrane."
  • Second textual group of the second source text 200: "The permeable membrane is permeable to oxygen."
  • the method provides a linguistic identification step F31 of each textual group 101 of said first plurality.
  • the method provides a linguistic identification step F32 of each textual group 201 of said second plurality.
  • the linguistic identification phase aims to characterize each single word of the text, to associate it with a function within the textual group and with its relationship to the other words in the textual group.
  • Each linguistic identification phase F31, F32 includes a tokenization or lexical analysis phase.
  • each word of the textual group 101, 201 is separated from the others, to generate a plurality of tokens.
  • the token (or lexical token) is defined, in computer science, as a block of categorized text, usually made up of indivisible characters called lexemes. For a broader description of tokenization in the lexical sphere, the content included at the following link https://it.wikipedia.org/wiki/Token_(testo) is included by reference.
  • the identified tokens are therefore the words that make up the textual group 101, 201.
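The tokenization (lexical analysis) phase above can be sketched as a whitespace split after stripping the delimiting punctuation; real tokenizers (e.g. those of spaCy or NLTK mentioned earlier) handle many more cases, so this is only a minimal illustration.

```python
def tokenize(textual_group: str) -> list[str]:
    # Lexical analysis: separate each word of the textual group into
    # a token, treating periods and commas as delimiters.
    return textual_group.replace(".", " ").replace(",", " ").split()

print(tokenize("The cell comprises a permeable membrane."))
# ['The', 'cell', 'comprises', 'a', 'permeable', 'membrane']
```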
  • Each linguistic identification phase F31, F32 includes a syntactic analysis phase (ie a parsing phase).
  • each word (each token) of the textual group 101, 201 is associated with a corresponding information datum.
  • the information data defines the role, the syntactic function, of the word (of the token) in the textual group.
  • information is a label that identifies this role as a noun, adjective, verb, adverb, article or preposition.
  • the information data includes information about the number and gender of the word (i.e. singular or plural, masculine or feminine).
  • the syntactic analysis requires the presence of a comparison database, to which to refer to perform the syntactic analysis of each word (each token). Therefore, the processor has access to the reference list, comprising a plurality of groups, each corresponding to a specific language or discipline and including the words of said specific language and / or discipline associated with their syntactic function (associated information data).
  • the step of parsing comprises a language recognition step.
  • the processor checks for a correspondence between the words of said first plurality of textual groups 101 or said second plurality of textual groups 201 and the words included in the plurality of vocabularies.
  • the processor identifies a vocabulary containing at least a minimum number of words corresponding to the words of the first plurality of textual groups 101 or of the second plurality of textual groups 201, and then associates the language of the corresponding vocabulary with said first or second plurality of textual groups 101, 201.
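The language recognition step can be sketched as a best-match search over the plurality of vocabularies; the vocabularies, the threshold parameter and the function name are assumptions for illustration:

```python
def recognize_language(tokens, vocabularies, minimum=1):
    # Pick the vocabulary (language) with the most word matches,
    # provided at least `minimum` words correspond (threshold assumed).
    best_lang, best_hits = None, 0
    for lang, words in vocabularies.items():
        hits = sum(1 for t in tokens if t in words)
        if hits > best_hits:
            best_lang, best_hits = lang, hits
    return best_lang if best_hits >= minimum else None

# Toy vocabularies standing in for the training database.
VOCABULARIES = {
    "en": {"the", "cell", "is", "membrane"},
    "it": {"la", "cellula", "e", "membrana"},
}
```

Because the step runs per textual group, a multilingual source text can receive a different language per group, as noted in the text.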
  • the processor is configured to carry out the linguistic recognition step for each textual group of said first or second plurality 101, 201 separately. This makes it possible to process a first source text 100 and/or a second source text 200 that are multilingual.
  • the syntactic analysis phase comprises an extended comparison phase, in which each word of each textual group of said first or second plurality 101, 201 is compared with the reference list of the vocabulary selected for the specific textual group. Upon completion of this comparison, the processor associates with each word (token) the information taken from the vocabulary.
  • the linguistic identification phase F31 comprises a relational identification phase, in which the processor determines, for each word (token), a relational datum, representative of a relationship with the other words (token) of the corresponding textual group.
  • the relational datum is information about the relationship between the words of the textual group.
  • the relational data of a noun word could be a reference to the adjectives that specify it.
  • the relational datum of the adjective could be the noun it refers to.
  • the relational identification phase is performed on the basis of proximity data, identifying a proximity between words (tokens) within the respective textual group.
  • Token [9]: organisms - Information data: noun, masculine, plural - Relational data: noun referring to token [5]; Token [10]: living - Information data: adjective, feminine, singular - Relational data: adjective referring to token [9];
  • Token [15]: membrane - Information data: noun, feminine, singular - Relational data: noun referring to token [12]; Token [16]: permeable - Information data: adjective - Relational data: adjective referring to token [15].
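The relational identification based on proximity data can be sketched with a simple heuristic, such as linking each adjective to the nearest preceding noun; the patent leaves the exact proximity rule open, so this is only one plausible reading:

```python
def link_adjectives(tagged_tokens):
    # Proximity heuristic (assumed): associate each adjective with the
    # index of the nearest preceding noun in the same textual group.
    relations = {}
    last_noun = None
    for i, (word, role) in enumerate(tagged_tokens):
        if role == "noun":
            last_noun = i
        elif role == "adjective" and last_noun is not None:
            relations[i] = last_noun
    return relations
```

On the pair "membrane permeable" this reproduces the example above: the adjective (index 1) is related to the noun (index 0).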
  • the method provides a reduction step.
  • the reduction phase involves the removal, among the identified tokens, of those tokens that have been categorized as verbs and/or articles and/or adverbs. Basically, preferably but not necessarily, the processor only keeps the nouns and the adjectives connected to them.
  • This form of implementation of the method reduces the computing and memory resources required, while maintaining the precision needed to identify the concepts, which can, in a first analysis, be found even with the use of nouns and adjectives alone.
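The reduction phase amounts to a role-based filter over the tagged tokens; a minimal sketch, assuming the role labels used above:

```python
def reduce_tokens(tagged_tokens, kept_roles=frozenset({"noun", "adjective"})):
    # Remove tokens categorized as verbs, articles or adverbs,
    # keeping only nouns and adjectives (the preferred reduction).
    return [(w, r) for (w, r) in tagged_tokens if r in kept_roles]
```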
  • the method provides for an iterative updating phase (progressive and according to the data present) of the reference list.
  • the processor starts an update phase.
  • the updating phase involves searching the available dictionaries for information data to be associated with the specific word (token) that has no correspondence.
  • the processor receives the information data to be associated with the word through the results of the iterative updating function mentioned above.
  • the processor saves the word associated with the information data received in the reference list, defining a new pair of values, which will be available for future parsing activities.
  • the method thus provides a system that iteratively and automatically increases its degree of lexical completeness with respect to a language.
  • in the event that the word that has no correspondence is not present in the dictionaries used to train the neural network, the processor generates a warning, which indicates that the word is wrong (incorrect linguistic formulation) or that it is not known. This notice allows the user to intervene, if necessary, to specify the nature of the word or to correct it, in the event that it is an error.
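The iterative updating phase described above can be sketched as follows; the return values and the representation of dictionaries as mappings are illustrative assumptions:

```python
def update_reference_list(word, reference_list, dictionaries):
    # Iterative update (sketch): if the word has no correspondence,
    # search the available dictionaries for its information data; save
    # the new pair for future parsing, or emit a warning when the word
    # is wrong or simply not known.
    if word in reference_list:
        return "known"
    for dictionary in dictionaries:
        if word in dictionary:
            reference_list[word] = dictionary[word]
            return "added"
    return "warning"
```

Each "added" outcome defines a new pair of values in the reference list, available for future parsing activities; each "warning" outcome corresponds to the notice that lets the user intervene.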
  • the separation phases F21, F22 and the linguistic identification phases F31, F32 which have been introduced as consequential, can, in some embodiments, be inverted.
  • the lexical analysis phase (tokenization) and the syntactic analysis phase (parsing) are performed on the entire text, in order to generate a token also for commas and/or periods (in general, for punctuation) of the first and/or second source text 100, 200.
  • the first and / or second source text 100, 200 is divided into first and / or second plurality of textual groups 101, 201.
  • the method comprises a coding step F41 of the first plurality of textual groups 101.
  • the method comprises a coding step F42 of the second plurality of textual groups 201.
  • each textual group of the first or second plurality of textual groups 101, 201 is coded to generate a corresponding identification string. Therefore, at the end of each coding step F41, F42, the processor has available a first plurality of identification strings 101' and/or a second plurality of identification strings 201'. Therefore, the first and second starting documents 100, 200 are defined by an ordered sequence of the first plurality of identification strings 101' and of the second plurality of identification strings 201', respectively.
  • Each coding step F41, F42 allows the content - of the first or second plurality of textual groups 101, 201 - to be shared while remaining unknown to anyone who does not know the predetermined information protocol. Furthermore, the coding makes it possible to perform logical-mathematical operations on the textual groups under analysis.
  • each coding step F41, F42 provides for the use of:
  • a content character, for example a number, uniquely related to a corresponding word (token);
  • a separator character such as a period, to indicate a separation between a content character and the next content character.
  • each pair included in the reference list is associated with a unique identifier through which it is possible to trace the plain word and its related information data. Therefore, the identification string can be decoded only in the presence of the reference list, through which it is possible to perform a decoding.
  • the identification string of the first textual group of the second starting document 200 is: 23.52223.23300.
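The coding step can be sketched as below; the numeric identifiers are hypothetical and merely chosen so that the output mirrors the example string above, while the period is used as the separator character:

```python
# Hypothetical numeric identifiers (positions in the reference list).
WORD_IDS = {"cell": 23, "membrane": 52223, "permeable": 23300}

def encode(tokens, word_ids, separator="."):
    # Encode a textual group as content characters (numbers) joined
    # by a separator character, per the predetermined protocol.
    return separator.join(str(word_ids[t]) for t in tokens)
```

Decoding is only possible in the presence of the reference list, as the text notes, since the numbers alone carry no plain-word information.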
  • the method comprises a processing step F4.
  • the processing step F4 can be performed both on the first and second plurality of textual groups 101, 201 and on the first and second plurality of identification strings 101', 201'.
  • the processing step is performed on the first and second plurality of identification strings 101', 201', which can be managed more easily with mathematical operators.
  • the method comprises a step of training a neural network.
  • the neural network is preferably of the "similarity" type.
  • the previously described neural network is trained on the basis of a specific training database.
  • Such database includes a plurality of vocabularies, each including a plurality of words.
  • the processing step F4 comprises a step of applying logic and / or algebraic operators on the identification strings of the first starting document 100 and / or of the second starting document 200.
  • the application step F43 is an automatic application of the operators, according to pre-established rules and conditions.
  • the application of the operators is instead manual, i.e. it is determined on the basis of input data identifying a selection of a specific logical-algebraic operator by a user.
  • the logical-algebraic operators comprise a group of evaluating operators, which determine a property of an identification string or a property between two identification strings, and a group of aggregating operators, which integrate two or more identification strings in order to obtain a resulting identification string.
  • the step of automatic application of the operators comprises an affinity evaluation step, in which, for each identification string 101', 201', an affinity operator (evaluating operator) is applied which calculates an affinity value with each identification string 101', 201' of the first and/or second starting document 100, 200.
  • the processor determines pairs of identification strings that have an affinity value greater than a limit affinity value previously set.
  • the processor is configured to apply one or more aggregating operators on said pairs of identifying strings, to generate a resulting string.
  • the processor is configured to re-apply the affinity operator on the resulting string with respect to other identifying strings, originating (i.e. not processed with aggregating operators) or resulting (i.e. already processed with aggregating operators). In this way, if two resulting strings have an affinity value greater than the limit affinity value, they can be further aggregated to move towards a more complex innovative concept.
  • the processor is configured to stop the aggregation process between identifying strings when there are no longer pairs of strings (originating or resulting) that have an affinity value greater than the limit value.
  • downstream of the application of the logical operators, the processor generates one or more resulting strings, which identify (after decoding) a third textual document.
  • the evaluating logical operators include a recurrence operator, which determines a recurrence datum, representative of a number of times in which an identification string of the first starting document and/or of the second starting document is retrieved in the first starting document and/or in the second starting document.
  • the recurrence operator is also applicable on a content character of an identification string. In this case, it is possible to determine the recurrence of a single word in the first and/or second source text.
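The recurrence operator can be sketched as a count over documents modelled as ordered lists of identification strings (or content characters); the list representation is an assumption for illustration:

```python
def recurrence(item, documents):
    # Count how many times an identification string (or a single
    # content character) occurs across the given documents, each
    # modelled as a list of strings.
    return sum(doc.count(item) for doc in documents)
```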
  • the logical evaluating operators comprise an allocation operator .
  • the allocation operator determines an allocation datum, representative of a position of an identification string of the first starting document or of the second starting document in the first starting document and/or in the second starting document. Also in this case, it is possible to apply the allocation operator to an individual content character (i.e. to a single word), to determine the position of the content character in the first starting document and/or in the second starting document.
  • Allocation(S1; D1) = 1, i.e. the string S1 is positioned at position 1 of the first starting document.
  • Allocation(S2; D2) = 1, i.e. the string S2 is positioned at position 1 of the second starting document.
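The allocation operator can be sketched as a 1-based position lookup, again over a document modelled as an ordered list of identification strings (a representational assumption):

```python
def allocation(item, document):
    # Return the 1-based position of an identification string (or a
    # single content character) in a document, or None if absent.
    return document.index(item) + 1 if item in document else None
```

This reproduces the examples above: the first string of a document has allocation 1.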
  • the aggregating logical operators comprise a context operator, which calculates a number of words, whose information data corresponds to a noun (i.e. the number of nouns), present in the first starting document and / or in the second starting document:
  • the aggregating logical operators comprise an affinity operator.
  • the affinity operator determines, for each identification string of the first starting document, a corresponding affinity value with each identification string of the second starting document.
  • the affinity operator calculates a resulting affinity value, identifying an affinity between the first starting document and the second starting document, on the basis of the reciprocal affinities between the identification strings of the first and second starting documents.
  • the affinity values entered in the practical case are arbitrary and depend on correction functions that are progressively trained in the neural network.
  • Affinity(S1; S2) = 0.75, i.e. the affinity between the first string S1 and the second string S2 is equal to 0.75.
  • Affinity(23; 23) = 1, i.e. the affinity between the two characters 23 (i.e. the word cell) is equal to 0.99999999 (the value is forced to one for convenience).
  • the neural network, after training, is able to understand that the cell is not a noun semantically distant from the organism, which is made up of cells.
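In the patent, affinity values come from a trained "similarity" neural network and are progressively corrected; as a stand-in only, an affinity between two identification strings can be sketched as the share of common content characters:

```python
def affinity(string_a, string_b):
    # Toy affinity: overlap ratio of content characters between two
    # identification strings. This merely mimics the shape of the
    # operator; the patent computes affinity with a trained neural
    # network, not with this formula.
    a, b = set(string_a.split(".")), set(string_b.split("."))
    return len(a & b) / len(a | b) if (a | b) else 0.0
```

Pairs whose affinity exceeds the limit affinity value would then be candidates for aggregation, as described above.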
  • the aggregating operators comprise a replication operator which generates a replicated string, including a replicated starting identifier string for the predefined replication number.
  • the aggregating operators comprise an extension operator, which receives one or more words, encodes them and generates an extended string, including a string identifying the first or second starting document integrated with said one or more coded words. The extension operator may also receive said one or more words directly in already coded form.
  • the aggregating operators comprise a composition operator, which generates a compound string, starting from a first identifying string of the first or second starting document and a second identifying string of the first or second starting document:
  • Composition(S1, S2) = 23.35.1200.1523.523.123665.23.52223.200.
  • the processor is configured to eliminate, from the compound string, the content characters corresponding to repeated nouns. In particular, the processor deletes the characters relating to the repeated noun and moves the adjectives referring to it after the first occurrence of that noun.
  • the processor does not remove repeating adjectives, as one adjective can refer to multiple nouns.
  • the above elimination would yield the following result:
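The composition operator and the repeated-noun elimination can be sketched as follows; which content characters denote nouns is taken as a given input here, and the re-positioning of adjectives next to the first occurrence of their noun is deliberately omitted from this simplified sketch:

```python
def composition(s1, s2, separator="."):
    # Compose two identification strings into a compound string.
    return s1 + separator + s2

def eliminate_repeated_nouns(compound, noun_ids, separator="."):
    # Drop noun content characters after their first occurrence.
    # Adjectives are never removed, since one adjective can refer
    # to multiple nouns.
    seen, kept = set(), []
    for ch in compound.split(separator):
        if ch in noun_ids:
            if ch in seen:
                continue
            seen.add(ch)
        kept.append(ch)
    return separator.join(kept)
```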
  • the aggregating operators comprise a graft operator, which integrates an identification string 101' of the first starting document 100 into the sequence of identification strings 201' of the second starting document 200, and vice versa.
  • the graft could be:
  • Tail graft (D2; S1) = 23.52223.23300.3030.456.234.23.35.1200.1523.523.123665
  • Head graft (D2; S1)
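The graft operator can be sketched as an insertion at the head or tail of the ordered sequence of identification strings; the document-as-list representation and parameter names are assumptions:

```python
def graft(document, string, position="tail"):
    # Integrate an identification string into a document's ordered
    # sequence of identification strings, at the head or at the tail.
    return [string] + document if position == "head" else document + [string]
```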
  • the method comprises a decoding step F51.
  • in the decoding step F51, the resulting strings 300 are decoded on the basis of the reference list, to allow the generation of a destination text 400 which is usable by the user.
  • the decoding process F51 is the opposite of the coding process F41, F42, i.e. the processor, starting from the content character, takes the word and the corresponding information data from the reference list. Subsequently, based on the order of the sequence of characters included in the resulting string, the processor generates a sequence of plain words, readable by a user. Therefore, concluding with our practical case, the target document 400 could be, for example, the result of the reduced composition described above, namely: cell - membrane - permeable - unit - morphological - functional - organisms - living.
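The decoding step is the inverse mapping through the reference list; a sketch using the same hypothetical numeric identifiers assumed for the coding example:

```python
# Inverse of the hypothetical coding map (number -> plain word).
ID_WORDS = {23: "cell", 52223: "membrane", 23300: "permeable"}

def decode(identification_string, id_words, separator="."):
    # Recover the plain words, in sequence order, from the content
    # characters of an identification string.
    return [id_words[int(c)] for c in identification_string.split(separator)]
```

Without the reference list (here `ID_WORDS`), the numeric string cannot be decoded, which is the security property claimed for the coding.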
  • the present invention allows a considerable acceleration in technological progress, and is conveniently applicable with excellent results in a plurality of technical fields, allowing to solve various problems of document and information management and processing.
  • some fields of application of the proposed method may be the following: automatic generation of new patent texts from structured ideas and technical documents; automatic generation of a range of intelligent answers with respect to textual bases with the most varied contents (technical documents, scientific documents, product sheets, test and research reports in accordance with quality regulations), with the help of interfaces based on multichannel dialogic technologies (textual, vocal or holographic); automatic generation of new texts deriving from the transcription of more or less structured dialogues and from one or more textual extension documents (opinions of users in the customer care field accompanied by any hypertext references or any partial additions, transcriptions of relevant meetings at corporate or business level, also accompanied by any hypertext references or any partial additions).

Abstract

A method for generating a textual target document (400) starting from a first and a second source document (100, 200), comprises: a phase of receiving (F11) of the first source document (100), including a first group of words; a step of receiving (F12) of the second starting document (200), including a second group of words; a separation step (F21) of the first group of words into a first plurality of textual groups (101), on the basis of a punctuation of the first starting document (100); a step of separating (F22) the second group of words into a second plurality of textual groups (201), on the basis of a punctuation of the second starting document (200); a phase of linguistic identification (F31, F32), in which each word of the textual groups of said first and second plurality of textual groups (101, 201) is associated with: an information datum, representative of a function covered by the word in the context of the respective textual group (101, 201); relational data, representative of a relationship between the word and the other words of the corresponding textual group (101, 201), on the basis of proximity data, representative of a relative position of each word with respect to the other words of the corresponding textual group (101, 201); a processing step (F4) of the textual groups of said first plurality and said second plurality (101, 201) by means of one or more predetermined logical operators; a generation step (F5) of a third plurality of textual groups (400), defining the destination document (400), on the basis of the processing step (F4).

Description

Method for word processing and for generating a text document.
Technical field
The present invention relates to a method for generating a textual document, a machine for generating a textual document and a program for word processing in natural languages. Said method and related application tools being conveniently used for the automatic generation of documents and particularly for the generation of a document starting from starting documents.
Background art
In the field of processing and elaborating textual documents, various solutions are known which receive a text as input and perform a series of operations on it. Said series of operations are all aimed at carrying out elaborations on the single text or on one of its sub-parts (phrases, sentences).
A very common example is that of the search function, through which a user enters a text string and the machine processor returns the matches within the text; this information is vectorial/multiple and provides the user with a series of positions in the text where said string was found.
In the text processing sector, it is also normal practice to subdivide the text on the basis of punctuation, to define separable entities according to assigned criteria.
Finally, some solutions show how it is possible to identify the function of words within a text, for example by comparing them with one or more predefined word vocabularies.
However, as anticipated, all these solutions are focused on the processing of the single text in which the user is performing a series of modification, search and composition operations. In this regard, it is important to underline that, on the other hand, the research and development sector is increasingly demanding in requesting processing systems that accelerate the identification of documents that are related to each other and can be combined to determine new solutions derived from their processing. The solutions for the analysis of texts currently known in the sector do not in any way solve this need, as they are in no way capable of working on multiple documents. Said solutions are for example described in documents EP3198490A1 and US2020193153A1.
In addition to the aforementioned patent solutions, further scientific proposals and methods are known that aim at a reduction of the textual components, either through numerical compression algorithms or through the amputation of parts of the article to create an agile synthesis to be viewed. Among these we mention the scientific paper "Generating Summaries of Multiple News Articles" [Kathleen McKeown et al., International ACM SIGIR Conference, 1 July 1995], concerning the generation of summaries of different articles with similar news (or different according to the sources and authors) through the production and reasoned composition of the respective summaries, which can be viewed in order to be easily consulted (digital bulletin boards). We also mention the scientific paper "Dictionary Compression: Reducing Symbolic Redundancy in RDF" [Farina Antonio et al., 23 August 2017], which focuses on a method aimed at textual compression of the contents present in classic dictionaries. In these, as in other state-of-the-art solutions, texts or semantic contents are processed through a numerical processing aimed at the mere reduction of textual components and preferably at the categorization, sequencing and fast retrieval of articulated and tagged information in appropriate archives. In the face of these known solutions to reduce, summarize and simplify texts (mostly single), there are currently no automatic solutions to process, elaborate and operate selectively on micro-parts of text, aimed at expanding the text itself (contents), enriching it (descriptions) and recomposing it with others (from sources potentially so different as to be antipodal) in a perspective of a "creative mosaic" that can be automated at the speed of current computers.
Aims of the invention
The object of the present invention is to make available a method, a machine and a computer program which overcome the drawbacks of the known art mentioned above. Said object is fully achieved by the method, by the machine and by the processing program that are the objects of the present invention, which are characterized by the contents of the claims reported below.
According to an aspect of the present description, the present invention provides a method for generating a textual target document starting from a first and a second starting document. It should be noted that the present invention also provides a method for generating a textual target document starting from a single source document.
Brief description of the invention
The method according to the proposed invention comprises a step of receiving the first starting document, including a first group of words. In one embodiment, the method comprises a step of receiving the second source document, including a second group of words.
The method comprises a step of separating the first group of words into a first plurality of textual groups. Said separation step is preferably performed on the basis of a punctuation of the first starting document.
In one embodiment, the method comprises a step of separating the second group of words into a second plurality of textual groups. Said separation step is preferably performed on the basis of a punctuation of the second starting document.
In one embodiment, the method comprises a linguistic identification step (lexical analysis, tokenization). In the linguistic identification step, each word of the textual groups of said first and/or second plurality of textual groups is associated with an information datum. The information datum is representative of a function covered by the word within the respective textual group. The information datum is determined on the basis of a reference list including a plurality of pairs, each including a word and a corresponding associated information datum. In other words, the information datum is information about the role of the word in the sentence, including the role of noun, adjective, verb, adverb, article or preposition.
In the linguistic identification step, each word of the textual groups of said first and / or second plurality of textual groups is associated with relational data. Relational data is representative of a relationship between the word and the other words of the corresponding textual group. The relational data are defined (deduced) on the basis of proximity data (i.e. from the topology within the text), representative of a relative position of each word with respect to the other words of the corresponding textual group. In other words, relational data is information about the relationship between one word and another. Purely by way of example, a relational datum of an adjective is the noun (or a reference to the noun) to which said adjective refers.
In one embodiment, the linguistic identification step is performed via a previously trained neural network.
The neural network used here is a computational mathematical model based on biological neural networks. This model includes a group of information interconnections made up of artificial neurons and / or processes. It is an adaptive system that changes its structure based on the information flowing through the network itself during the learning phase. In practical terms, the neural network is an organized non-linear structure of statistical data. Here it is used to simulate complex relationships between input and output texts. The neural network used here receives the textual information on a layer of input nodes (processing units), each of which is connected with numerous internal nodes, organized in several levels. Each node processes the received signals and transmits the result to subsequent nodes. Using a neural network has several advantages, listed below. Neural networks as they are built work in parallel and are therefore able to process a lot of data. Basically it is a sophisticated statistical system with good noise immunity; if some units of the system were to malfunction, the network as a whole would have performance reductions (results of the logical operators only approximate) but it would be difficult for the system to crash. However, programs dedicated to neural networks require good statistical knowledge.
However, the neural network itself is not free of defects, as the models produced by neural networks, even if very efficient, cannot be explained in human symbolic language: the results must be accepted "as they are" (neural networks are often considered a "black box"): unlike an algorithmic program where each phase of the path from input to output is clearly traceable, a neural network is able to generate a valid (or statistically acceptable) result, but it is not possible to explain how and why (with what steps and according to which assumptions) this result was generated.
The use of the neural network allows, in the event that the reference list (word list), initially empty and progressively populated on the basis of elaborated texts, does not include a word from the first and/or second source text, the insertion of that word in the reference list (in the word list), with the respective information data, which the neural network knows thanks to the previous training carried out with the starting dictionaries.
In one embodiment, the method comprises an alphabetical ordering of the reference list, following the insertion of the new word in the reference list.
In one embodiment, the method comprises a step of processing the textual groups of said first plurality. In one embodiment, the method comprises a step of processing the textual groups of said second plurality. Said processing step is performed through one or more pre-established logical and algebraic operators.
In one embodiment, the processing step takes place between the textual groups of said first plurality or between the textual groups of said second plurality. In a preferred embodiment, the processing step takes place between the textual groups of the first and second plurality, so as to perform operations between the first and second document.
The method comprises a step of generating a third plurality of textual groups, defining the target document, on the basis of the processing step (logical/algebraic computation). Therefore, the destination document can be the result of elaborations carried out only on the first source text, or it can be the result of cross-processing between the first and the second source document.
These cross-processings make it possible to generate a target document that includes the content of both the first document and the second document, specifically combined on the basis of the predefined logical operators available to the processing itself. This allows combinations of concepts that the human being, with his cognitive effort, would be limited in making, if only for the computing resources available to the processor. This could provide the researcher with a new concept derived from the specified elaborations, which, albeit in a raw state, stimulates the researcher in the search for new solutions.
In one embodiment, the method steps described for the first and second source documents (reception, separation and linguistic identification) are replicable for a number of documents greater than two.
Therefore, the method is able to process a number of source documents greater than or equal to one (i.e. greater than or equal to two). This further increases the possibilities of identifying possible innovative ideas defined by the coupling of concepts introduced in different documents. In one embodiment, the method comprises a coding step. In the coding step, each textual group of said first and/or second plurality of textual groups is encoded according to a predetermined information transmission protocol. The predetermined information transmission protocol includes a plurality of coding rules.
The coding phase allows to generate a corresponding identification string. The identification string of each textual group is representative of each word, of the information data of each word and of the relational data of each word of the corresponding textual group. The coding step is advantageous for two fundamental reasons. On the one hand, it allows to transmit the textual groups (also post processing) in an encoded way, increasing the security in the transmission of the data, which, not being in plain text, cannot simply be intercepted and read. Furthermore, the coding allows to perform any algebraic (i.e. logical) operations with greater speed and repeatability, which would certainly be more difficult with the use of plain words.
In one embodiment, the identifying string comprises content characters. Each content character identifies a corresponding word and/or its informative data. In other words, the content character is an (alphanumeric) character that identifies a word and its information datum. Purely by way of example, the content character is a number, which corresponds to a progressive number in the reference list (reference data), in which the plain word and its information data are reported, i.e. whether the word is a noun or another type of information.
In one embodiment, the content characters are separated by separator characters.
In one embodiment, the relational data is defined as a function of the sequential position of the content characters in the identification string. In other words, relational data is determined on the basis of the reciprocal position between two content characters. For example, if the content character relating to an adjective is placed immediately after a noun content character (separated by a period), the processor derives the information that this adjective refers to the preceding noun.
In a form of embodiment, the content character is a number indicating a position of the corresponding word in the reference list. In one embodiment, in the coding step, the first textual document and/or the second textual document are encoded as an ordered sequence of the identification strings of the textual groups of said first plurality and/or of said second plurality of textual groups, respectively.
In one embodiment, in the processing step, said predetermined logical operators can operate both on the coded identification strings (preferably) and on the plain words.
In one embodiment, in the processing step, said predetermined logical operators include a recurrence operator, which determines a recurrence datum, representative of the number of times an identification string of the first and/or second starting document occurs in the first starting document and/or in the second starting document. In one embodiment, in the processing step, the recurrence operator is also applicable to a content character of an identification string. In this case, it is possible to determine the recurrence of a single word in the first and/or second source text. In one embodiment, in the processing step, said predetermined logical operators include an allocation operator. The allocation operator determines an allocation datum, representative of the position of an identification string of the first or second starting document in the first starting document and/or in the second starting document. Again, the allocation operator can be executed on a single content character (that is, on a single word), to determine the position of that content character in the first and/or second source document. In one embodiment, in the processing step, said predetermined logical operators include a replication operator.
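A minimal sketch of the recurrence and allocation operators, assuming a document held as an ordered sequence of identification strings (the string values are illustrative):

```python
def recurrence(document, target):
    """Recurrence operator: number of times an identification string
    occurs in a document, given as an ordered sequence of strings."""
    return document.count(target)

def recurrence_of_character(document, content_character):
    """Recurrence operator applied to a single content character
    (i.e. a single word) across all identification strings."""
    return sum(s.split(".").count(content_character) for s in document)

def allocation(document, target):
    """Allocation operator: 0-based positions at which an identification
    string occurs in the document."""
    return [i for i, s in enumerate(document) if s == target]

doc = ["23.35.1200", "23.52223.23300", "23.35.1200"]
print(recurrence(doc, "23.35.1200"))        # 2
print(allocation(doc, "23.35.1200"))        # [0, 2]
print(recurrence_of_character(doc, "23"))   # 3
```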
The replication operator receives a predefined replication number. The term predefined means a number that the replication operator does not calculate by itself but receives as input. The predefined replication number can be entered by a user or calculated by the processor on the basis of other operators or of specifications set at the design stage of the computer program executable in the processor. The replication operator generates a replicated string, comprising the starting identification string replicated the predefined number of times.
In one embodiment, in the processing step, said predetermined logical operators include an extension operator. The extension operator receives one or more words. The extension operator encodes said one or more words. The extension operator generates an extended string, comprising an identification string of the first or second starting document integrated with said one or more coded words. Alternatively, the extension operator may directly receive said one or more words already coded.
The extension operator allows aspects and characteristics (words), for example present in the first starting document, to be integrated into a concept (identification string) present in the second starting document, in order to generate a potentially new and innovative concept.
In one embodiment, in the processing step, said predetermined logical operators include a composition operator. The composition operator generates a composite string, starting from a first identification string of the first or second starting document and a second identification string of the first or second starting document. The composition operator allows a first concept (first identification string), for example present in the first starting document, to be integrated with a second concept (second identification string) present in the second starting document, in order to generate a third concept (third identification string), potentially new and innovative.
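The replication, extension and composition operators can be sketched as simple operations on identification strings. The exact concatenation protocol is an assumption here; the identifier values are taken from the worked example later in this description:

```python
def replicate(identification_string, replication_number):
    """Replication operator: the starting identification string
    replicated the predefined replication number of times."""
    return [identification_string] * replication_number

def extend(identification_string, coded_words):
    """Extension operator: integrate one or more coded words into
    the starting identification string."""
    return ".".join([identification_string] + [str(w) for w in coded_words])

def compose(first_string, second_string):
    """Composition operator: a third, composite string built from two
    starting identification strings."""
    return first_string + "." + second_string

print(replicate("23.35", 2))            # ['23.35', '23.35']
print(extend("23.52223", [23300]))      # 23.52223.23300
print(compose("23.35", "52223.23300"))  # 23.35.52223.23300
```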
In one embodiment, in the processing step, said predetermined logical operators include a context operator. The context operator calculates the number of words whose information datum corresponds to a noun (i.e. the number of nouns) or to an adjective (i.e. the number of adjectives) present in the first source document and/or in the second source document. In one embodiment, in the processing step, said predetermined logical operators include an affinity operator. The affinity operator determines, for each identification string of the first starting document, a corresponding affinity value with each identification string of the second starting document. In one embodiment, the affinity operator calculates a resulting affinity value, identifying an affinity between the first starting document and the second starting document, on the basis of the reciprocal affinities between the identification strings of the first and second starting documents.
In one embodiment, the affinity is determined by artificial intelligence algorithms, preferably by means of a neural network, known to those skilled in the art of artificial intelligence. In particular, in one embodiment, the method comprises a neural network training step. The neural network is trained on a large number of lemmas, drawn from a selection of dictionaries covering the different nomenclatures used, i.e. a plurality of vocabularies, each comprising corresponding words. In one embodiment, the neural network belongs to the algorithmic category known as similarity. For the purpose of sufficient description, the neural network solutions described at the following links are included by reference: spacy.io, www.matlab.com, www.nltk.org.
In the training phase of the neural network, the neural network itself is trained (numerically configured according to specific numerical sequences of input numbers and desired output numbers) so that it can find correlations between words and their respective similarities (lemma, meaning). In detail, the training phase includes a setting phase in which the output values of the neural network (real numbers between 0 and 1) are set to one if the compared words are identical (in lemma and meaning) and to zero if they are not.
The neural network returns values between 0 and 1 where there are partial similarities (e.g. same meaning but different lemma, as in the case of the affinity function applied to the words wood and rigid). Therefore, the neural network calculates the affinity value on the basis of a match between words. This allows affinities to be detected even between synonyms or apparently non-comparable words.
The training phase includes an initial phase, in which the neural network, not yet knowing the similarity between words, assigns "incorrect" (implausible) similarity values. The training phase corresponds to a phase of assigning reference values, which are by their nature correct, and comprises a plurality of correction functions, determined on the basis of the setting phase, and a training database. After an initial transient, on the basis of the correction functions, the neural network begins to make its own assignments at each instant of the numerical simulation and returns plausible values (e.g. for two equal words it first produces 0.95, tending over time towards 0.9999999999, a number comparable to unity, which semantically means a total affinity corresponding to a real concept of equality). Once the training phase has been carried out, the neural network is able to carry out the linguistic identification phase autonomously, progressively saving the words it encounters in the reference list. In particular, if a word is not present in the reference list, the neural network recognizes it, identifies it and saves it in the reference list. Furthermore, following the training phase, the neural network is able to calculate the affinity value independently.
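The behaviour of the affinity values (1 for equal words, 0 for unrelated ones, intermediate for partial similarity) can be illustrated with a cosine-similarity sketch over toy word vectors. The vectors below are hand-made assumptions; in the method described, a trained similarity network would produce them:

```python
import math

def affinity(u, v):
    """Affinity value between two word vectors: tends to 1 for equal
    words, to 0 for unrelated ones, intermediate for partial similarity."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

# Hand-made toy vectors, purely for illustration.
vectors = {
    "wood": [0.9, 0.4, 0.1],
    "rigid": [0.8, 0.5, 0.2],
    "oxygen": [0.0, 0.1, 0.9],
}

equal = affinity(vectors["wood"], vectors["wood"])     # ~1: total affinity
partial = affinity(vectors["wood"], vectors["rigid"])  # partial similarity
low = affinity(vectors["wood"], vectors["oxygen"])     # low affinity
assert equal > partial > low
```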
In one embodiment, in the processing step, said predetermined logical operators include a grafting operator. The grafting operator integrates an identification string of the second starting document with the identification strings of the first starting document, or vice versa. The integration carried out by the grafting operator is ordered according to the sequence provided by the string construction protocol specified in the initial coding phase. In other words, the grafting operator integrates an identification string of the second starting document in a specific position of the ordered sequence of identification strings that defines the first starting document.
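A minimal sketch of the grafting operator, assuming the first starting document is held as an ordered list of identification strings (the second string in the list is illustrative):

```python
def graft(document, identification_string, position):
    """Grafting operator: integrate an identification string of the other
    starting document at a specific position of the ordered sequence of
    identification strings that defines this document."""
    return document[:position] + [identification_string] + document[position:]

first_document = ["23.35.1200.1523.523.123665", "123665.523"]  # illustrative
print(graft(first_document, "23.52223.23300", 1))
# ['23.35.1200.1523.523.123665', '23.52223.23300', '123665.523']
```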
In one embodiment, in the processing step, resulting final identification strings are generated. The resulting identification strings are preferably defined by the composition, grafting and extension operators.
In one embodiment, in the processing step, the logical operators of recurrence, affinity, context and allocation are executed first. On the basis of the outcomes of these operators, the method involves performing the grafting, extension, composition and replication operators in a discriminated way, i.e. only for those words (content characters) or textual groups (identification strings) that meet certain requirements. This mode makes it possible to filter out combinations between concepts that are not very similar to each other. In other words, the processor executes the grafting, extension, composition and/or replication operators only for the identification strings, words or textual groups that are actually related to each other and whose combination could, with good probability (a necessary metric, not a mathematical certainty), produce an innovative concept, i.e. produce an innovative concept with a probability greater than 75%.
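The discriminated execution can be sketched as a filter: the combining operators run only on pairs of identification strings whose affinity exceeds a threshold. The 0.75 threshold mirrors the probability figure above but is otherwise an assumption, and the affinity function here is a stub:

```python
def discriminated_pairs(first_strings, second_strings, affinity, threshold=0.75):
    """Select only the pairs of identification strings similar enough to be
    worth combining with grafting, extension, composition or replication."""
    return [(a, b)
            for a in first_strings
            for b in second_strings
            if affinity(a, b) >= threshold]

# Stubbed affinity values between illustrative strings.
AFFINITY = {("s1", "t1"): 0.9, ("s1", "t2"): 0.2,
            ("s2", "t1"): 0.5, ("s2", "t2"): 0.8}
pairs = discriminated_pairs(["s1", "s2"], ["t1", "t2"],
                            lambda a, b: AFFINITY[(a, b)])
print(pairs)  # [('s1', 't1'), ('s2', 't2')]
```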
In one embodiment, the method comprises a decoding step, in which said resulting final identification strings are decoded, to generate the third plurality of textual groups and define the destination document resulting from the processing of the selected operators described above. The decoding phase provides the user, the researcher, with a readable text to be analysed in order to develop any new ideas and/or creations.
In one embodiment, the method comprises an allocation step. In the allocation step, each textual group of said first and/or second plurality is related to the other textual groups of the corresponding text. This allocation is performed on the basis of the reciprocal allocation (position) of each textual group in the first starting document and/or in the second starting document, to evaluate a relationship between the textual groups of said first and second plurality. In other words, in the allocation phase, a textual group is related to the other textual groups to determine whether there are relationships and/or dependencies of a textual topological type (specific positions in the text). For example, a textual group could refer to a word introduced within another textual group. In this case, in the allocation phase, the processor identifies a connection between the two textual groups.
In one embodiment, the method comprises a graphical modelling step, in which, for each textual group of said first and/or second plurality of textual groups, image data is generated. Said image data are generated on the basis of each word, of the information data of each word and of the relational data of each word of the corresponding textual group. The image data are representative of a graphical representation of each textual group or of the entire first or second plurality of textual groups.
In one embodiment, the image data is representative of a three-dimensional graphic representation of each textual group of said first and / or second plurality.
In one embodiment, the information data is representative of a linguistic label associated with the respective word, said linguistic label being an article, adverb, noun, adjective or verb.
In one embodiment, in the step of separating the first and second group of words, the identified textual groups are separated by a period and / or a comma.
In one embodiment, the method comprises an instruction step. In the instruction step, the processor saves in the reference list a new pair, comprising a word and the corresponding associated identification data, which is entered manually by a user or retrieved through a search algorithm on the network (e.g. via API). The processor saves the new pair when a word found in the text has no match in the reference list. In this way, said reference list (the reference data) is continuously updated, increasing its level of completeness. This level of completeness is directly proportional to the number of words available (the reference list) through which the texts are processed.
In one embodiment, the method includes a linguistic recognition step. In the linguistic recognition step, the method provides for comparing one or more words of said first group of words with the reference list, in order to derive information about the language of the first source document. The reference list is divided into groups, each associated with a respective language.
Therefore, when the processor finds at least a minimum number of correspondences between the words of the first group of words and the words of a specific group of the reference list, it derives and assigns the language of the first starting document.
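This match-counting recognition can be sketched as follows, with tiny illustrative per-language groups of the reference list and an assumed minimum of three matches:

```python
# Illustrative per-language groups of the reference list (tiny samples).
REFERENCE_GROUPS = {
    "english": {"the", "cell", "membrane", "permeable", "living"},
    "italian": {"la", "cellula", "membrana", "permeabile", "viventi"},
}

def recognize_language(words, minimum=3):
    """Assign the language whose reference-list group matches at least
    a minimum number of the words of the first group of words."""
    for language, group in REFERENCE_GROUPS.items():
        matches = sum(1 for w in words if w.lower() in group)
        if matches >= minimum:
            return language
    return None  # no group reaches the minimum number of correspondences

print(recognize_language(["The", "cell", "comprises", "a", "permeable", "membrane"]))
# english
```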
According to an aspect of the present description, the present invention provides a machine for generating a textual target document starting from a first and a second starting document. The machine includes a memory. In one embodiment, the machine executes the logical/algebraic operators in automatic (pre-configurable) mode. In other words, every time the processing phase is started, the phase itself is performed according to sequences of applied operators with fixed (pre-configured) characteristics.
In one embodiment, the machine executes the logical/algebraic operators in manual mode, i.e. every time the processing step is started, the latter is performed according to sequences of operators decided by the user of the machine.
The machine includes a user interface, for loading the first source document and the second source document. This interface also offers the user the possibility of selecting the preferred processing method. The selection of the method includes the definition of a sequence of logical/algebraic operators launched for the processing of specific texts. The machine comprises a processor, configured to carry out the steps envisaged by the method according to any one of the aspects introduced in the present invention. The user interface comprises one or more input tools, one or more output tools, and one or more execution mode selection tools (manual or automatic). Said one or more input (i.e. selection) tools allow the user to define the mode of launching the processing phase, between automatic and manual mode. Said one or more input tools can be a keyboard, a mouse, a microphone, a camera, a video camera or a scanner. Said one or more output tools can be a screen (touch screen) or a speaker.
The memory comprises the reference list, including a plurality of pairs, each including a word and a corresponding associated information datum. In other words, the memory comprises a list of words which is used for the recognition of words and of their information data.
In one embodiment, the machine comprises a network connection. In this embodiment, the processor can use external vocabularies, available on the network or directly entered by the designer/configurator of the machine, through suitable API (Application Programming Interface) requests. These API requests include the word specification. The API request produces a response, including at least the information data corresponding to the word.
In one embodiment, the present invention provides a computer program including instructions for carrying out the steps envisaged by the method according to one or more of the features described in the present invention, when executed by the processor of the machine according to one or more of the features described herein.
Brief description of the drawings
Further characteristics and advantages of the proposed technical solution will be more evident in the following description of a preferred but not exclusive embodiment, represented, by way of non-limiting example, in the three tables of drawings attached, in which:
• Figure 1 schematically illustrates the steps of a method for generating a textual document according to the present invention;
• Figure 2 schematically illustrates a processing and generation phase of the method of Figure 1;
• Figure 3 schematically illustrates a processing phase of the method of Figure 1.
Best way to carry out the invention
With reference to the attached figures, a method for generating a destination document 400 is illustrated, starting from at least one starting document 100, preferably starting from a first starting document 100 and a second starting document 200. To facilitate the understanding of the method described, a practical example accompanies the broader and more general definition of the method.
The term document refers to any multimedia content, preferably of a textual type. If the starting document is not textual, it must first be converted so that it can be suitably processed by the steps of the method described in the present invention. Therefore, in one embodiment, the first and second source documents are a first source text 100 and a second source text 200.
Each of said first and second source texts 100, 200 includes a respective first and second plurality of words. For the purposes of this description, the first source text 100 is the following: “The cell is the morphological-functional unit of living organisms. Living beings can be animals or plants.”, while the second source text 200 is the following: “The cell comprises a permeable membrane. The permeable membrane is permeable to oxygen.”
The method comprises a receiving step F11 of the first source text 100. The method comprises a receiving step F12 of the second source text 200. The receiving steps F11, F12 of the first and second source texts 100 and 200 can also be an active phase of a processor, which fetches said first and/or second source text 100, 200 from the network. For example, in one embodiment, the method provides for the reception F11 of the first source text 100, which is manually entered by a user, and a phase of fetching the second source text 200 via the network, also on the basis of the first source text 100.
In both cases, the processor has the first and second source text 100, 200 available.
The method provides for a separation step F21 of the first source text 100 into a first plurality of textual groups 101 (or into at least one textual group, in the case where the first source text 100 has only one period). The separation step F21 is performed on the basis of the punctuation of the first source text 100. The processor recognizes the period or comma in the text and separates the textual groups that are delimited by the period or comma.
In particular, in the specific case illustrated, the first starting document 100 will be divided into two textual groups 101, which are:
• First textual group of the first source text 100: "The cell is the morphological-functional unit of living organisms."
• Second textual group of the first source text 100: "Living beings can be animals or plants."
The method provides for a separation step F22 of the second source text 200 into a second plurality of textual groups 201 (or into at least one textual group, if the second source text 200 has only one period). The separation step F22 is performed on the basis of the punctuation of the second source text 200. The processor recognizes the period or comma in the text and separates the textual groups that are delimited by the period or comma. In particular, in the specific case illustrated, the second starting document 200 will be divided into two textual groups 201, which are:
• First textual group of the second source text 200: "The cell comprises a permeable membrane."
• Second textual group of the second source text 200: "The permeable membrane is permeable to oxygen."
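The separation steps F21, F22 above can be sketched as a punctuation-based split (the source text is the one of the example):

```python
import re

def separate(source_text):
    """Separation step: split a source text into textual groups on
    periods and commas, discarding empty fragments."""
    groups = re.split(r"[.,]", source_text)
    return [g.strip() for g in groups if g.strip()]

first_source = ("The cell is the morphological-functional unit of living organisms. "
                "Living beings can be animals or plants.")
print(separate(first_source))
# ['The cell is the morphological-functional unit of living organisms',
#  'Living beings can be animals or plants']
```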
The method provides a linguistic identification step F31 of each textual group 101 of said first plurality. The method provides a linguistic identification step F32 of each textual group 201 of said second plurality. The linguistic identification phase aims to characterize each single word of the text, to associate it with a function within the textual group and with its relationship to the other words in the textual group.
Each linguistic identification phase F31, F32 includes a tokenization or lexical analysis phase. In the tokenization step, each word of the textual group 101, 201 is separated from the others, to generate a plurality of tokens. The token (or lexical token) is defined, in computer science, as a block of categorized text, usually made up of indivisible characters called lexemes. For a broader description of tokenization in the lexical sphere, the content available at the following link is included by reference: https://it.wikipedia.org/wiki/Token_(testo).
The identified tokens are therefore the words that make up the textual group 101, 201.
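A minimal sketch of the tokenization step, splitting on whitespace (real tokenizers also handle punctuation and multi-character lexemes):

```python
def tokenize(textual_group):
    """Lexical analysis: separate each word of the textual group into
    a token (a block of categorized text)."""
    return textual_group.split()

print(tokenize("The cell comprises a permeable membrane"))
# ['The', 'cell', 'comprises', 'a', 'permeable', 'membrane']
```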
Each linguistic identification phase F31, F32 includes a syntactic analysis phase (i.e. a parsing phase). In the syntactic analysis step, each word (each token) of the textual group 101, 201 is associated with a corresponding information datum. The information datum defines the role, the syntactic function, of the word (of the token) in the textual group. In other words, the information datum is a label that identifies this role as a noun, adjective, verb, adverb, article or preposition. In one embodiment, the information data includes information about the number and gender of the word (i.e. singular or plural, masculine or feminine).
The syntactic analysis requires the presence of a comparison database, to which reference is made to perform the syntactic analysis of each word (each token). Therefore, the processor has access to the reference list, comprising a plurality of groups, each corresponding to a specific language or discipline and including the words of said specific language and/or discipline associated with their syntactic function (associated information data).
The parsing step comprises a language recognition step. In the linguistic recognition step, the processor checks for a correspondence between the words of said first plurality of textual groups 101 or said second plurality of textual groups 201 and the words included in the plurality of vocabularies. When the processor identifies a vocabulary containing a minimum number of words corresponding to the words of the first plurality of textual groups 101 or of the second plurality of textual groups 201, it associates the language of the corresponding vocabulary with said first or second plurality of textual groups 101, 201. Preferably, the processor is configured to carry out the linguistic recognition step for each textual group of said first or second plurality 101, 201 separately. This allows a first source text 100 and/or a second source text 200 that are multilingual to be processed.
Once the reference vocabulary, that is the language, has been identified, the syntactic analysis phase comprises an extended comparison phase, in which each word of each textual group of said first or second plurality 101, 201 is compared with the reference list of the vocabulary selected for the specific textual group. Upon completion of this comparison, the processor associates the information taken from the vocabulary to each word (token).
Furthermore, the linguistic identification phase F31 comprises a relational identification phase, in which the processor determines, for each word (token), a relational datum, representative of a relationship with the other words (token) of the corresponding textual group.
The relational datum is information about the relationship between the words of the textual group. For example, but not limited to, the relational data of a noun word could be a reference to the adjectives that specify it. Symmetrically, the relational datum of the adjective could be the noun it refers to.
The relational identification phase is performed on the basis of proximity data, identifying a proximity between words (tokens) within the respective textual group.
In applying the above description to the practical case, for the sake of brevity and conciseness, we limit ourselves to selecting, among the textual groups identified, only one textual group for each source document. It is understood that the processor extends this analysis to all textual groups identified, regardless of the length of the documents processed. That said, at the end of the linguistic identification phases, the processor obtains the following information:

First textual group of the first source document 100:
• Token [1]: The - Information datum: Definite article, Feminine, Singular - Relational datum: Article of token [2];
• Token [2]: cell - Information data: Noun, Feminine, Singular - Relational data: Noun connected to token [5];
• Token [3]: is - Information datum: Verb, third person singular, present indicative tense - Relational datum: Verb referring to token [2];
• Token [4]: the - Information datum: Definite article, Feminine, Singular - Relational datum: Article of token [5];
• Token [5]: unit - Information data: Noun, Feminine, Singular - Relational data: Noun referring to token [2];
• Token [6]: morphological - Information data: Adjective - Relational data: Adjective referring to token [5];
• Token [7]: functional - Information data: Adjective - Relational data: Adjective referring to token [5];
• Token [8]: of - Information data: Articulated preposition, Masculine, Plural - Relational data: Specification of token [5], article of token [9];
• Token [9]: organisms - Information data: Noun, Masculine, Plural - Relational data: Noun referring to token [5];
• Token [10]: living - Information data: Adjective, Feminine, Singular - Relational data: Adjective referring to token [9];
First textual group of the second source document 200:
• Token [11]: The - Information datum: Definite article, Feminine, Singular - Relational datum: Article of token [12];
• Token [12]: cell - Information data: Noun, Feminine, Singular - Relational data: Noun connected to token [15];
• Token [13]: comprises - Information datum: Verb, third person singular, present indicative tense - Relational datum: Verb referring to token [12];
• Token [14]: a - Information datum: Indefinite article, Feminine, Singular - Relational datum: Article of token [15];
• Token [15]: membrane - Information data: Noun, Feminine, Singular - Relational data: Noun referring to token [12];
• Token [16]: permeable - Information data: Adjective - Relational data: Adjective referring to token [15].
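The output of the linguistic identification phase can be represented as records pairing each token with its information datum and relational datum. A sketch of the first textual group of the second source document, with the linkage indices adapted from the example, followed by a reduction that keeps only nouns and adjectives (as in one embodiment of the method):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Token:
    """One token with its information datum (syntactic role) and
    relational datum (index of the token it refers to)."""
    index: int
    text: str
    role: str
    refers_to: Optional[int] = None

# First textual group of the second source document (values from the example;
# the refers_to linkages are illustrative).
group = [
    Token(11, "The", "definite article", refers_to=12),
    Token(12, "cell", "noun", refers_to=15),
    Token(13, "comprises", "verb", refers_to=12),
    Token(14, "a", "indefinite article", refers_to=15),
    Token(15, "membrane", "noun", refers_to=12),
    Token(16, "permeable", "adjective", refers_to=15),
]

# Reduction step: keep only the nouns and the adjectives connected to them.
reduced = [t for t in group if t.role in ("noun", "adjective")]
print([t.text for t in reduced])  # ['cell', 'membrane', 'permeable']
```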
In one embodiment, in the linguistic identification step F31, F32, the method provides a reduction step. The reduction phase involves the removal, among the identified tokens, of those tokens that have been categorized as verbs and/or articles and/or adverbs. Basically, preferably but not necessarily, the processor only keeps the nouns and the adjectives connected to them.
This embodiment of the method allows the required computing and memory resources to be reduced, while maintaining the necessary precision in identifying the concepts, which can, in a first analysis, be found even with the use of nouns and adjectives alone.
In one embodiment, in the linguistic identification phase F31, F32, the method provides for an iterative updating phase (progressive and according to the data present) of the reference list. In particular, when the processor does not identify a correspondence between a word (token) and the words included in the reference list, the processor starts an update phase. The updating phase involves searching the available dictionaries for information data to be associated with the specific word (token) that has no correspondence. In the updating phase, the processor receives the information data to be associated with the word through the results of the iterative updating function mentioned above. The processor saves the word associated with the received information data in the reference list, defining a new pair of values, which will be available for future parsing activities. In this way, the method provides a system that iteratively and automatically increases its degree of lexical completeness with respect to a language.
It should be noted that, if the word with no correspondence is not present in the dictionaries used to train the neural network, the processor generates a warning, which indicates that the word is wrong (incorrect linguistic formulation) or unknown. This warning allows the user to intervene, if necessary, to specify the nature of the word or to correct it, in the event of an error.
It is noted that the separation phases F21, F22 and the linguistic identification phases F31, F32, which have been introduced as consecutive, can, in some embodiments, be inverted. In other words, the lexical analysis phase (tokenization) and the syntactic analysis phase (parsing) are performed on the entire text, in order to generate a token also for commas and/or periods (in general, for the punctuation) of the first and/or second source text 100, 200. Following the linguistic identification phase F31, F32, on the basis of the identified punctuation, the first and/or second source text 100, 200 is divided into the first and/or second plurality of textual groups 101, 201.
In one embodiment, the method comprises a coding step F41 of the first plurality of textual groups 101. In one embodiment, the method comprises a coding step F42 of the second plurality of textual groups 201. In each coding step F41, F42, each textual group of the first or second plurality of textual groups 101, 201 is coded to generate a corresponding identification string. Therefore, at the end of each coding step F41, F42, the processor has available a first plurality of identification strings 101' and/or a second plurality of identification strings 201'. Therefore, the first and second starting documents 100, 200 are defined by an ordered sequence of the first plurality of identification strings 101' and of the second plurality of identification strings 201', respectively.
Each coding step F41, F42 allows the content of the first or second plurality of textual groups 101, 201 to be shared in a form that is unintelligible to anyone who does not know the predetermined information protocol. Furthermore, the coding allows logical-mathematical operations to be performed on the textual groups under analysis.
From a practical point of view, according to a purely exemplary embodiment, each coding step F41, F42 provides for the use of:
• a content character, for example a number uniquely related to a corresponding word (token);
• a separator character, such as a period, to indicate a separation between a content character and the next content character.
The method provides that each pair included in the reference list is associated with a unique identifier through which it is possible to trace the plain word and its related information data. Therefore, the identification string can be decoded only in the presence of the reference list, through which it is possible to perform a decoding.
Applying this to the practical case, and considering the embodiment in which only nouns and adjectives are kept, the result is the following:
First textual group of the first source document 100:
• Token [2]: cell - Information data: Noun, Feminine, Singular - Relational data: Noun connected to token [5] - Unique identifier: 23;
• Token [5]: unit - Information data: Noun, Feminine, Singular - Relational data: Noun referring to token [2] - Unique identifier: 35;
• Token [6]: morphological - Information data: Adjective - Relational data: Adjective referring to token [5] - Unique identifier: 1200;
• Token [7]: functional - Information data: Adjective - Relational data: Adjective referring to token [5] - Unique identifier: 1523;
• Token [9]: organisms - Information data: Noun, Masculine, Plural - Relational data: Noun referring to token [5] - Unique identifier: 523;
• Token [10]: living - Information data: Adjective, Feminine, Singular - Relational data: Adjective referring to token [9] - Unique identifier: 123665.

Therefore, the identification string of the first textual group of the first source document 100 is: 23.35.1200.1523.523.123665.
First textual group of the second source document 200:
• Token [12]: cell - Information data: Noun, Feminine, Singular - Relational data: Noun connected to token [4] - Unique identifier: 23;
• Token [15]: membrane - Information data: Noun, Feminine, Singular - Relational data: Noun referring to token [12] - Unique identifier: 52223;
• Token [16]: permeable - Information data: Adjective - Relational data: Adjective referring to token [15] - Unique identifier: 23300.
Therefore, the identification string of the first textual group of the second starting document 200 is: 23.52223.23300.
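Purely by way of illustration, the coding step described above can be sketched as follows (Python; the reference list and function names are hypothetical, and only the unique identifiers — not the information and relational data — are encoded in this simplified sketch):

```python
# Illustrative reference list from the worked example: plain word -> unique identifier.
REFERENCE_LIST = {
    "cell": 23, "unit": 35, "morphological": 1200, "functional": 1523,
    "organisms": 523, "living": 123665, "membrane": 52223, "permeable": 23300,
}

def encode_group(tokens):
    """Encode a textual group as an identification string: each kept token
    is mapped to its unique identifier, joined by the separator character '.'."""
    return ".".join(str(REFERENCE_LIST[t]) for t in tokens)

# First textual group of each starting document (nouns and adjectives only).
s1 = encode_group(["cell", "unit", "morphological", "functional", "organisms", "living"])
s2 = encode_group(["cell", "membrane", "permeable"])
# s1 == "23.35.1200.1523.523.123665", s2 == "23.52223.23300"
```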
In one embodiment, the method comprises a processing step F4. The processing step F4 can be performed both on the first and second plurality of textual groups 101, 201 and on the first and second plurality of identification strings 101', 201'. Preferably, the processing step is performed on the first and second plurality of identification strings 101', 201', which can be managed more easily with mathematical operators.
In one embodiment, the method comprises a step of training a neural network. The neural network is preferably of the "similarity" type. The neural network is trained on the basis of a specific training database. Such database includes a plurality of vocabularies, each including a plurality of words.
In one embodiment, the processing step F4 comprises a step F43 of applying logical and/or algebraic operators on the identification strings of the first starting document 100 and/or of the second starting document 200.
It is observed that, in one embodiment, the application step F43 is an automatic application of the operators, according to pre-established rules and conditions. In another embodiment, the application of the operators is instead manual, i.e. it is driven by input data identifying a user's selection of a specific logical-algebraic operator.
In one embodiment, the logical-algebraic operators comprise a group of evaluating operators, which determine a property of an identification string or a property between two identification strings, and a group of aggregating operators, which integrate two or more identification strings in order to obtain a resulting identification string. In one embodiment, the step of automatic application of the operators comprises an affinity evaluation step, in which, for each identification string 101', 201', an affinity operator (an evaluating operator) is applied which calculates an affinity value with each identification string 101', 201' of the first and/or second starting document 100, 200. In the affinity evaluation step, the processor determines pairs of identification strings that have an affinity value greater than a previously set limit affinity value.
Thereafter, the processor is configured to apply one or more aggregating operators on said pairs of identification strings, to generate a resulting string.
In one embodiment, the processor is configured to re-apply the affinity operator on the resulting string with respect to other identification strings, originating (i.e. not yet processed with aggregating operators) or resulting (i.e. already processed with aggregating operators). In this way, if two resulting strings have an affinity value greater than the limit affinity value, they can be further aggregated to converge towards a more complex innovative concept.
The processor is configured to stop the aggregation process between identification strings when there are no longer pairs of strings (originating or resulting) that have an affinity value greater than the limit affinity value.
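A minimal sketch of the automatic aggregation loop described above (the toy `affinity` function, based on character overlap, merely stands in for the trained "similarity" neural network; the limit affinity value and the merge rule are illustrative assumptions):

```python
def affinity(x, y):
    """Toy affinity value between two identification strings: overlap of
    their content characters (stand-in for the trained similarity network)."""
    a, b = set(x.split(".")), set(y.split("."))
    return len(a & b) / len(a | b)

def aggregate(strings, limit=0.1):
    """Repeatedly merge the most affine pair of strings until no pair
    (originating or resulting) exceeds the limit affinity value."""
    strings = list(strings)
    while True:
        best = None
        for i in range(len(strings)):
            for j in range(i + 1, len(strings)):
                v = affinity(strings[i], strings[j])
                if v > limit and (best is None or v > best[0]):
                    best = (v, i, j)
        if best is None:
            return strings          # stop condition: no pair above the limit
        _, i, j = best
        merged = strings[i] + "." + strings[j]   # simple composition as the aggregator
        strings = [s for k, s in enumerate(strings) if k not in (i, j)]
        strings.append(merged)

result = aggregate(["23.35.1200.1523.523.123665", "23.52223.23300"])
```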
In one embodiment, downstream of the application of the logical operators, the processor generates one or more resulting strings, which identify (after decoding) a third textual document.
In one embodiment, the evaluating logical operators include a recurrence operator, which determines a recurrence datum, representative of a number of times in which an identification string of the first starting document and/or of the second starting document is retrieved in the first starting document and/or in the second starting document. The recurrence operator is also applicable to a content character of an identification string. In this case, it is possible to determine a recurrence of a word in the first and/or second starting text.
Therefore, following our example case, in which we have:
• a string S1: 23.35.1200.1523.523.123665
• a string S2: 23.52223.23300
an example of the recurrence operator is:
• Occurrence (23; S1) = 1, that is, character 23 occurs once in the string S1.
• Occurrence (23; S2) = 1, i.e. character 23 occurs once in the string S2.
In one embodiment, the evaluating logical operators comprise an allocation operator. The allocation operator determines an allocation datum, representative of a position of an identification string of the first starting document or of the second starting document in the first starting document and/or in the second starting document. Also in this case, it is possible to apply the allocation operator to an individual content character (i.e. to a single word), to determine the position of the content character in the first starting document and/or in the second starting document.
• Allocation (S1; D1) = 1, i.e. the string S1 is positioned at position 1 of the first starting document.
• Allocation (1200; S1) = 3, i.e. the character 1200 is at the third position of the string S1.
• Allocation (S2; D2) = 1, i.e. the string S2 is positioned at position 1 of the second starting document.
• Allocation (52223; S2) = 2, i.e. the character 52223 is at the second position of the string S2.
In one embodiment, the evaluating logical operators comprise a context operator, which calculates the number of words whose information datum corresponds to a noun (i.e. the number of nouns) present in the first starting document and/or in the second starting document:
Context (S1) = 3
Context (S2) = 2
In one embodiment, the evaluating logical operators comprise an affinity operator. The affinity operator determines, for each identification string of the first starting document, a corresponding affinity value with each identification string of the second starting document. In one embodiment, the affinity operator calculates a resulting affinity value, identifying an affinity between the first starting document and the second starting document, on the basis of the reciprocal affinities between the identification strings of the first and second starting documents. The affinity values given in the practical case are arbitrary and depend on correction functions that are progressively trained in the neural network.
• Affinity (S1; S2) = 0.75, i.e. the affinity between the first string S1 and the second string S2 is equal to 0.75.
• Affinity (23; 23) = 1, i.e. the affinity between the two characters 23 (i.e. the word cell) is equal to 0.99999999 (the value is forced to one for convenience).
• Affinity (23; 523) = 0.7, i.e. the affinity between character 23 (i.e. the word cell) and character 523 (i.e. the word organisms) is 0.7.
This is because the neural network, after training, is able to understand that the noun cell is semantically close to the noun organism, since an organism is made up of cells.
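An illustrative sketch of the context operator shown above (the set of noun identifiers is taken from the worked example; in practice the information data would come from the reference list):

```python
# Unique identifiers whose information datum is a noun in the worked example:
# cell (23), unit (35), organisms (523), membrane (52223).
NOUN_IDS = {23, 35, 523, 52223}

def context(string):
    """Context operator: number of content characters whose information
    datum corresponds to a noun."""
    return sum(1 for c in string.split(".") if int(c) in NOUN_IDS)

# Context(S1) = 3 and Context(S2) = 2, as in the worked example.
```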
In one embodiment, the aggregating operators comprise a replication operator, which receives a predefined replication number and generates a replicated string, including a starting identification string replicated for the predefined replication number.
Replication (S1, 2) = 23.35.1200.1523.523.123665.23.35.1200.1523.523.123665.
In one embodiment, the aggregating operators comprise an extension operator, which receives one or more words, encodes them and generates an extended string, including an identification string of the first or second starting document integrated with said one or more coded words. It is also foreseen that the extension operator directly receives said one or more already coded words.
Extension (S1, 52223, 23300) = 23.35.1200.1523.523.123665.52223.23300, that is, the string S1 is expanded to include the words membrane and permeable. In one embodiment, the aggregating operators comprise a composition operator, which generates a compound string, starting from a first identification string of the first or second starting document and a second identification string of the first or second starting document:
Composition (S1, S2) = 23.35.1200.1523.523.123665.23.52223.23300.
In one embodiment, the processor is configured to eliminate, from the compound string, the content characters corresponding to repeated nouns. In particular, the processor deletes the character relating to the repeated noun and moves the adjectives or nouns referring to it so that they follow the first occurrence of the noun. Conversely, the processor does not remove repeated adjectives, as one adjective can refer to multiple nouns. Returning to the practical example, the above elimination would give the following result:
Reduced composition (S1, S2) = 23.52223.23300.35.1200.1523.523.123665, which, when decoded, corresponds to cell - membrane - permeable - unit - morphological - functional - organisms - living.
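A hedged sketch of the composition operator and of the reduced composition described above (it assumes, as in the worked example, that the second string consists of one head noun followed by its dependents; the exact reordering rules of the invention may differ):

```python
def composition(s1, s2):
    """Composition operator: compound string from two identification strings."""
    return s1 + "." + s2

def reduced_composition(s1, s2):
    """Reduced composition: if the head noun of s2 already occurs in s1,
    drop the duplicate and queue s2's dependent characters right after
    the first occurrence of that noun (illustrative reordering rule)."""
    a, b = s1.split("."), s2.split(".")
    head, deps = b[0], b[1:]
    if head in a:
        i = a.index(head)
        merged = a[:i + 1] + deps + a[i + 1:]
    else:
        merged = a + b
    return ".".join(merged)

S1 = "23.35.1200.1523.523.123665"
S2 = "23.52223.23300"
# Composition(S1, S2) appends; ReducedComposition folds S2's dependents
# in after the first occurrence of the repeated noun 23 (cell).
```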
In one embodiment, the aggregating operators comprise a graft operator, which integrates an identification string 101' of the first starting document 100 into the sequence of identification strings 201' of the second starting document 200, and vice versa.
In the practical case illustrated, assuming that the encoding of the second textual group of the second starting document 200 is 3030.456.234, the ordered sequence that defines an encoding of the entire second document 200 is: 23.52223.23300.3030.456.234. So, the graft could be:
• Tail coupling (D2; S1) = 23.52223.23300.3030.456.234.23.35.1200.1523.523.123665
• Head coupling (D2; S1) = 23.35.1200.1523.523.123665.23.52223.23300.3030.456.234
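The graft operator of the example reduces, in this sketch, to concatenation of the ordered sequences (operator names are the document's own; the implementation is illustrative):

```python
def tail_coupling(doc, s):
    """Graft a string at the tail of a document's ordered sequence."""
    return doc + "." + s

def head_coupling(doc, s):
    """Graft a string at the head of a document's ordered sequence."""
    return s + "." + doc

D2 = "23.52223.23300.3030.456.234"   # encoding of the entire second document
S1 = "23.35.1200.1523.523.123665"    # first textual group of the first document
```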
In one embodiment, the method comprises a decoding step F51. In the decoding step F51, the resulting strings 300 are decoded on the basis of the reference list, to allow the generation of a destination text 400 which is usable by the user.
The decoding process F51 is the opposite of the coding process F41, F42, i.e. the processor, starting from each content character, retrieves the word and the corresponding information data from the reference list. Subsequently, based on the order of the sequence of characters included in the resulting string, the processor generates a sequence of plain words, readable by a user. Therefore, concluding our practical case, the target document 400 could be, for example, the result of the reduced composition described above, namely: cell - membrane - permeable - unit - morphological - functional - organisms - living.
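An illustrative sketch of the decoding step F51 (the reverse mapping is the worked example's reference list; in this simplified sketch only the plain words, not the information data, are recovered):

```python
# Reverse of the reference list used in the coding sketch: identifier -> plain word.
ID_TO_WORD = {
    23: "cell", 52223: "membrane", 23300: "permeable", 35: "unit",
    1200: "morphological", 1523: "functional", 523: "organisms", 123665: "living",
}

def decode(string):
    """Decoding step F51: map each content character back to its plain word,
    preserving the order of the sequence."""
    return " - ".join(ID_TO_WORD[int(c)] for c in string.split("."))

# Decoding the reduced composition yields the target document of the example.
target = decode("23.52223.23300.35.1200.1523.523.123665")
```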
In this way, the user, starting from the two documents, has combined in a single text two concepts relating to the cell, namely the fact that it is the morphological-functional unit of living organisms and the fact that it has a permeable membrane. This combination, which appears trivial in such an illustrative case, is absolutely advantageous and potentially fertile in the event that there are a large number of strings and a large number of documents. In such cases, with the affinity criteria that the neural network iteratively improves, it is possible to obtain conceptual combinations (sets of words) that would otherwise be impossible (by number and by combinations) without the algorithmic capabilities conceived and programmed in appropriate computing resources suitably used according to this invention.
Industrial applicability
The present invention allows a considerable acceleration of technological progress, and is conveniently applicable with excellent results in a plurality of technical fields, making it possible to solve various problems of document and information management and processing. Purely by way of non-limiting example, and while reserving all possible extensions, it should be noted that some fields of application of the proposed method may be the following: automatic generation of new patent texts from structured ideas and technical documents; automatic generation of a range of intelligent answers with respect to textual bases with the most varied contents (technical documents, scientific documents, product sheets, test and research reports in accordance with quality regulations) with the help of interfaces based on multichannel dialogic technologies (textual, vocal or holographic); automatic generation of new texts deriving from the transcription of more or less structured dialogues and from one or more textual extension documents (opinions of users in the customer care field accompanied by any hypertext references or any partial additions, transcriptions of relevant meetings at corporate or business level, also accompanied by any hypertext references or any partial additions). In particular, and regardless of the implementation example shown in the description as well as of the aforementioned application areas, protection is sought for any implementation alternatives that can reasonably be deduced or derived from the proposed solution. Such further variants may reasonably be made by those skilled in the art without thereby departing from the invention as it results from the present description and the attached claims. Furthermore, the invention itself can be partially realized and all the various details described can be replaced by technically equivalent elements or solutions.

Claims
1. Method for generating a target text document (400) from a first and a second source document (100, 200), the method comprising the following steps:
- receiving (F11) of the first starting document (100), including a first group of words;
- receiving (F12) of the second starting document (200), including a second group of words;
- separation (F21) of the first group of words into a first plurality of textual groups (101), on the basis of a punctuation of the first starting document (100);
- separation (F22) of the second group of words into a second plurality of textual groups (201), on the basis of a punctuation of the second starting document (200); said method being characterized by the fact that it includes the following steps:
- linguistic identification (F31, F32), in which each word of the textual groups of said first and second plurality of textual groups (101, 201) is associated with:
• an information datum, representative of a function covered by the word within the respective textual group (101, 201), on the basis of a reference list including a plurality of pairs, each including a word and a corresponding associated information datum;
• relational data, representative of a relationship between the word and the other words of the corresponding textual group (101, 201), on the basis of proximity data, representative of a relative position of each word with respect to the other words of the corresponding textual group (101, 201);
- processing (F4) of the textual groups of said first plurality and said second plurality (101, 201) by means of one or more predetermined logical operators;
- generation (F5) of a third plurality of textual groups (400), defining the target document (400), on the basis of the processing step (F4).
2. Method according to claim 1, comprising a step of encoding (F41, F42), in which each textual group of said first and second plurality of textual groups (101, 201) is encoded according to a predetermined information transmission protocol, to generate a corresponding identification string (101', 201'), representative of each word, of the information data of each word and of the relational data of each word of the corresponding textual group (101, 201).
3. Method according to claim 2, wherein the identification string (101', 201') comprises content characters, each identifying a corresponding word and its information datum, and wherein the content characters are separated by separator characters, the relational data being defined by a sequential position of the content characters in the identification string (101', 201').
4. Method according to claim 3, wherein the content character is a number indicating a position of the corresponding word in the reference list.
5. Method according to any of claims 2 to 4, wherein, in the encoding step (F41, F42), the first textual document (100) and the second textual document (200) are encoded as an ordered sequence of identification strings (101', 201') of the textual groups of said first and second plurality of textual groups, respectively.
6. Method according to any of claims 2 to 5, wherein, in the processing step (F4), said predetermined logical operators include one or more of the following operators:
- a recurrence operator, which determines a recurrence datum, representative of a number of times in which an identification string (101', 201') of the first starting document or of the second starting document (100, 200) is retrieved in the first starting document or in the second starting document (100, 200);
- an allocation operator, which determines an allocation datum, representative of a position of an identification string (101', 201') of the first starting document or of the second starting document (100, 200) in the first starting document or in the second starting document (100, 200);
- a replication operator, which receives a predefined replication number and generates a replicated string, including a starting identification string (101', 201') replicated for the predefined replication number;
- an extension operator, which receives one or more words, encodes said one or more words and generates an expanded string, including an identification string (101', 201') of the first or second starting document (100, 200) integrated with said one or more coded words;
- a composition operator, which generates a compound string, starting from a first identification string (101', 201') of the first or second starting document (100, 200) and a second identification string (101', 201') of the first or second source document (100, 200);
- a context operator, which calculates a number of words, the information datum of which corresponds to a noun, present in the first starting document and/or in the second starting document (100, 200);
- an affinity operator, which determines, for each identification string (101') of the first starting document (100), a corresponding affinity value with each identification string (201') of the second starting document (200), and in which the affinity operator calculates a resulting affinity value, identifying an affinity between the first starting document and the second starting document (100, 200);
- a graft operator, which integrates an identification string (201') of the second starting document (200) with the identification strings (101') relating to the first starting document (100), wherein final resulting identification strings (300) are generated in the processing step (F4).
7. Method according to claim 6, comprising a step of decoding (F51), wherein said final resulting identification strings (300) are decoded, to generate the third plurality of textual groups and define the target document (400).
8. Method according to any one of the preceding claims, comprising a step of allocation, in which each textual group of said first and second pluralities (101, 201) is related to the other textual groups, on the basis of a mutual allocation in the first starting document or in the second starting document (100, 200), to evaluate a relationship between the textual groups of said first and second plurality (101, 201).
9. Method according to any one of the preceding claims, comprising a graphical modeling step, in which, for each textual group of said first or second plurality of textual groups (101, 201), image data are generated on the basis of each word, of the functional data of each word and of the relational data of each word of the corresponding textual group, said image data being representative of a graphic representation of each textual group or of the entire first or second plurality of textual groups (101, 201).
10. Method according to claim 9, wherein the image data are representative of a three-dimensional graphical representation of each textual unit (101, 201).
11. Method according to any one of the preceding claims, wherein the information datum is representative of a linguistic label of the respective word, said linguistic label being an article, an adverb, a noun, an adjective or a verb.
12. Method according to any one of the preceding claims, wherein, in the separation step (F21, F22) of the first and second group of words, the textual groups (101, 201) are separated by an identified period and/or by a comma.
13. Method according to any one of the preceding claims, wherein the step of linguistic identification and/or the processing step are performed through artificial intelligence algorithms implementing a neural network.
14. Machine for generating a textual destination document starting from a first and a second starting document (100, 200), comprising:
• a memory;
• a user interface, for loading the first source document and the second source document (100, 200);
• a processor, configured to carry out the steps envisaged by the method according to any one of the preceding claims.
15. Computer program including instructions for carrying out the steps provided by the method according to any one of claims 1 to 13, when executed by the processor of the machine according to claim 14.
PCT/IT2021/050382 2020-11-26 2021-11-25 Method for word processing and for generating a text document WO2022113142A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
IT202000028550 2020-11-26
IT102020000028550 2020-11-26

Publications (1)

Publication Number Publication Date
WO2022113142A1 true WO2022113142A1 (en) 2022-06-02




Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21835421

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 21835421

Country of ref document: EP

Kind code of ref document: A1