WO2021221535A1

WO2021221535A1 - System and method for augmenting a training set for machine learning algorithms

Info

Publication number: WO2021221535A1
Application number: PCT/RU2020/000696
Authority: WO
Inventors: Татьяна Олеговна ШАВРИНА
Original assignee: Публичное Акционерное Общество "Сбербанк России"
Priority date: 2020-04-28
Filing date: 2020-12-16
Publication date: 2021-11-04
Also published as: RU2020132305A; RU2758683C2; RU2020132305A3; EA202092855A1

Abstract

The invention relates to the field of computing. The technical result consists in allowing the selection of text data for the augmentation of a training set, based on characteristics of a text in an input training set. Disclosed is a computer-implemented method for augmenting a training set for machine learning algorithms, including the steps of: obtaining text data from an initial training set; performing data normalization, during which the text is divided into sentences and stripped of symbols; vectorizing the normalized sentences, wherein this transformation includes the breakdown of each sentence obtained into its smallest meaningful parts in the form of words and punctuation marks (tokenization) and the formation of vector representations for each normalized text on the basis of the tokens (meaningful parts) contained therein; generating a text index on the basis of the vector representations of the text data, wherein said text index is generated from a vector space formed by open-source texts and metadata; augmenting the initial training set by selecting relevant vector representations of texts by virtue of determining the similarity measure in the vector space on the basis of a search index.

Description

SYSTEM AND METHOD FOR TRAINING SAMPLE AUGMENTATION FOR MACHINE LEARNING ALGORITHMS

FIELD OF TECHNOLOGY

[0001] The present invention relates to the field of computer technology, in particular to solutions for working with machine learning algorithms during the formation of training samples.

BACKGROUND OF THE INVENTION

[0002] Data augmentation means increasing the volume of a training sample for machine learning algorithms. The increase can be either artificial, produced by modifying the available sample, or obtained by filtering suitable open resources based on the available sample. Augmentation of textual data is currently needed in a wide range of areas and industries related to machine learning. In particular, in the construction of dialogue systems (chat bots, smart assistants), data augmentation makes the systems more robust to the variability of commands and to natural synonyms in speech.

[0003] In industrial areas where classification of documents is required but little in-house text data has been accumulated (or the data is unavailable to developers because it is closed, as with medical data, legal documents and government documents), data augmentation is also used to improve the quality of classification in real-world conditions.

[0004] Also, one of the areas in need of augmented data is information extraction (extraction of named entities and relationships between them). The enormous variability in the names of personalities, company names and locations requires a large volume of training samples and a variety of contexts in which entities are used. Open data in this direction covers only a small part of the possible use cases of entities, and is not sufficient for the industrial implementation of such systems.

[0005] A number of approaches are currently used, each with its own advantages and disadvantages. One group of approaches applies random permutations of words in the data, random deletions of words, and replacement of words with synonyms and morphological analogues.

[0006] One known method of data augmentation is EDA (https://arxiv.org/abs/1901.11196, EDA: Easy Data Augmentation Techniques for Boosting Performance on Text Classification Tasks), which is used for tasks involving sequence analysis and classification. However, part of the resulting data becomes difficult for the user to read and understand, and is not perceived by native speakers as a correct, understandable statement.

[0007] It is also known to apply an ontological / semantic approach. Some words in the data are replaced with more general or more specific concepts, which helps systems draw both more general and more accurate specific conclusions. However, this is useful only in problems that do not require robustness to sentence formulation, commands, word order or style of expression. In addition, only a fairly small number of words in a language are covered by a structured ontology.

[0008] Another approach is automatic translation, which uses open language-to-language translation systems (Google Translate). The data is translated into several popular languages and then back-translated into the original language. This approach gives the most complete rephrasing of the initial data, but quite often changes the meaning of the initial statements so much that it increases the noise level of the data.

[0009] Thus, a significant drawback of the known approaches is the inability to supplement / correct training samples while maintaining the relevance of the data in relation to the input information, in order to avoid the loss of the semantic component of the text.

SUMMARY OF THE INVENTION

[0010] The solution to the existing technical problem in the art is to create a data augmentation system based on the analysis of data distribution by generating a global text index supplemented from open data sources.

[0011] The technical result is to ensure the selection of text data for augmentation of the training sample based on the characteristics of the text of the input training sample.

[0012] The claimed result is achieved using a training sample augmentation system for machine learning algorithms, which contains: at least one processor; at least one memory means; an input data processing module configured to obtain text data that form the initial training sample and to normalize the data, whereby the text is divided into sentences and cleared of special characters; a data vectorization module configured to convert normalized sentences into vector form, wherein during said conversion each received sentence is divided into minimum significant parts, which are words and punctuation marks (tokenization), a vector representation is formed for each token, and an averaged vector representation of the normalized sentence is generated; a text data enrichment module containing a set of text data collected from open sources, together with metadata, for their vectorization and for building a search index; a text index module configured to generate a text index based on vector representations of text data; a training sample augmentation module configured to supplement and / or adjust the original text sample based on the selection of relevant vector representations of tokens in the text data enrichment module by determining the proximity measure of tokens in the vector space.

[0013] In one of the particular examples of the system implementation, the data vectorization module generates an averaged vector representation of the text.

[0014] In another particular example of the system implementation, the dimension of the averaged vector representation is 768 × 1.

[0015] In another particular example of the implementation of the system, the metadata includes at least one of: link source in the global Internet, date of source, genre, date of creation, author data, heading, subject, number of words in the source.

[0016] In another particular example of the implementation of the system, the measure of proximity of tokens and texts in space is a cosine measure of proximity.

[0017] In another particular example of the implementation of the system, each token has unique coordinates in the vector space.

[0018] In another particular example of the implementation of the system, based on the coordinates, the minimum and maximum boundary values of the text space of the initial training sample are determined.

[0019] In another particular example of the implementation of the system, the training sample is augmented by adding new texts with coordinates that do not go beyond the boundary values.

[0020] In another particular example of the implementation of the system, the initial training sample is supplemented to a user-specified number of words.

[0021] In another particular example of the implementation of the system, an iterative search for the nearest texts in the vector space is carried out for each text from the sentences of the initial selection.

[0022] In another particular example of the implementation of the system, the uniqueness of the selected texts is determined based on the metadata stored in the text data enrichment module.

[0023] The claimed solution is also carried out using a computer-implemented method for augmentation of a training sample for machine learning algorithms, the method being performed using at least one processor and comprising the stages of: receiving text data of the original training sample; performing data normalization, in which the text is divided into sentences and cleared of special characters; vectorizing the normalized sentences, which includes splitting each received sentence into minimum significant parts, which are words and punctuation marks (tokenization), and forming vector representations for each normalized text based on the tokens (significant parts) included in it; forming a text index based on vector representations of text data, the text index being formed from a vector space built from texts located in open sources and from metadata; augmenting the initial training sample by selecting relevant vector representations of texts based on a measure of proximity in the vector space determined using the search index.

[0024] In one of the particular examples of the implementation of the method, when vectorizing text data, an averaged vector representation of the text is generated.

[0025] In another particular embodiment of the method, the dimension of the averaged vector representation is 768 × 1.

[0026] In another particular embodiment of the method, the metadata includes at least one of: link source in the global Internet, date of source, genre, date of creation, author data, heading, subject, number of words in the source.

[0027] In another particular embodiment of the method, the measure of proximity of tokens and texts in space is a cosine measure of proximity.

[0028] In another particular embodiment of the method, each token has unique coordinates in the vector space.

[0029] In another particular embodiment of the method, based on the coordinates, the minimum and maximum boundary values of the text space of the initial training sample are determined.

[0030] In another particular embodiment of the method, the training sample is augmented by adding new texts having coordinates that do not go beyond the boundary values.

[0031] In another particular embodiment of the method, the initial training sample is supplemented to a user-specified number of words.

[0032] In another particular embodiment of the method, an iterative search of the nearest texts in the vector space is carried out for each text from the sentences of the initial selection.

[0033] In another particular embodiment of the method, the uniqueness of the selected texts is determined based on metadata.

BRIEF DESCRIPTION OF DRAWINGS

[0034] FIG. 1 illustrates an example of the claimed system.

[0035] FIG. 2 illustrates a block diagram of the claimed method.

[0036] FIG. 3 illustrates a general view of a computing device.

CARRYING OUT THE INVENTION

[0037] The claimed solution is implemented using the computer system (100) shown in FIG. 1, which may be executed on a computing device such as a personal computer, server, or the like. The training sample augmentation system includes the main functional elements, such as: the input data processing module (101), the vectorization module (102), the data enrichment module (103), the text index module (104) and the augmentation module (105).

[0038] The input data processing module (101) preprocesses the user texts sent to the augmentation system. The module (101) also cleans them and transforms them into a common space of numerical features.

[0039] The input text data is divided into sentences. Existing open technologies make it possible to carry out this operation for the Russian language without additional development. The input text sample format is usually .txt. The division of the received text into sentences is carried out using open libraries in the Python 3 language (for example, https://pypi.org/project/rusenttokenize/). Also, using the module (101), the sentences of the input sample are divided into tokens by splitting sentences on spaces and separating punctuation marks from the words.

[0040] At the output of the input data processing module (101), a list of sentences and the tokens within them is generated.

Example:

"All people are mortal. Socrates is a man. Therefore Socrates is mortal."

["All people are mortal.", "Socrates is a man.", "Therefore Socrates is mortal."]
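The splitting step described above can be illustrated with a minimal sketch. It does not use the library mentioned in the description; instead it assumes a naive rule (a sentence ends at ".", "!" or "?" followed by whitespace), and the function names `split_sentences` and `split_tokens` are hypothetical.

```python
import re
from typing import List


def split_sentences(text: str) -> List[str]:
    # Naive rule (an assumption): a sentence ends at ., ! or ? followed by whitespace.
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [part for part in parts if part]


def split_tokens(sentence: str) -> List[str]:
    # Runs of word characters become tokens; each punctuation mark is its own token.
    return re.findall(r"\w+|[^\w\s]", sentence)


sentences = split_sentences(
    "All people are mortal. Socrates is a man. Therefore Socrates is mortal."
)
tokens = [split_tokens(sentence) for sentence in sentences]
```

A production system would likely need a language-aware splitter, since this rule breaks on abbreviations and initials.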

[0041] Next, the module (101) clears the texts of special characters. In the vector space the text must be represented as a point in the multidimensional space of word features (vector representations). Special characters other than letters, numbers and punctuation marks can introduce noise into this vector and shift the position of the text in the feature space relative to other texts, which is critical for the final quality of the selection and adjustment of texts during augmentation.

[0042] When the module (101) processes the input information, incoming sentences are filtered to remove special characters that are not included in the list of Cyrillic and Latin letters, numbers and symbols from a standard 105-key keyboard. Such cleaning removes noise caused by rare symbols that are unknown to the universal model, and makes the resulting vectors more accurate. Filtering is done using regular expressions.

Example:

"· _Mama_ was washing the frame. ©" -> "Mama was washing the frame."

At the output of the module (101), a list of sentences of the input training sample, cleared of special characters, is obtained.
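A minimal sketch of such regular-expression filtering is given below. The exact whitelist is not disclosed in the description, so the character class used here (Latin and Cyrillic letters, digits, whitespace and a few common punctuation marks) is an assumption, as is the function name `clean_sentence`.

```python
import re

# Assumed whitelist: Latin and Cyrillic letters, digits, whitespace, common punctuation.
# Everything outside this class is treated as a special character and removed.
_DISALLOWED = re.compile(r"[^A-Za-zА-Яа-яЁё0-9\s.,!?;:()\"'-]")


def clean_sentence(sentence: str) -> str:
    cleaned = _DISALLOWED.sub("", sentence)
    # Collapse the whitespace left behind by removed characters.
    return re.sub(r"\s+", " ", cleaned).strip()


result = clean_sentence("· _Mama_ was washing the frame. ©")
```

With this whitelist the underscore is removed as well, which matches the example above; a different whitelist would give different results.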

[0043] The vectorization module (102) is one or more machine learning models for converting textual information into a vector form (embedding). The cleaned sentences of the text obtained using the module (101) are subject to vectorization. Machine learning models based on word-by-word vectorization, or models that obtain a vector for the entire sentence context as a whole, can be applied.

[0044] In the vectorization module (102), it is preferable to use machine learning models, for example, artificial neural networks (ANNs), which are capable of making generalized conclusions about the world, trained on a large amount of closed data (texts with tens of billions of words, usually news, blogs, literature, including technical literature, and open encyclopedias), for processing and analyzing the properties of new texts. Models such as BERT, ELMo, ULMFiT, XLNet, RoBERTa and others are already successfully used for the Russian language in small-data processing tasks. By using one or more of the above solutions, the module (102) can generate vector representations (embeddings) of texts and sentences. Embeddings obtained from a universal model with generalized knowledge about the variability of texts make it possible to assess the position of texts in the multidimensional space of text properties, and to supplement the sample with texts whose numerical characteristics are similar to those of the original texts of the user's training sample.

[0045] As an example, consider the application of the BERT model for the Russian language (http://docs.deeppavlov.ai/en/master/features/pretrained_vectors.html). The model acts as the source of sentence embeddings. The vectorization module (102), based on the normalized text data of the input training sample received from the module (101), splits each sentence into minimum significant parts, i.e. tokens (words, punctuation marks).

[0046] Tokenization (the division of text into tokens) is performed using an open technology suitable for the BERT model, for example, BertTokenizer (see https://pypi.org/project/pytorch-pretrained-bert/). Based on the results of tokenization, a list of strings corresponding to the sentence tokens is generated. For each token, the vectorization module (102) outputs an embedding taken from the last (-1) layer of the BERT model. The embedding has a dimension of 768 by 1. For each sentence, a corresponding embedding of the given dimension is formed using a neural network model (in this solution, the vector dimension is 1 by 768). In particular, this embedding can be formed by averaging the token embeddings.
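The token averaging operation can be sketched as follows. The sketch assumes that token embeddings are already available as plain lists of numbers (obtaining them from a BERT model is outside its scope), and it uses 3-dimensional toy vectors instead of 768-dimensional ones; the function name `average_embedding` is hypothetical.

```python
from typing import List


def average_embedding(token_vectors: List[List[float]]) -> List[float]:
    """Return the coordinate-wise mean of the token vectors of one sentence."""
    if not token_vectors:
        raise ValueError("at least one token vector is required")
    dim = len(token_vectors[0])
    if any(len(vector) != dim for vector in token_vectors):
        raise ValueError("all token vectors must have the same dimension")
    count = len(token_vectors)
    return [sum(vector[i] for vector in token_vectors) / count for i in range(dim)]


# Toy example: two tokens, three coordinates each (a real embedding would have 768).
toy_tokens = [[1.0, 2.0, 3.0], [3.0, 4.0, 5.0]]
sentence_vector = average_embedding(toy_tokens)
```

The sentence vector has the same dimension as each token vector, regardless of the number of tokens in the sentence.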

[0047] The data enrichment module (103) is a database with texts from open data sources, for example, web resources with various versions of texts, literature, and the like. Module (103) contains texts with a total volume of 10 billion words, while it is designed with the possibility of constant filling, which provides a large variability of the contexts of material in Russian, taking into account various styles, genres and types of materials.

[0048] The information contained in the enrichment module (103) serves as source material, a text corpus for creating a full-fledged index of natural texts, from which the submitted sample will be supplemented. In addition to the texts themselves, the module (103) stores available metadata about each text, such as:

- Identifier (ID);

- Information about the source, the address of its location on the Internet (URL, IP address, etc.);

- Date added to storage;

- Genre;

- Date of writing;

- Full name of the author;

- Heading, subject;

- Word count.
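One possible way to represent a stored text together with the metadata listed above is sketched below. The record layout, the field names and the helper `make_record` are assumptions made for illustration; the description does not specify a storage format.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TextRecord:
    # Hypothetical field names mirroring the metadata items listed above.
    text_id: int
    text: str
    source: Optional[str] = None        # URL, IP address, etc.
    date_added: Optional[str] = None    # date added to storage
    genre: Optional[str] = None
    date_written: Optional[str] = None
    author: Optional[str] = None
    heading: Optional[str] = None       # heading, subject
    word_count: int = 0


def make_record(text_id: int, text: str, **metadata: Optional[str]) -> TextRecord:
    # The word count is derived from the text by splitting on whitespace.
    return TextRecord(text_id=text_id, text=text,
                      word_count=len(text.split()), **metadata)


record = make_record(1, "Socrates is a man.", genre="philosophy")
```

The identifier is what later allows the uniqueness of selected texts to be checked.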

[0049] The text index module (104) generates a hierarchical index based on the previously vectorized texts from the module (103). Vectorization of text data in the module (103) is carried out using the vectorization module (102). The index is built using the nmslib library (https://pypi.org/project/nmslib/).

[0050] This library has indexing methods that are most suitable for building an index on embeddings: you can build a hierarchical index, selecting the most similar text based on the cosine measure. This measure of proximity is a popular metric used to obtain language objects (words, sentences, texts) that are as similar as possible in their properties encoded in embeddings.

[0051] The cosine measure of proximity between two vectors A and B is determined using their dot product and their norms:

$$\cos(A, B) = \frac{A \cdot B}{\lVert A \rVert \, \lVert B \rVert}$$

[0052] The wide applicability of the cosine measure, in particular in problems of information retrieval, machine learning and text processing, is due to its effectiveness as an evaluative measure for sparse vectors / embeddings, since only non-zero values of embeddings need to be taken into account (and text embeddings contain many zero values, each meaning that some feature is absent from the text).

[0053] The cosine measure is only one particular example of a proximity measure for indexing; any other measure can be used. In this case, it is appropriate to use a hierarchical index because it is quite compact and at the same time provides quick retrieval of the nearest objects by embeddings. It is potentially possible to use any other method for constructing an index on a cosine measure (sparse cosine similarity indexing), but due to the considerable dimension of embeddings (usually they are sequences of 300 to 2000 numbers; in the claimed solution, 768), hierarchical methods perform the fastest search for the object in the index closest to the query object.
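The retrieval of the nearest objects by cosine measure can be illustrated with the sketch below. It deliberately replaces the hierarchical index with an exhaustive (brute-force) scan, which returns the same kind of result but does not scale to large indexes; it does not reproduce the API of any indexing library, and the function names are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen here for zero vectors (an assumption)
    return dot / (norm_a * norm_b)


def nearest(query: Sequence[float],
            index_vectors: Sequence[Sequence[float]],
            k: int = 1) -> List[Tuple[int, float]]:
    # Brute-force stand-in for the hierarchical index: score every vector, keep the top k.
    scored = [(i, cosine_similarity(query, v)) for i, v in enumerate(index_vectors)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


toy_index = [[0.0, 1.0], [2.0, 0.0], [1.0, 1.0]]
top_two = nearest([1.0, 0.0], toy_index, k=2)
```

Because the cosine measure ignores vector length, the vector [2.0, 0.0] is a perfect match for the query [1.0, 0.0].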

[0054] As part of the experiment, a test index was built on 100,000 random sentences from Russian Wikipedia and the Common Crawl web corpus (blogs, news, advertising). The headlines of news and popular blog posts were collected, and the most similar sentences from the test index were selected for them. The selected sentences preserve the theme, emotional coloring, style and lexical features of the headlines.

[0055] The full index is built on data from the open web corpus Omnia Russica, which contains 33 billion words in Russian and was compiled by the author of this application (https://omnia-russica.github.io/).

[0056] The sample augmentation module (105) is a set of models for determining the completeness of the sample obtained by the module (101). For the subsequent augmentation of the initial training sample, module (105) can operate in two modes:

1) Creation of an adjusted and / or augmented sample;

2) Completion of the sample to the required number of words.

[0057] The sample is increased to the required size based on user input that indicates the desired sample size in words. The result is capped by the maximum attainable with the given index: for example, if the user requests 1 billion words but only 20 million are available, 20 million are returned.

[0057] FIG. 2 shows a block diagram of the method for augmentation of the training sample (200). At the first stage (201), the user loads the original text sample into the system, where it is processed by the module (101) and subsequently converted into a vector representation (202).

[0058] In the case of adjusting the sample, the following operation occurs:

Using the obtained vectors of the input sample (texts received from the user for augmentation), extreme values, a minimum and a maximum, are calculated for each of the 768 variables in the embedding. The resulting 768 minimums and maximums form a hyperspace in the feature space of the augmentation model used by the module (105).

[0059] From the generated hierarchical index (203), all texts are extracted whose embeddings fall into said hyperspace, i.e. embeddings that satisfy the minimum and maximum conditions for each variable in the coordinate space. A list of such example sentences is output in text form. The augmentation of the sample (204) in terms of improving (adjusting) the sample is achieved by enriching it with new examples that do not have extreme values, which allows a more accurate understanding of the distribution of the phenomena of interest to the user in the vector space.
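The selection by boundary values can be sketched as follows. The sketch assumes that embeddings are plain lists of numbers, uses 2-dimensional toy vectors instead of 768-dimensional ones, and scans the index exhaustively; the function names `bounding_box` and `select_inside` are hypothetical.

```python
from typing import List, Sequence, Tuple


def bounding_box(vectors: Sequence[Sequence[float]]) -> Tuple[List[float], List[float]]:
    """Per-coordinate minimum and maximum over the embeddings of the user sample."""
    dim = len(vectors[0])
    lows = [min(v[i] for v in vectors) for i in range(dim)]
    highs = [max(v[i] for v in vectors) for i in range(dim)]
    return lows, highs


def select_inside(index_vectors: Sequence[Sequence[float]],
                  lows: Sequence[float],
                  highs: Sequence[float]) -> List[int]:
    # Keep the positions of index vectors whose every coordinate lies within the bounds.
    return [i for i, v in enumerate(index_vectors)
            if all(lo <= x <= hi for x, lo, hi in zip(v, lows, highs))]


sample_vectors = [[0.0, 0.0], [2.0, 4.0]]
index_vectors = [[1.0, 1.0], [3.0, 1.0], [2.0, 4.0], [-1.0, 0.0]]
lows, highs = bounding_box(sample_vectors)
selected = select_inside(index_vectors, lows, highs)
```

In this toy example the second and fourth index vectors are rejected because one of their coordinates falls outside the range observed in the user sample.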

[0060] The augmentation of the sample (204) in terms of its completion to the required number of words is carried out as follows. Using the received vectors of input selection tokens (texts received from the user for augmentation), extreme values are calculated for each embedding variable, similarly to the method mentioned above to improve the selection, which form a vector hyperspace of text data.

[0061] From the generated text index (203), all texts whose embeddings fall into this hyperspace are extracted, which makes it possible to estimate the volume of the resulting text sample.

[0062] If the volume of the text sample is less than the number of words declared by the user, then the following operation occurs: from the index, N sentences (starting with N = 1) are selected that are maximally close in cosine measure to each sentence from the obtained sample, even if they do not fall within the defined hyperspace. By iteratively increasing N by one, all sentences are looped through to find unique similar texts until the number of words reaches the number set by the user. The uniqueness of the examples is controlled by checking the ID of the sentence in the database of the module (103).

[0063] If the sample is less than the number of words declared by the user, then the following operation occurs: all examples obtained from the feature hyperspace are sorted by similarity, based on the cosine measure of proximity to the examples in the user sample. Each example from the user sample is looped through, and the N closest examples are selected for it. The parameter N is iteratively increased by 1 until the number of words in the resulting sample reaches the declared number.
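The iterative completion to a required number of words can be sketched as below. This is a simplified illustration under several assumptions: neighbours are ranked by an exhaustive cosine scan rather than by a hierarchical index, uniqueness is tracked by the position of a text in the index (standing in for its ID), only the words of the added texts are counted, and the function name `complete_to_word_count` is hypothetical.

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norms if norms else 0.0


def complete_to_word_count(sample_vectors: Sequence[Sequence[float]],
                           index_vectors: Sequence[Sequence[float]],
                           index_texts: Sequence[str],
                           target_words: int) -> List[str]:
    # Rank all index entries once for every sentence of the user sample.
    rankings = [sorted(range(len(index_vectors)),
                       key=lambda i: cosine(s, index_vectors[i]), reverse=True)
                for s in sample_vectors]
    chosen: List[int] = []
    seen = set()
    total_words = 0
    n = 1
    # Increase N by one per pass; stop at the target or when the index is exhausted.
    while total_words < target_words and n <= len(index_vectors):
        for ranking in rankings:
            candidate = ranking[n - 1]  # the N-th closest text for this sentence
            if candidate in seen:
                continue  # uniqueness check by identifier
            seen.add(candidate)
            chosen.append(candidate)
            total_words += len(index_texts[candidate].split())
            if total_words >= target_words:
                break
        n += 1
    return [index_texts[i] for i in chosen]


toy_vectors = [[1.0, 0.1], [0.0, 1.0], [1.0, 1.0]]
toy_texts = ["a b c", "d e", "f g h i"]
added = complete_to_word_count([[1.0, 0.0]], toy_vectors, toy_texts, target_words=5)
```

If the requested number of words exceeds what the index can provide, the loop ends once every text has been considered and everything available is returned.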

[0064] Execution of the sample augmentation method (200) allows selecting the most relevant text data existing in the constantly generated space of the hierarchical text index, which are used to enrich the user's input training sample.

[0065] The claimed solution can be embedded in other systems to improve their operation, for example, a system for automatic marking of entities in text (the named entity recognition task, where entities are persons, locations, names of organizations, and sometimes additional entities; the task is complex because solving it requires a large number of labeled examples). When working as part of a system for marking up entities, the user loads unlabeled data and examples of entities, the data is then augmented according to the method (200) described above, and the marking of entities takes into account the larger number of contexts formed during sample augmentation.

[0066] The search for additional data is often done manually on a limited set of open sources. However, this approach does not take into account the variability in the original textual data, since a text should be considered mathematically as a sequence of rare events with a large number of factors affecting the distribution: style, genre, source, purpose and date of writing, the relationship of the author with the addressee, etc. Adding heterogeneous text data to the original sample can completely neutralize its features and worsen learning outcomes. The claimed approach automates the search for a suitable complementary homogeneous sample while taking into account the variability of the features of the text.

[0067] FIG. 3 shows a general view of the computing device (300). The device (300) can serve as the basis for a user device for generating and loading a sample, for the computing device (100) that performs the augmentation method (200), and for other devices not shown that can participate in the general information architecture of the claimed solution.

[0068] In the general case, the computing device (300) contains one or more processors (301) connected by a common information exchange bus, memory means such as RAM (302) and ROM (303), input / output interfaces (304), input / output devices (305), and a networking device (306).

[0069] The processor (301) (or multiple processors, or a multi-core processor) can be selected from a range of devices currently in wide use, for example, Intel ™, AMD ™, Apple ™, Samsung Exynos ™, MediaTek ™, Qualcomm Snapdragon ™, etc. The processor (301) can also include a graphics processor or work in conjunction with a graphics accelerator, for example, Nvidia, AMD Radeon, etc., which can be used to perform computational operations when executing machine learning algorithms.

[0070] RAM (302) is a random access memory and is intended for storing machine-readable instructions executed by the processor (301) for performing the necessary operations for logical data processing. RAM (302), as a rule, contains executable instructions of the operating system and corresponding software components (applications, software modules, etc.).

[0071] ROM (303) is one or more persistent storage devices, such as a hard disk drive (HDD), solid state drive (SSD), flash memory (EEPROM, NAND, etc.), optical storage media (CD-R / RW, DVD-R / RW, Blu-ray Disc, MD), etc.

[0072] Various types of I / O interfaces (304) are used to organize the operation of the components of the device (300) and to organize the operation of external connected devices. The choice of the appropriate interfaces depends on the specific design of the computing device, which can be, but are not limited to: PCI, AGP, PS / 2, IrDa, FireWire, LPT, COM, SATA, IDE, Lightning, USB (2.0, 3.0, 3.1, micro, mini, type C), TRS / Audio jack (2.5, 3.5, 6.35), HDMI, DVI, VGA, Display Port, RJ45, RS232, etc.

[0073] To ensure user interaction with the computing device (300), various I / O means (305) are used, for example, a keyboard, display (monitor), touch display, touch pad, joystick, mouse, light pen, stylus, touch panel, trackball, speakers, microphone, augmented reality means, optical sensors, tablet, light indicators, projector, camera, biometric identification means (retina scanner, fingerprint scanner, voice recognition module), etc.

[0074] The networking means (306) allows the device (300) to transmit data via an internal or external computer network, for example, an intranet, the Internet, a LAN, and the like. One or more means (306) may be used, including but not limited to: Ethernet card, GSM modem, GPRS modem, LTE modem, 5G modem, satellite communication module, NFC module, Bluetooth and / or BLE module, Wi-Fi module, etc.

[0075] In addition, satellite navigation aids can also be used as part of the device (300), for example, GPS, GLONASS, BeiDou, Galileo.

[0076] The presented application materials disclose preferred examples of the implementation of the technical solution and should not be construed as limiting other, particular examples of its implementation, not going beyond the scope of the claimed legal protection, which are obvious to specialists in the relevant field of technology.

Claims

FORMULA

1. A system for augmentation of a training sample for machine learning algorithms, comprising: at least one processor; at least one memory means; an input data processing module, configured to obtain text data that form an initial training sample; data normalization, in which the text is divided into sentences and the text is cleared of characters; a data vectorization module capable of converting normalized sentences into vector form, while during said conversion, each received sentence is divided into minimum significant parts, which are words and punctuation marks; tokenization of the mentioned minimum significant parts; formation of vector representations for each token; and generating an averaged vector representation of the normalized sentence; a text data enrichment module containing a set of text data collected from open sources and metadata for their vectorization and building a search index; a text index module, configured to generate a text index based on vector representations of text data; a training sample augmentation module, configured to supplement and / or adjust the original text sample based on the selection of relevant vector representations of tokens in the text data enrichment module by determining the measure of proximity of tokens in the vector space.

2. The system according to claim 1, characterized in that the data vectorization module generates an averaged vector representation of the text.

3. The system according to claim 2, characterized in that the dimension of the averaged vector representation is 768 × 1.

4. The system according to claim 1, characterized in that the metadata includes at least one of: link source in the global Internet, date of source, genre, date of creation, author data, heading, subject, number of words in the source.

5. The system according to claim 1, characterized in that the measure of proximity of tokens and texts in space is a cosine measure of proximity.

6. The system according to claim 1, characterized in that each token has unique coordinates in the vector space.

7. The system according to claim 6, characterized in that, based on the coordinates, the minimum and maximum boundary values of the text space of the initial training sample are determined.

8. The system according to claim 7, characterized in that the training sample is augmented by adding new texts with coordinates that do not go beyond the boundary values.

9. The system according to claim 8, characterized in that the initial training sample is supplemented up to a user-specified number of words.

10. The system according to claim 9, characterized in that an iterative search of the nearest texts in the vector space is carried out for each text from the sentences of the initial selection.

11. The system according to claim 10, characterized in that the uniqueness of the selected texts is determined based on the metadata stored in the text data enrichment module.

12. A computer-implemented method for augmentation of a training sample for machine learning algorithms, performed using at least one processor and containing the stages at which: receive text data of the original training sample; perform data normalization, in which the text is divided into sentences and the text is cleared of characters; vectorization of normalized sentences is performed, while during the mentioned transformation the following is carried out: splitting each received sentence into minimum significant parts, which are words and punctuation marks (tokenization); formation of vector representations for each normalized text based on the tokens (significant parts) included in it; form a text index based on vector representations of text data, while the text index is formed from a vector space formed from texts located in open sources and metadata; augmentation of the initial training sample is carried out using the selection of relevant vector representations of the texts based on the determination of the measure of proximity in the vector space based on the search index.

13. The method according to claim 12, characterized in that when the text data is vectorized, an averaged vector representation of the text is formed.

14. The method according to claim 13, characterized in that the dimension of the averaged vector representation is 768 × 1.

15. The method according to claim 12, characterized in that the metadata includes at least one of: link source in the global Internet, date of source, genre, date of creation, author data, heading, subject, number of words in the source.

16. The method according to claim 12, characterized in that the measure of proximity of tokens and texts in space is a cosine measure of proximity.

17. The method according to claim 12, characterized in that each token has unique coordinates in the vector space.

18. The method according to claim 17, characterized in that, based on the coordinates, the minimum and maximum boundary values of the text space of the initial training sample are determined.

19. The method according to claim 18, characterized in that the training sample is augmented by adding new texts with coordinates that do not go beyond the boundary values.

20. The method according to claim 19, characterized in that the initial training sample is supplemented up to a user-specified number of words.

21. The method according to claim 20, characterized in that an iterative search of the nearest texts in the vector space is carried out for each text from the sentences of the initial selection.

22. The method according to claim 21, characterized in that the uniqueness of the selected texts is determined based on metadata.