US20230131259A1 - Apparatus and method of training machine learning model, and apparatus and method for summarizing document using the same - Google Patents
- Publication number
- US20230131259A1 (Application No. US 17/968,193)
- Authority
- US
- United States
- Prior art keywords
- document
- sentence
- machine learning
- token
- learning model
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N20/00—Machine learning
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N20/00—Machine learning
- G06N20/20—Ensemble learning
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/34—Browsing; Visualisation therefor
- G06F16/345—Summarisation for human users
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/205—Parsing
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/268—Morphological analysis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/284—Lexical analysis, e.g. tokenisation or collocates
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/044—Recurrent networks, e.g. Hopfield networks
- G06N3/0442—Recurrent networks, e.g. Hopfield networks characterised by memory or gating, e.g. long short-term memory [LSTM] or gated recurrent units [GRU]
Abstract
An apparatus of training a machine learning model includes a preprocessing module segmenting a document into each sentence and performing tokenization to generate a token sequence for the document, wherein a document representative token representing the document and representative sentence tokens representing each sentence are included in the token sequence for the document, a first training module training the machine learning model to predict an order of sentences in the document, based on the token sequence for the document, and a second training module training the machine learning model to perform document similarity maximization based on the token sequence for the document.
Description
- This application claims the benefit under 35 USC § 119 of Korean Patent Application No. 10-2021-0140881 filed on Oct. 21, 2021 in the Korean Intellectual Property Office, the disclosure of which is incorporated herein by reference in its entirety.
- The present disclosure relates to a training technology of a machine learning model and a technology for summarizing a document using the same.
- In computer science, natural language understanding (NLU) means that a computer receives a sentence formulated in a natural language (for example, Korean, Japanese, English, etc.) commonly used by humans for communication, and infers an intention of the input sentence. There are various technologies for understanding natural language on a computer, but recently, a technology using an artificial intelligence model based on machine learning has been mainly studied.
- Among the tasks performed by such artificial intelligence language models, an extractive summary task extracts important sentences from a document including several sentences. Existing extractive summary approaches mainly use a labeling method in which a person directly writes a correct answer for the most important sentences in a given document, and a machine learning model is then trained using the corresponding data set.
- However, the existing method includes a subjectivity factor of the person (that is, the annotator) labeling the correct answer, which not only causes bias in the data set itself but also extracts only a predetermined number of sentences when constructing the data set, and thus, it is difficult to train the model to flexibly select the number of important sentences. For example, if, in a dataset A, only the most important sentence is labeled as a correct answer, and a machine learning model is trained with the dataset A, the machine learning model may extract only the most important sentence from an input document, and it may be difficult for the machine learning model to extract a second or third most important sentence from the document.
- In addition, since most of the documents used as data sets are documents in which the most important sentences are not directly labeled as correct answers, training the machine learning model according to the existing method takes considerable time and cost.
- Exemplary embodiments provide an apparatus and method of training a machine learning model, capable of extracting and summarizing various numbers of sentences from a document, while excluding a specific person's subjectivity factor in document summary, and an apparatus and method for summarizing a document using the same.
- According to an aspect of the present disclosure, an apparatus of training a machine learning model includes: a preprocessing module segmenting a document into each sentence and performing tokenization to generate a token sequence for the document, wherein a document representative token representing the document and representative sentence tokens representing each sentence are included in the token sequence for the document; a first training module training the machine learning model to predict an order of sentences in the document, based on the token sequence for the document; and a second training module training the machine learning model to perform document similarity maximization based on the token sequence for the document.
- The first training module may rearrange an order of the sentence token sequences and input the rearranged sentence token sequences to the machine learning model.
- A representative sentence token for a corresponding sentence may be located at the front of the sentence token sequence, and the machine learning model may embed the representative sentence tokens in the rearranged sentence token sequences to generate embedding vectors, respectively, and predict the order of sentences in the document, based on the generated embedding vectors.
- The first training module may calculate a first error by comparing a predicted sentence order output from the machine learning model with an original sentence order of the document and adjust a weight of the machine learning model to minimize the first error.
- The second training module may input two documents that differ in the order of sentences to the machine learning model, and train the machine learning model so that a difference between the two documents is minimized.
- The second training module may primarily rearrange the order of the sentence token sequences in the document, and input the primarily rearranged sentence token sequences and a first document representative token representing the primarily rearranged document to the machine learning model.
- The second training module may secondarily rearrange the sentence token sequences in the document so that the order thereof is different from the order of the primary rearrangement, and input the secondarily rearranged sentence token sequences and a second document representative token representing the secondarily rearranged document to the machine learning model.
- The machine learning model may embed the first document representative token to generate a first embedding vector, and embed the second document representative token to generate a second embedding vector.
- The second training module may calculate a second error through a difference between the first embedding vector and the second embedding vector, and adjust a weight of the machine learning model to minimize the second error.
- A loss function (Loss) of the machine learning model may be expressed by the following equation:
Loss = Loss_SOP + α·Loss_DSM (Equation)
- Loss_SOP: loss function for sentence order prediction
- Loss_DSM: loss function for document similarity maximization
- α: normalization parameter
- According to another aspect of the present disclosure, a method of training a machine learning model includes: segmenting, by a preprocessing module, a document into each sentence and performing tokenization to generate a token sequence for the document, wherein a document representative token representing the document and representative sentence tokens representing each sentence are included in the token sequence for the document; training, with a first training module, the machine learning model to predict an order of sentences in the document, based on the token sequence for the document; and training, with a second training module, the machine learning model to perform document similarity maximization based on the token sequence for the document.
- The training of the machine learning model to predict the order of sentences may include rearranging the order of the sentence token sequences in the document and inputting the rearranged sentence token sequences into the machine learning model, calculating a first error by comparing a predicted sentence order output from the machine learning model with an original sentence order of the document, and adjusting a weight of the machine learning model to minimize the first error.
- A representative sentence token for a corresponding sentence may be located at the front of the sentence token sequence, and the machine learning model may embed the representative sentence tokens in the rearranged sentence token sequences to generate embedding vectors, respectively, and predict the order of sentences in the document, based on the generated embedding vectors.
- The training of the machine learning model to perform document similarity maximization may include inputting two documents that differ in the order of sentences to the machine learning model and training the machine learning model so that a difference between the two documents is minimized.
- The training of the machine learning model to perform document similarity maximization may include primarily rearranging the order of the sentence token sequences in the document, inputting the primarily rearranged sentence token sequences and a first document representative token representing the primarily rearranged document to the machine learning model, secondarily rearranging the sentence token sequences in the document so that the order thereof is different from the order of the primary rearrangement, and inputting the secondarily rearranged sentence token sequences and a second document representative token representing the secondarily rearranged document to the machine learning model.
- The machine learning model may embed the first document representative token to generate a first embedding vector, and embed the second document representative token to generate a second embedding vector.
- The training of the machine learning model to perform document similarity maximization may include calculating a second error through a difference between the first embedding vector and the second embedding vector and adjusting a weight of the machine learning model to minimize the second error.
- A loss function (Loss) of the machine learning model may be expressed by the following equation:
Loss = Loss_SOP + α·Loss_DSM (Equation)
- Loss_SOP: loss function for sentence order prediction
- Loss_DSM: loss function for document similarity maximization
- α: normalization parameter
- According to another aspect of the present disclosure, an apparatus for summarizing a document includes a preprocessing module segmenting a document into each sentence and performing tokenization to generate a token sequence for the document, wherein a document representative token representing the document and representative sentence tokens representing each sentence are included in the token sequence for the document, a machine learning module including a machine learning model receiving the token sequence for the document and embedding the document representative token and the representative sentence tokens of each sentence to output a document representative embedding vector and each representative sentence embedding vector, and a summary extracting module calculating a similarity between the document representative embedding vector and each representative sentence embedding vector and summarizing the document according to the calculated similarity.
- The preprocessing module may locate the document representative token at the front of the document, and locate each representative sentence token at the front of a corresponding sentence.
- The machine learning model may be trained to predict the order of sentences in the document based on a token sequence for the document, and may be trained to perform document similarity maximization based on the token sequence for the document.
- The summary extracting module may summarize the document by extracting a sentence having the similarity greater than or equal to a preset threshold value.
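The threshold-based extraction described above can be sketched in a few lines. This is a hedged illustration, not the disclosed implementation: the disclosure does not fix a particular similarity measure or threshold, so cosine similarity and the 0.8 default are assumptions, and the plain-list vectors stand in for the document representative embedding vector and the representative sentence embedding vectors output by the machine learning model.

```python
import math

# Illustrative sketch: keep each sentence whose similarity between the
# document representative embedding vector and its representative
# sentence embedding vector meets a preset threshold. Cosine similarity
# and the 0.8 threshold are assumptions for illustration.
def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def summarize(doc_vec, sent_vecs, sentences, threshold=0.8):
    # sentences[i] corresponds to the representative sentence embedding
    # vector sent_vecs[i]; extract those sufficiently close to doc_vec.
    return [s for s, v in zip(sentences, sent_vecs)
            if cosine(doc_vec, v) >= threshold]
```

Because the threshold, rather than a fixed count, decides how many sentences survive, the number of extracted sentences can vary per document, which is the flexibility the disclosure aims for.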
- According to another aspect of the present disclosure, a method for summarizing a document includes segmenting, by a preprocessing module, a document into each sentence and performing tokenization to generate a token sequence for the document, wherein a document representative token representing the document and representative sentence tokens representing each sentence are included in the token sequence for the document, receiving, by a machine learning module including a machine learning model, the token sequence for the document and embedding the document representative token and the representative sentence tokens of each sentence to output a document representative embedding vector and each representative sentence embedding vector, and calculating, by a summary extracting module, a similarity between the document representative embedding vector and each representative sentence embedding vector and summarizing the document according to the calculated similarity.
- The above and other aspects, features, and advantages of the present disclosure will be more clearly understood from the following detailed description, taken in conjunction with the accompanying drawings, in which:
-
FIG. 1 is a block diagram illustrating an apparatus of training a machine learning model according to an exemplary embodiment in the present disclosure; -
FIG. 2 is a diagram illustrating a state in which a preprocessing module performs tokenization on a document in an exemplary embodiment in the present disclosure; -
FIG. 3 is a diagram illustrating a state in which a first training module trains a machine learning model in an exemplary embodiment in the present disclosure; -
FIG. 4 is a diagram illustrating a state in which a second training module trains a machine learning model in an exemplary embodiment in the present disclosure; -
FIG. 5 is a flowchart illustrating a method of training a machine learning model according to an exemplary embodiment in the present disclosure; -
FIG. 6 is a block diagram illustrating a configuration of an apparatus for summarizing a document according to an exemplary embodiment in the present disclosure; -
FIG. 7 is a diagram illustrating a process of summarizing a document by an apparatus according to an exemplary embodiment in the present disclosure; -
FIG. 8 is a flowchart illustrating a method for summarizing a document based on machine learning according to an exemplary embodiment in the present disclosure; and -
FIG. 9 is a block diagram illustrating a computing environment including a computing device suitable for use in exemplary embodiments.
- Hereinafter, exemplary embodiments of the present disclosure are described with reference to the accompanying drawings. The following description is provided to aid in a comprehensive understanding of the methods, devices, and/or systems described herein. However, the following description is merely exemplary and is not provided to limit the present disclosure.
- In the following description of the present disclosure, a detailed description of known functions and configurations incorporated herein will be omitted when it would render the subject matter of the present disclosure unclear. The terms used in the present specification are defined in consideration of functions used in the present disclosure, and may be changed according to the intent or conventionally used methods of clients, operators, and users. Accordingly, definitions of the terms should be understood on the basis of the entire description of the present specification. Terms used in the following description are merely provided to describe exemplary embodiments of the present disclosure and are not intended to be limiting of the inventive concept. As used herein, the singular forms "a," "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms "comprises" or "has" when used in this specification, specify the presence of stated features, integers, steps, operations, elements, or a portion or combination thereof, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, or a portion or combination thereof.
- It will be understood that, although the terms first, second, etc. may be used herein to describe various elements, these elements should not be limited by these terms. These terms are only used to distinguish one element from another. For example, a first element could be termed a second element, and, similarly, a second element could be termed a first element, without departing from the scope of the present disclosure.
-
FIG. 1 is a block diagram illustrating an apparatus of training a machine learning model according to an exemplary embodiment in the present disclosure.
- Referring to FIG. 1, an apparatus 100 of training a machine learning model may include a preprocessing module 102, a first training module 104, and a second training module 106.
- Here, a machine learning model 110 is a model trained by the apparatus 100 and may be a model for performing an extractive summary task of extracting an important sentence from an input document. In an exemplary embodiment, as the machine learning model 110, an artificial neural network model such as long short-term memory (LSTM), gated recurrent unit (GRU), bidirectional encoder representations from transformers (BERT), etc. may be used, but is not limited thereto.
- The preprocessing module 102 may segment a document into each sentence, and perform tokenization on each sentence to generate tokens of a preset unit. For example, the preprocessing module 102 may generate tokens in units of morphemes by performing morpheme analysis on each sentence. Here, it is described that the preprocessing module 102 tokenizes each sentence in units of morphemes, but the present disclosure is not limited thereto, and tokenization may be performed in units of words or syllables, or in other preset units.
- In this case, from the document, the preprocessing module 102 may extract a document representative token representing the corresponding document. Also, from each sentence, the preprocessing module 102 may extract a representative sentence token representing the corresponding sentence.
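The token layout produced by this preprocessing can be sketched as follows. This is a minimal sketch, not the disclosed implementation: the `[D]` and `[SR]` marker strings, the period-based sentence splitting, and the whitespace tokenizer (standing in for morpheme-level analysis) are all assumptions for illustration.

```python
# Illustrative sketch of the token layout: a document representative
# token D at the front of the whole document, and a representative
# sentence token SR at the front of each sentence, followed by that
# sentence's tokens T1..Tm. Whitespace splitting stands in for the
# morpheme analysis described in the embodiment.
def build_token_sequence(document: str) -> list:
    sentences = [s.strip() for s in document.split(".") if s.strip()]
    sequence = ["[D]"]  # document representative token at the front
    for sentence in sentences:
        sequence.append("[SR]")            # representative sentence token
        sequence.extend(sentence.split())  # tokens of the sentence
    return sequence

seq = build_token_sequence("The cat sat. The dog ran.")
# seq begins with "[D]", and each sentence's tokens begin with "[SR]"
```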
- FIG. 2 is a diagram illustrating a state in which the preprocessing module 102 performs tokenization on a document in an exemplary embodiment in the present disclosure.
- Referring to FIG. 2, the preprocessing module 102 may add a document representative token D at the front of the token sequence for the document (i.e., the token sequence for the entire document), and may add a representative sentence token SR at the front of the token sequence for the corresponding sentence. Here, each sentence (S1, S2, . . . , Sn) (where n is the number of sentences in the document) may include a representative sentence token SR and tokens (T1, T2, . . . , Tm) (where m is the number of tokens in the corresponding sentence).
- The first training module 104 may serve to train the machine learning model 110 to predict the order of sentences in the document. That is, the first training module 104 may train the machine learning model 110 to perform a sentence order prediction (SOP) task.
- FIG. 3 is a diagram illustrating a state in which the first training module 104 trains the machine learning model 110 in an exemplary embodiment in the present disclosure.
- Referring to FIG. 3, the first training module 104 may randomly change (rearrange) the order of sentences in a document and input the rearranged sentences to the machine learning model 110.
- That is, the first training module 104 may randomly change the order of the token sequence (including the representative sentence token SR and the tokens T1, T2, . . . , Tm) for each sentence in the document. Hereinafter, the token sequence for a predetermined sentence may be referred to as a sentence token sequence. In this case, the first training module 104 may rearrange the order of the sentence token sequences in the document except for the document representative token D located at the front of the document.
- The first training module 104 may input the rearranged sentence token sequences to the machine learning model 110. In this case, the first training module 104 may label the order of the sentences in the document in the original state before rearrangement as the correct answer value.
- Here, the machine learning model 110 may embed the representative sentence token SR located at the front of each rearranged sentence token sequence to generate embedding vectors, and predict the order of the sentence token sequences in the document (i.e., the order of the sentences in the document) based on the generated embedding vectors. That is, in a state in which the order of the sentences in the document is randomly changed, the machine learning model 110 may be provided to predict the original order of the sentences in the document.
- The first training module 104 may calculate a first error by comparing the order of the sentence token sequences in the document predicted by the machine learning model 110 (that is, the predicted sentence order) with the original order of the sentence token sequences in the document (that is, the original sentence order).
- In an exemplary embodiment, the first training module 104 may calculate the error by comparing the predicted sentence order with the original sentence order and outputting whether the predicted sentence order is correct (True: 1) or incorrect (False: 0). In addition to such a binary classification method, the first training module 104 may use a multi-class classification method of obtaining the error by outputting an order from 0 to n−1 (where n is the total number of sentences in the document).
- The first training module 104 may transmit the calculated first error to the machine learning model 110 to train the machine learning model 110 to minimize the first error. That is, the first training module 104 may adjust a weight or a parameter of the artificial neural network constituting the machine learning model 110 toward minimizing the first error.
- Here, the machine learning model 110 is trained to predict the order of sentences in the document by the first training module 104, so that the machine learning model 110 may recognize the flow of information in the document and the relationships between the sentences, thereby improving its performance when performing the extractive summary task.
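The SOP training setup can be sketched as follows. This is a hypothetical illustration, not the disclosed implementation: `make_sop_example` and `first_error` are names introduced here, the shuffle stands in for the first training module's rearrangement, and the error is the simple fraction-incorrect variant of the comparison rather than an actual neural network loss.

```python
import random

# Illustrative sketch of the SOP setup: shuffle the sentence token
# sequences (the document representative token D would stay at the
# front and is not shuffled), and label each shuffled sentence with its
# original position, which is the correct-answer value for training.
def make_sop_example(sentence_seqs, seed=0):
    order = list(range(len(sentence_seqs)))
    random.Random(seed).shuffle(order)
    shuffled = [sentence_seqs[i] for i in order]
    labels = order  # original positions = correct-answer sentence order
    return shuffled, labels

# The first error compares the predicted positions with the labels;
# here, the fraction of incorrectly ordered sentences (0 = all correct).
def first_error(predicted, labels):
    return sum(p != l for p, l in zip(predicted, labels)) / len(labels)
```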
- The second training module 106 may serve to train the machine learning model 110 so that a difference between two documents that differ only in the order of sentences is minimized (or their similarity is maximized). That is, the second training module 106 may input two documents that differ in the order of sentences for one document to the machine learning model 110, and train the machine learning model 110 to minimize the difference between the two documents (in other words, so that their similarity is maximized). The second training module 106 may thus train the machine learning model 110 to perform a document similarity maximization (DSM) task.
- FIG. 4 is a diagram illustrating a state in which the second training module 106 trains the machine learning model 110 in an exemplary embodiment in the present disclosure.
- Referring to FIG. 4, the second training module 106 may primarily rearrange the order of the sentence token sequences with respect to the document tokenized by the preprocessing module 102 and input the rearranged sentence token sequences into the machine learning model 110. In this case, a first document representative token D1 representing the document in which the sentence token sequences are primarily rearranged may be located at the front of the document.
- In addition, the second training module 106 may secondarily rearrange the order of the sentence token sequences with respect to the document tokenized by the preprocessing module 102 and input the rearranged sentence token sequences into the machine learning model 110. In this case, a second document representative token D2 representing the document in which the sentence token sequences are secondarily rearranged may be located at the front of the document. Here, the order of the secondarily rearranged sentence token sequences should be different from the order of the primarily rearranged sentence token sequences.
- Here, the machine learning model 110 may embed the first document representative token D1 of the primarily rearranged document to generate a first embedding vector DE1. In addition, the machine learning model 110 may embed the second document representative token D2 of the secondarily rearranged document to generate a second embedding vector DE2.
- The second training module 106 may calculate a second error through a difference between the first embedding vector DE1 and the second embedding vector DE2 output by the machine learning model 110.
- The second training module 106 may transmit the calculated second error to the machine learning model 110 to train the machine learning model 110 to minimize the second error. That is, the second training module 106 may adjust the weight or parameter of the artificial neural network constituting the machine learning model 110 toward minimizing the second error.
- Meanwhile, here, it is described that a document in which the order of sentences is primarily rearranged and a document in which the order of sentences is secondarily rearranged are separately input to the machine learning model 110 to train the machine learning model 110, but the present disclosure is not limited thereto, and the machine learning model 110 may be trained by inputting a document in which some of the sentences of the document are rearranged and a document in which the rest of the sentences of the document are rearranged to the machine learning model 110. For example, when a document includes ten sentences, the first to fifth sentences may be rearranged and input to the machine learning model 110 and the sixth to tenth sentences may be rearranged and input to the machine learning model 110.
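The DSM setup can be sketched as follows, under stated assumptions: `two_rearrangements` and `second_error` are illustrative names introduced here, the mean-squared difference stands in for however the difference between the embedding vectors DE1 and DE2 is measured, and no actual embedding model is involved.

```python
import random

# Illustrative sketch of the DSM setup: the same document is rearranged
# twice into two different sentence orders; the model would then embed
# the D1/D2 document representative tokens of the two versions, and the
# second error penalizes the difference between those two embeddings.
def two_rearrangements(sentences):
    first = sentences[:]
    random.Random(1).shuffle(first)
    second = sentences[:]
    random.Random(2).shuffle(second)
    if second == first:  # the two orders must differ
        second[0], second[1] = second[1], second[0]
    return first, second

# Mean-squared difference between the two document embedding vectors,
# standing in for the second error; minimizing it pushes DE1 toward DE2.
def second_error(vec1, vec2):
    return sum((a - b) ** 2 for a, b in zip(vec1, vec2)) / len(vec1)
```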
- A loss function (Loss) of the machine learning model 110 trained by the first training module 104 and the second training module 106 may be expressed as Equation 1 below.
Loss = Loss_SOP + α·Loss_DSM (Equation 1)
- Loss_SOP: loss function for sentence order prediction
- Loss_DSM: loss function for document similarity maximization
- α: normalization parameter
- Here, α may be a parameter for normalization of the loss function Loss_SOP and the loss function Loss_DSM.
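Equation 1 can be expressed directly in code. A minimal sketch: the disclosure does not specify a value for α, so the default used here is an arbitrary illustration.

```python
# Sketch of the combined training objective of Equation 1: the SOP loss
# plus the DSM loss scaled by the normalization parameter alpha.
# alpha's value is not specified in the disclosure; 0.5 is arbitrary.
def total_loss(loss_sop: float, loss_dsm: float, alpha: float = 0.5) -> float:
    return loss_sop + alpha * loss_dsm
```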
-
FIG. 5 is a flowchart illustrating a method of training a machine learning model according to an exemplary embodiment in the present disclosure. The method illustrated inFIG. 5 may be performed, for example, by thetraining apparatus 100 illustrated inFIG. 1 . - Referring to
FIG. 5 , thetraining apparatus 100 segments a document into each sentence and extracts a representative sentence token representing each sentence (501). - Thereafter, the
training apparatus 100 tokenizes each sentence to generate a sentence token sequence (S503). In this case, thetraining apparatus 100 may locate a representative sentence token SR of the corresponding sentence at the front of the sentence token sequence of the corresponding sentence. - Thereafter, the
training apparatus 100 rearranges the order of sentence token sequences in the document and inputs the rearranged sentence token sequences to the machine learning model 110 (505). - Thereafter, the
training apparatus 100 calculates a first error for sentence order prediction by comparing a predicted sentence order output by themachine learning model 110 with an original sentence order (507). - At this time, the
machine learning model 110 generates an embedding vector by embedding the representative sentence token SR located at the front in the rearranged sentence token sequences, and output a predicted sentence order (value predicting the order of sentences in the document) based on the generated embedding vectors. - Thereafter, the
training apparatus 100 transmits the first error to themachine learning model 110 to adjust a weight of themachine learning model 110 toward minimizing the first error (509). - Thereafter, the
training apparatus 100 first rearranges the order of the sentence token sequences in the document, locates a first document representative token D1 representing the primarily rearranged document at the front of the document, and input the same to the machine learning model 110 (511). - Thereafter, the
training apparatus 100 secondarily rearranges the order of the sentence token sequences in the document, and locates a second document representative token D2 representing the secondly rearranged document at the front of the document, and input the same to the machine learning model 110 (513). - Thereafter, the
training apparatus 100 calculates a second error through a difference between the first embedding vector DE1 and the second embedding vector DE2 output from the machine learning model 110 (515). - Here, the
machine learning model 110 may generate the first embedding vector DE1 by embedding the first document representative token D1 of the primarily rearranged document and generate the second embedding vector DE2 by embedding the second document representative token D2 of the secondly rearranged document. - Thereafter, the
training apparatus 100 transmits the second error to themachine learning model 110 to adjust a weight of themachine learning model 110 toward minimizing the second error (517). - Meanwhile, in the flowchart illustrated in
FIG. 5 , the method is described as being segmented into a plurality of operations, but at least some of the operations may be performed in a different order, may be combined with other operations, may be omitted, may be divided into more detailed operations, or one or more operations (not shown) may be added. -
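The two training signals of FIG. 5 can be illustrated with a minimal, self-contained sketch. Everything below is an assumption for illustration only, not part of the disclosure: the `predict_order` and `embed_document` callables stand in for the machine learning model 110, the zero-one error stands in for a differentiable sentence-order loss such as cross-entropy, and the mean squared difference stands in for whatever distance the actual implementation minimizes.

```python
import random

def first_error(sentences, predict_order):
    """Sentence order prediction (operations 505-509): shuffle the sentences,
    ask the model for each sentence's original position, and score the guess."""
    shuffled = random.sample(range(len(sentences)), k=len(sentences))
    rearranged = [sentences[i] for i in shuffled]
    predicted = predict_order(rearranged)  # model's guess at each original index
    # fraction of positions predicted incorrectly
    # (a real implementation would use a differentiable loss such as cross-entropy)
    return sum(p != t for p, t in zip(predicted, shuffled)) / len(shuffled)

def second_error(sentences, embed_document):
    """Document similarity maximization (operations 511-517): embed two
    rearrangements of the same document and compare the resulting document
    representative embedding vectors DE1 and DE2."""
    first = random.sample(sentences, k=len(sentences))   # primarily rearranged document
    second = random.sample(sentences, k=len(sentences))  # secondarily rearranged document
    de1, de2 = embed_document(first), embed_document(second)
    # mean squared difference; the weight updates drive this toward zero
    return sum((a - b) ** 2 for a, b in zip(de1, de2)) / len(de1)
```

Training would then feed each error back to the model to adjust its weights, corresponding to operations 509 and 517.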
FIG. 6 is a block diagram illustrating a configuration of an apparatus for summarizing a document according to an exemplary embodiment in the present disclosure, and FIG. 7 is a diagram illustrating a process of summarizing a document in the apparatus for summarizing a document according to an exemplary embodiment in the present disclosure. - Referring to
FIGS. 6 and 7 , an apparatus 600 for summarizing a document may include a preprocessing module 602, a machine learning module 604, and a summary extracting module 606. - The
preprocessing module 602 may segment an input document into each sentence and perform tokenization on each sentence. The preprocessing module 602 may extract a document representative token representing the corresponding document from the document, and may extract a representative sentence token representing the corresponding sentence from each sentence. The preprocessing module 602 may locate the document representative token D at the front of the document and may locate the representative sentence token SR at the front of the corresponding sentence, and may input the same to the machine learning module 604. - The
machine learning module 604 may include a machine learning model 604a for performing an extractive summary task. The machine learning model 604a may be a model trained according to the exemplary embodiments illustrated in FIGS. 1 to 5 . - The
machine learning model 604a may embed the input document representative token D and output the document representative embedding vector DE. The machine learning model 604a may embed the representative sentence token SR in each input sentence token sequence and output representative sentence embedding vectors SRE1, SRE2, SRE3, . . . , SREn. - The
summary extracting module 606 may calculate a similarity between the document representative embedding vector DE and each of the representative sentence embedding vectors SRE1, SRE2, SRE3, . . . , SREn. The summary extracting module 606 may extract a sentence in which the calculated similarity is equal to or greater than a preset threshold value to summarize the corresponding document. - That is, the
machine learning model 604a is trained, through self-supervised learning, to capture the information in the document well when generating the document representative embedding vector DE, and is trained to properly handle the order of the sentences in the document when generating the representative sentence embedding vectors SRE1, SRE2, SRE3, . . . , SREn. Accordingly, a representative sentence embedding vector that is highly similar to the document representative embedding vector DE has a high probability of properly reflecting the whole information of the corresponding document, and the summary extracting module 606 extracts sentences having a calculated similarity equal to or greater than a preset threshold value to summarize the corresponding document. - In this case, the number of extracted sentences may be adjusted according to the preset threshold value: the higher the threshold value, the fewer sentences are extracted from the document. Thus, a non-fixed number of sentences may be extracted from the document by using the threshold value; alternatively, the top N sentences may be extracted in a ranking manner.
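The similarity-based extraction above, including both the threshold mode and the top-N ranking mode, can be sketched as follows. Cosine similarity is an illustrative assumption; the disclosure does not fix a particular similarity measure.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors (an illustrative choice)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def extract_summary(doc_vec, sent_vecs, sentences, threshold=None, top_n=None):
    """Score each sentence against the document representative embedding DE,
    then keep sentences above the threshold (non-fixed count) or, if no
    threshold is given, the top-N sentences by rank (fixed count)."""
    scored = [(cosine(doc_vec, sv), s) for sv, s in zip(sent_vecs, sentences)]
    if threshold is not None:
        return [s for sim, s in scored if sim >= threshold]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [s for _, s in scored[:top_n]]
```

Raising the threshold shrinks the summary, matching the behavior described above.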
- According to the disclosed exemplary embodiment, the machine learning model may be trained to predict the order of sentences in a document and minimize a difference between two documents that differ only in sentence order, so that the machine learning model may extract an important sentence even for a document with no correct answers, may exclude a subjectivity factor of a specific person when summarizing the document, and may summarize the document by extracting various numbers of sentences from the document. Also, the time and cost required to train the machine learning model may be reduced.
- In this specification, a module may refer to a functional and structural combination of hardware for carrying out the technical idea of the present disclosure and software for driving the hardware. For example, the "module" may refer to a logical unit of predetermined code and a hardware resource for executing the predetermined code, and does not necessarily refer to physically connected code or to a single type of hardware.
-
FIG. 8 is a flowchart illustrating a method for summarizing a document based on machine learning according to an exemplary embodiment in the present disclosure. The method illustrated in FIG. 8 may be performed, for example, by the apparatus 600 for summarizing a document of FIG. 6 . - Referring to
FIG. 8 , the apparatus 600 for summarizing a document segments an input document into sentences and then performs tokenization (801). Here, the apparatus 600 for summarizing a document extracts a document representative token from the document, and extracts a representative sentence token representing the corresponding sentence from each sentence. - Thereafter, the
apparatus 600 for summarizing a document locates the document representative token at the front of the document, locates each representative sentence token at the front of the corresponding sentence, and inputs a token sequence for the corresponding document into the machine learning model 604a (803). - Thereafter, the
apparatus 600 for summarizing a document outputs a document representative embedding vector in which the document representative token is embedded through the machine learning model 604a, and outputs the representative sentence embedding vectors in which each representative sentence token is embedded (805). - Thereafter, the
apparatus 600 for summarizing a document calculates a similarity between the document representative embedding vector and each representative sentence embedding vector (807). - Thereafter, the
apparatus 600 for summarizing a document extracts sentences having the calculated similarity equal to or greater than a preset threshold value and summarizes the corresponding document (809). -
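Operations 801 and 803 above can be sketched with the token layout the method describes. The period-based sentence splitting, whitespace tokenization, and the literal marker strings `[D]` and `[SR]` are illustrative assumptions, not the actual segmentation or tokens of the disclosure.

```python
def build_token_sequence(document):
    """Segment the document into sentences, tokenize each one, and place the
    special tokens at the front: one document representative token for the
    whole document, and one representative sentence token per sentence."""
    sentences = [s.strip() for s in document.split(".") if s.strip()]
    token_seq = ["[D]"]                      # document representative token at the front
    for sentence in sentences:
        token_seq.append("[SR]")             # representative sentence token at the front
        token_seq.extend(sentence.split())   # word-level tokenization, for illustration
    return token_seq
```

The resulting sequence is what would be fed to the machine learning model 604a, which embeds the `[D]` and `[SR]` positions to produce the document and sentence embedding vectors.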
FIG. 9 is a block diagram illustrating a computing environment 10 including a computing device suitable for use in exemplary embodiments. In the illustrated exemplary embodiment, each component may have functions and capabilities different from those described below, and additional components other than those described below may be included. - The illustrated
computing environment 10 includes a computing device 12. In an exemplary embodiment, the computing device 12 may be a training apparatus 100 of a machine learning model. In addition, the computing device 12 may be an apparatus 600 for summarizing a document. - The
computing device 12 includes at least one processor 14, a computer-readable storage medium 16, and a communication bus 18. The processor 14 may cause the computing device 12 to operate according to the exemplary embodiments described above. For example, the processor 14 may execute one or more programs stored in the computer-readable storage medium 16. The one or more programs may include one or more computer-executable instructions that, when executed by the processor 14, cause the computing device 12 to perform operations according to the exemplary embodiment. - The computer-
readable storage medium 16 is configured to store computer-executable instructions or program code, program data, and/or other suitable forms of information. The program 20 stored in the computer-readable storage medium 16 includes a set of instructions executable by the processor 14. In an exemplary embodiment, the computer-readable storage medium 16 includes a memory (a volatile memory, such as random access memory, a non-volatile memory, or suitable combinations thereof), one or more magnetic disk storage devices, optical disk storage devices, flash memory devices, other forms of storage media that may be accessed by the computing device 12 and store desired information, or suitable combinations thereof. - The
communication bus 18 interconnects various other components of the computing device 12, including the processor 14 and the computer-readable storage medium 16. - The
computing device 12 may also include one or more input/output (I/O) interfaces 22 providing interfaces for one or more I/O devices 24, and one or more network communication interfaces 26. The I/O interface 22 and the network communication interface 26 are connected to the communication bus 18. The I/O device 24 may be connected to other components of the computing device 12 via the I/O interface 22. The I/O device 24 may include input devices such as pointing devices (such as computer mice or trackpads), keyboards, touch input devices (such as touchpads or touchscreens), voice or sound input devices, various types of sensor devices, and/or imaging devices, and/or output devices such as display devices, printers, speakers, and/or network cards. The I/O device 24 may be included in the computing device 12 as a component constituting the computing device 12, or may be connected to the computing device 12 as a separate device distinct from the computing device 12. - According to the exemplary embodiments, the machine learning model is trained to predict the order of sentences in a document and to minimize a difference between two documents that differ only in sentence order, so that an important sentence may be extracted even for a document with no correct answers, a subjectivity factor of a specific person may be excluded when summarizing a document, and a varying number of sentences may be extracted from the document to summarize the document. In addition, the time and cost required to train the machine learning model may be reduced.
- While exemplary embodiments have been illustrated and described above, it will be apparent to those skilled in the art that modifications and variations could be made without departing from the scope of the present disclosure as defined by the appended claims.
Claims (23)
1. An apparatus of training a machine learning model, the apparatus comprising:
a processor;
a memory storing one or more programs configured to be executed by the processor; and
the one or more programs including instructions for:
a preprocessing module configured for segmenting a document into sentences and performing tokenization to generate a token sequence for the document, the token sequence including a document representative token representing the document and representative sentence tokens representing the sentences, respectively;
a first training module configured for training the machine learning model to predict an order of the sentences in the document, based on the token sequence for the document; and
a second training module configured for training the machine learning model to perform document similarity maximization, based on the token sequence for the document.
2. The apparatus of claim 1 , wherein the preprocessing module is configured for tokenizing the sentences to generate sentence token sequences; and
the first training module is configured to rearrange an order of the sentence token sequences and input the rearranged sentence token sequences to the machine learning model.
3. The apparatus of claim 2 , wherein the representative sentence token for a corresponding sentence is located at the front of the sentence token sequence, and the machine learning model embeds the representative sentence tokens in the rearranged sentence token sequences to generate embedding vectors, respectively, and predicts the order of the sentences in the document, based on the generated embedding vectors.
4. The apparatus of claim 3 , wherein the first training module is configured to calculate a first error by comparing a predicted sentence order output from the machine learning model with an original sentence order of the document and adjust a weight of the machine learning model to minimize the first error.
5. The apparatus of claim 1 , wherein the second training module is configured to input two documents that differ in the order of sentences to the machine learning model, and train the machine learning model so that a difference between the two documents is minimized.
6. The apparatus of claim 5 , wherein the second training module is configured to primarily rearrange the order of the sentence token sequences in the document, and input the primarily rearranged sentence token sequences and a first document representative token representing the primarily rearranged document to the machine learning model.
7. The apparatus of claim 6 , wherein the second training module is configured to:
secondarily rearrange the sentence token sequences in the document so that the order of the secondarily rearranged sentence token sequences is different from the order of the primarily rearranged sentence token sequences; and
input the secondarily rearranged sentence token sequences and a second document representative token representing the secondarily rearranged document to the machine learning model.
8. The apparatus of claim 7 , wherein the machine learning model is configured to embed the first document representative token to generate a first embedding vector, and embed the second document representative token to generate a second embedding vector.
9. The apparatus of claim 8 , wherein the second training module is configured to calculate a second error through a difference between the first embedding vector and the second embedding vector, and adjust a weight of the machine learning model to minimize the second error.
10. The apparatus of claim 1 , wherein a loss function (Loss) of the machine learning model is expressed by the following equation:
Loss=LossSOP+α·LossDSM [Equation]
where LossSOP is a loss function for sentence order prediction;
LossDSM is a loss function for document similarity maximization; and
α is a normalization parameter.
11. A method of training a machine learning model, the method performed by an apparatus comprising a preprocessing module, a first training module, and a second training module, the method comprising:
segmenting, by the preprocessing module, a document into sentences and performing tokenization to generate a token sequence for the document, the token sequence including a document representative token representing the document and representative sentence tokens representing the sentences, respectively;
training, with the first training module, the machine learning model to predict an order of sentences in the document, based on the token sequence for the document; and
training, with the second training module, the machine learning model to perform document similarity maximization based on the token sequence for the document.
12. The method of claim 11 , further comprising: tokenizing the sentences to generate sentence token sequences,
wherein the training of the machine learning model to predict the order of sentences includes:
rearranging the order of the sentence token sequences in the document and inputting the rearranged sentence token sequences into the machine learning model;
calculating a first error by comparing a predicted sentence order output from the machine learning model with an original sentence order of the document; and
adjusting a weight of the machine learning model to minimize the first error.
13. The method of claim 12 , wherein the representative sentence token for a corresponding sentence is located at the front of the sentence token sequence, and the machine learning model embeds the representative sentence tokens in the rearranged sentence token sequences to generate embedding vectors, respectively, and predicts the order of the sentences in the document, based on the generated embedding vectors.
14. The method of claim 11 , wherein the training of the machine learning model to perform document similarity maximization includes:
inputting two documents that differ in the order of sentences to the machine learning model; and
training the machine learning model so that a difference between the two documents is minimized.
15. The method of claim 14 , wherein the training of the machine learning model to perform document similarity maximization includes:
primarily rearranging the order of the sentence token sequences in the document;
inputting the primarily rearranged sentence token sequences and a first document representative token representing the primarily rearranged document to the machine learning model;
secondarily rearranging the sentence token sequences in the document so that the order of the secondarily rearranged sentence token sequences is different from the order of the primarily rearranged sentence token sequences; and
inputting the secondarily rearranged sentence token sequences and a second document representative token representing the secondarily rearranged document to the machine learning model.
16. The method of claim 15 , wherein the machine learning model embeds the first document representative token to generate a first embedding vector, and embeds the second document representative token to generate a second embedding vector.
17. The method of claim 16 , wherein the training of the machine learning model to perform document similarity maximization includes:
calculating a second error through a difference between the first embedding vector and the second embedding vector; and
adjusting a weight of the machine learning model to minimize the second error.
18. The method of claim 11 , wherein a loss function (Loss) of the machine learning model is expressed by the following equation:
Loss=LossSOP+α·LossDSM [Equation]
where LossSOP is a loss function for sentence order prediction;
LossDSM is a loss function for document similarity maximization; and
α is a normalization parameter.
19. An apparatus for summarizing a document, the apparatus comprising:
a processor;
a memory storing one or more programs configured to be executed by the processor; and
the one or more programs including instructions for:
a preprocessing module configured for segmenting a document into sentences and performing tokenization to generate a token sequence for the document, the token sequence including a document representative token representing the document and representative sentence tokens representing the sentences, respectively;
a machine learning module including a machine learning model, the machine learning module configured for receiving the token sequence for the document and embedding the document representative token and the representative sentence tokens to output a document representative embedding vector and representative sentence embedding vectors; and
a summary extracting module configured for calculating a similarity between the document representative embedding vector and each of the representative sentence embedding vectors and summarizing the document according to the calculated similarity.
20. The apparatus of claim 19 , wherein the preprocessing module is configured to locate the document representative token at a front of the document, and locate each representative sentence token at a front of a corresponding sentence.
21. The apparatus of claim 20 , wherein the machine learning model is configured to be trained to predict an order of the sentences in the document based on the token sequence for the document, and to be trained to perform document similarity maximization, based on the token sequence for the document.
22. The apparatus of claim 21 , wherein the summary extracting module is configured to summarize the document by extracting a sentence having the similarity greater than or equal to a preset threshold value.
23. A method for summarizing a document, the method performed by an apparatus comprising a preprocessing module, a machine learning module, and a summary extracting module, the method comprising:
segmenting, by the preprocessing module, a document into sentences and performing tokenization to generate a token sequence for the document, the token sequence including a document representative token representing the document and representative sentence tokens representing the sentences, respectively;
receiving, by the machine learning module including a machine learning model, the token sequence for the document and embedding the document representative token and the representative sentence tokens to output a document representative embedding vector and representative sentence embedding vectors; and
calculating, by the summary extracting module, a similarity between the document representative embedding vector and each representative sentence embedding vector and summarizing the document according to the calculated similarity.
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
KR1020210140881A KR20230056985A (en) | 2021-10-21 | 2021-10-21 | Training apparatus and method of machine learning model, and apparatus and method for document summary using the same |
KR10-2021-0140881 | 2021-10-21 |
Publications (1)
Publication Number | Publication Date |
---|---|
US20230131259A1 true US20230131259A1 (en) | 2023-04-27 |
Family
ID=86055775
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US17/968,193 Pending US20230131259A1 (en) | 2021-10-21 | 2022-10-18 | Apparatus and method of training machine learning model, and apparatus and method for summarizing document using the same |
Country Status (2)
Country | Link |
---|---|
US (1) | US20230131259A1 (en) |
KR (1) | KR20230056985A (en) |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
KR102264899B1 (en) | 2019-03-15 | 2021-06-11 | 에스케이텔레콤 주식회사 | A natural language processing system, a learning method for the same and computer-readable recording medium with program |
-
2021
- 2021-10-21 KR KR1020210140881A patent/KR20230056985A/en unknown
-
2022
- 2022-10-18 US US17/968,193 patent/US20230131259A1/en active Pending
Also Published As
Publication number | Publication date |
---|---|
KR20230056985A (en) | 2023-04-28 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
AS | Assignment |
Owner name: SAMSUNG SDS CO., LTD., KOREA, REPUBLIC OF Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNORS:LEE, HYUNJAE;KIM, JUDONG;CHOI, HYUNJIN;AND OTHERS;SIGNING DATES FROM 20221013 TO 20221018;REEL/FRAME:061455/0342 |
|
STPP | Information on status: patent application and granting procedure in general |
Free format text: DOCKETED NEW CASE - READY FOR EXAMINATION |