WO2022174804A1 - Text simplification method and apparatus, device, and storage medium - Google Patents

Text simplification method and apparatus, device, and storage medium

Info

Publication number
WO2022174804A1
WO2022174804A1 (PCT/CN2022/076729)
Authority
WO
WIPO (PCT)
Prior art keywords
text
difficulty coefficient
difficulty
coefficient
target
Prior art date
Application number
PCT/CN2022/076729
Other languages
English (en)
Chinese (zh)
Inventor
张闯
吴培昊
Original Assignee
北京有竹居网络技术有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 北京有竹居网络技术有限公司 filed Critical 北京有竹居网络技术有限公司
Publication of WO2022174804A1 publication Critical patent/WO2022174804A1/fr

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00: Handling natural language data
    • G06F 40/20: Natural language analysis
    • G06F 40/205: Parsing
    • G06F 40/211: Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
    • G06F 40/253: Grammatical analysis; Style critique
    • G06F 40/279: Recognition of textual entities
    • G06F 40/289: Phrasal analysis, e.g. finite state techniques or chunking

Definitions

  • the embodiments of the present disclosure relate to the technical field of natural language processing, for example, to a text simplification method, apparatus, device, and storage medium.
  • Text simplification refers to converting text containing complex sentence patterns and vocabulary into text with simple sentence patterns and vocabulary in order to reduce the difficulty and complexity of the text. Simplified texts are easier to understand and read for foreign language learners or people with low knowledge levels.
  • Traditional text simplification methods mainly include rule-based simplification methods, lexical simplification methods, grammar simplification methods and end-to-end simplification methods, but their simplification results are poor.
  • Embodiments of the present disclosure provide a text simplification method, apparatus, device, and storage medium, which can optimize related text simplification schemes to meet the needs of different groups of people.
  • an embodiment of the present disclosure provides a text simplification method, including:
  • acquiring a target text difficulty coefficient and a first text to be simplified, and determining a first text difficulty coefficient of the first text, where the target text difficulty coefficient is the text difficulty coefficient of the simplified first text; and simplifying the first text according to the first text difficulty coefficient and the target text difficulty coefficient to obtain a target text corresponding to the target text difficulty coefficient.
  • an embodiment of the present disclosure further provides a text simplification device, including:
  • an acquisition module configured to acquire the target text difficulty coefficient and the first text to be simplified, and determine the first text difficulty coefficient of the first text, where the target text difficulty coefficient is the simplified text difficulty coefficient of the first text;
  • the simplification module is configured to simplify the first text according to the first text difficulty coefficient and the target text difficulty coefficient to obtain a target text corresponding to the target text difficulty coefficient.
  • an embodiment of the present disclosure also provides an electronic device, including:
  • one or more processors;
  • a memory arranged to store one or more programs;
  • when the one or more programs are executed by the one or more processors, the text simplification method of the first aspect is implemented.
  • an embodiment of the present disclosure further provides a computer-readable storage medium on which a computer program is stored, and when the program is executed by a processor, implements the text simplification method described in the first aspect.
  • FIG. 1 is a flowchart of a text simplification method provided by an embodiment of the present disclosure;
  • FIG. 2 is a flowchart of a text simplification method provided by another embodiment of the present disclosure;
  • FIG. 3 is a schematic diagram showing a user interface provided by another embodiment of the present disclosure.
  • FIG. 4 is a schematic diagram of a syntax tree according to another embodiment of the present disclosure.
  • FIG. 5 is a schematic diagram showing a first text and a target text according to another embodiment of the present disclosure.
  • FIG. 6 is a flowchart of a text simplification method provided by another embodiment of the present disclosure.
  • FIG. 7 is a structural diagram of a text simplification device according to an embodiment of the present disclosure.
  • FIG. 8 is a structural diagram of an electronic device according to an embodiment of the present disclosure.
  • the term "including" and variations thereof are open-ended inclusions, i.e., "including but not limited to".
  • the term “based on” is “based at least in part on.”
  • the term “one embodiment” means “at least one embodiment”; the term “another embodiment” means “at least one additional embodiment”; the term “some embodiments” means “at least some embodiments”. Relevant definitions of other terms will be given in the description below.
  • FIG. 1 is a flowchart of a text simplification method provided by an embodiment of the present disclosure. This embodiment is applicable to the simplification of text containing complex sentence patterns and vocabulary; the simplified text is easier for foreign language learners or users with a low knowledge level to understand and read.
  • the method may be executed by a text simplification apparatus, which may be implemented in software and/or hardware, and may be configured in an electronic device with a data processing function. As shown in Figure 1, the method may include the following steps:
  • the target text difficulty coefficient is the text difficulty coefficient of the simplified first text.
  • the first text may be text containing complex sentence patterns and vocabulary, and may also be referred to as complex text, which is difficult to understand and read for foreign language learners or users with low knowledge levels. To meet the needs of more users, complex text can be simplified and reduced in difficulty and complexity.
  • the first text may be acquired from a graded reading website, or may be acquired from an existing public data set, and of course may also be acquired in other ways, which are not specifically limited in the embodiment.
  • the first text may contain only one language, for example, may contain only Chinese or English, or may contain multiple languages at the same time, for example, may contain both Chinese and English, of course, other types of languages may also be selected according to requirements.
  • the text difficulty coefficient is used to indicate the complexity of the corresponding text.
  • the first text difficulty coefficient in this embodiment may indicate the complexity of the first text
  • the target text difficulty coefficient may indicate the complexity of the target text.
  • the target text is text that meets the user's needs, that is, the simplified first text.
  • the text difficulty factor may include one or more of a lexical difficulty factor, a syntactic difficulty factor, and a length difficulty factor. The more information the text difficulty coefficient contains, the easier the obtained simplified text is to meet the needs of users.
  • the lexical difficulty coefficient is used to indicate the difficulty of the vocabulary contained in the text. For example, it can be represented by the frequency of use of the corresponding vocabulary. The higher the frequency of use, the simpler the vocabulary is, and the higher the corresponding vocabulary difficulty coefficient.
  • the syntactic difficulty coefficient is used to indicate the difficulty degree of the sentence patterns used in the text.
  • the difficulty degree of the syntax can be determined by analyzing the sentence patterns contained in the text, and the syntactic difficulty coefficient can be obtained.
  • the sentence patterns contained in the text may include, but are not limited to, attributive clauses, adverbial clauses and other sentence patterns.
  • the length difficulty coefficient is used to indicate the length of the sentence corresponding to the text. Generally, the longer the sentence is, the more difficult it is to understand, and the shorter the sentence is, the easier it is to understand. For example, the number of words in the sentence corresponding to the text can be counted as the length of the sentence, and the length difficulty coefficient can be obtained. Punctuation can be ignored when counting the vocabulary of sentences.
  • the target text difficulty coefficient can be customized by the user according to their own needs, or can be determined according to the user's information. For example, one or more of the corresponding lexical difficulty coefficient, syntactic difficulty coefficient and length difficulty coefficient can be determined according to the user's knowledge level to obtain the target text difficulty coefficient.
  • the content included in the first text difficulty coefficient corresponds to the content included in the target text difficulty coefficient, so that the first text can be effectively simplified.
  • For example, when the first text difficulty coefficient includes the lexical difficulty coefficient, the target text difficulty coefficient also includes the lexical difficulty coefficient; when the first text difficulty coefficient includes the lexical difficulty coefficient and the syntactic difficulty coefficient, the target text difficulty coefficient also includes the lexical difficulty coefficient and the syntactic difficulty coefficient.
  • the present embodiment can simplify the first text based on the text difficulty coefficient of the first text and the target text difficulty coefficient, so as to obtain simple texts that meet the needs of different users, so that the degree of simplification of the texts is controllable and the flexibility of text simplification is improved.
  • For example, the lexical difficulty ratio may be determined based on the first text difficulty coefficient and the target text difficulty coefficient, and the lexical difficulty ratio and the first text may be input into a pre-trained text simplification model; the pre-trained text simplification model outputs the simplified text as the target text corresponding to the target text difficulty coefficient.
  • When the first text difficulty coefficient and the target text difficulty coefficient also include other difficulty coefficients, for example, the syntactic difficulty coefficient, the syntactic difficulty ratio can be determined at the same time, and the lexical difficulty ratio, the syntactic difficulty ratio and the first text can be input into the pre-trained text simplification model to obtain the target text.
  • the structure and training methods of text simplification models corresponding to different text difficulty coefficients are different.
  • the text simplification model may adopt an end-to-end neural network model, and the embodiment does not limit the specific structure of the neural network model.
  • the first text may also be simplified in other ways in combination with the first text difficulty coefficient and the target text difficulty coefficient, which is not specifically limited in the embodiment.
  • An embodiment of the present disclosure provides a text simplification method: a target text difficulty coefficient and a first text to be simplified are acquired, and a first text difficulty coefficient of the first text is determined, where the target text difficulty coefficient is the text difficulty coefficient of the simplified first text; the first text is simplified according to the first text difficulty coefficient and the target text difficulty coefficient to obtain a target text corresponding to the target text difficulty coefficient.
  • the above solution introduces a text difficulty coefficient; the text to be simplified is simplified according to its text difficulty coefficient and the text difficulty coefficient of the target text, so that the degree of simplification of the text is controllable and the simplification requirements of different users are met.
  • FIG. 2 is a flowchart of a text simplification method provided by another embodiment of the present disclosure.
  • the text simplification process is described by taking the text difficulty coefficient including the lexical difficulty coefficient, the syntactic difficulty coefficient and the length difficulty coefficient as an example.
  • Referring to Figure 2, the method may include the following steps:
  • the target text difficulty factor can be obtained as follows:
  • the lexical difficulty coefficient, the syntactic difficulty coefficient and the length difficulty coefficient of the target text input by the user are received, and the target text difficulty coefficient is obtained.
  • FIG. 3 is a schematic diagram showing a user interface provided by another embodiment of the present disclosure.
  • the user interface can be displayed in an electronic device such as a mobile phone or a computer.
  • the user interface includes a text difficulty coefficient display area for the complex text, a text difficulty coefficient display area for the target text, and a complex text input area, where the complex text is the first text described in the above embodiment.
  • the text difficulty coefficient display area of the complex text is used to display the text difficulty coefficient of the complex text, and the value of each part is determined by the electronic device based on the complex text input in the complex text input area.
  • the text difficulty coefficient display area of the target text is used to display the text difficulty coefficient of the target text. Users can input the corresponding lexical difficulty coefficient, syntactic difficulty coefficient and length difficulty coefficient in this area according to their own difficulty requirements, so that the simplified text is more in line with the user's needs.
  • the text difficulty coefficient display area of the target text shown in FIG. 3 exemplarily provides a simplified requirement, that is, the lexical difficulty coefficient and the syntactic difficulty coefficient are both 2, and the length difficulty coefficient is 3.
  • In FIG. 3, the black upright triangle represents an increase and the black inverted triangle represents a decrease: clicking the black upright triangle increases the value of the corresponding difficulty coefficient, and clicking the black inverted triangle decreases it, so that the degree of text simplification is controllable and the difficulty requirements of different users are met.
  • the difficulty coefficient of the target text can also be obtained in the following manner:
  • the user's identity information is acquired, the identity information including the knowledge level of the user; based on the knowledge level, the lexical difficulty coefficient, the syntactic difficulty coefficient and the length difficulty coefficient of the target text are determined to obtain the target text difficulty coefficient.
  • the user's knowledge level can be expressed by primary, intermediate and advanced levels, or by A, B, C, etc.
  • the primary or A level indicates that the user's knowledge level is low or that there is cognitive impairment, the intermediate or B level indicates that the user's knowledge level is medium, and the advanced or C level indicates that the user's knowledge level is high. Of course, other levels can be added, such as a lower level and a higher level; the finer the division, the better the obtained simplified text matches the user's needs.
  • the user's identity information may be information that reflects the user's knowledge level, for example, it may include the user's educational background or education level, and the user's knowledge level is determined based on that educational background or education level.
  • the corresponding text difficulty coefficient can be further determined.
  • the text difficulty coefficient table may be searched based on the knowledge level of the user, and the lexical difficulty coefficient, syntactic difficulty coefficient, and length difficulty coefficient corresponding to the knowledge level are obtained.
  • the text difficulty coefficient table is used to store the text difficulty coefficients corresponding to different knowledge levels.
  • a knowledge level can correspond to one text difficulty coefficient or to a variety of text difficulty coefficients. When it corresponds to a variety of text difficulty coefficients, a text difficulty coefficient with lower difficulty can be selected from among them as the text difficulty coefficient corresponding to the knowledge level, which is convenient for users to understand and read.
  • the text difficulty coefficient can also be customized according to the knowledge level of the user.
  • the obtained text difficulty coefficient can be random, that is, different users of the same knowledge level may obtain different text difficulty coefficients; or it can be fixed, that is, different users of the same knowledge level obtain the same text difficulty coefficient.
  • users can choose an appropriate determination method according to their own needs. This method does not require the user to determine the text difficulty factor by himself, so as to avoid the situation that some users cannot obtain a suitable simplified text because they do not know how to determine an appropriate text difficulty factor.
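  • As an illustration of this lookup, the sketch below maps a knowledge level to a hypothetical set of difficulty coefficients. The level names follow the primary/intermediate/advanced division above, while the concrete coefficient values are invented for the example and are not taken from the disclosure.

```python
# Hypothetical text difficulty coefficient table: maps a knowledge level to
# (lexical, syntactic, length) difficulty coefficients. The values are
# illustrative only; per the definitions above, a higher lexical coefficient
# means more frequent (simpler) words, while lower syntactic and length
# coefficients mean shallower parse trees and shorter sentences.
TEXT_DIFFICULTY_TABLE = {
    "primary":      {"lexical": 8000.0, "syntactic": 3, "length": 10},
    "intermediate": {"lexical": 5000.0, "syntactic": 5, "length": 20},
    "advanced":     {"lexical": 2000.0, "syntactic": 8, "length": 35},
}

def target_difficulty_for(knowledge_level: str) -> dict:
    """Look up the text difficulty coefficient table by knowledge level."""
    return TEXT_DIFFICULTY_TABLE[knowledge_level]

print(target_difficulty_for("primary"))
```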
  • the embodiment does not limit the order of determining the vocabulary difficulty coefficient, the syntactic difficulty coefficient, and the length difficulty coefficient.
  • the above three kinds of difficulty coefficients may be determined at the same time, or each difficulty coefficient may be determined in a certain order.
  • the lexical difficulty coefficient may be determined first, then the syntactic difficulty coefficient, and finally the length difficulty coefficient.
  • the lexical difficulty factor of the first text may be determined as follows:
  • a corpus sample is acquired, where the corpus sample includes at least one participle;
  • the corpus sample is input into a pre-trained word embedding model, the pre-trained word embedding model outputs the frequency with which each participle in the corpus sample appears in the corpus sample, and a dictionary is formed based on each participle and the corresponding frequency;
  • word segmentation is performed on the first text to obtain at least one text word segmentation, the frequency of each text word segmentation is looked up in the dictionary, and the average of the frequencies of the at least one text word segmentation is used as the lexical difficulty coefficient of the first text.
  • the corpus sample is used to construct a dictionary, which contains the word segments contained in the corpus sample and the frequency of each word segment in the corpus sample.
  • the accuracy of the dictionary directly affects the accuracy of the lexical difficulty coefficient.
  • the embodiment does not limit the acquisition method of the corpus sample, for example, it can be obtained from the Wikipedia corpus. In order to improve the accuracy of the result, a large number of sentences can be selected from the Wikipedia corpus as the corpus sample.
  • the word embedding model is used to determine the frequency of occurrence of each word in the input corpus sample in the corpus sample. For example, a fasttext (fast text classification) model can be trained to obtain a word embedding model.
  • the output result of the word embedding model is of the form {"we": 5923, "I": 8765}, where 5923 represents the frequency with which the participle "we" appears in the selected corpus sample, and 8765 represents the frequency with which the participle "I" appears in the selected corpus sample. In practical applications, the output of the word embedding model contains many more participles and their frequencies. Based on this output, a dictionary containing a certain number of participles and their corresponding frequencies can be obtained, which provides a basis for the subsequent determination of the lexical difficulty coefficient of the first text.
  • the lexical difficulty coefficient of the first text may be determined based on the frequency corresponding to each participle in the first text, and the frequency corresponding to each participle in the first text may be obtained by searching the above dictionary.
  • For example, if the first text contains the participle "we", looking up the above dictionary shows that the frequency corresponding to this participle is 5923.
  • word segmentation processing may be performed on the first text to obtain at least one word segmentation included in the first text, and the embodiment does not limit the specific process of word segmentation.
  • the frequency corresponding to each participle in the first text can be obtained.
  • the average value of the multiple word segmentation frequencies in the first text can be determined, and the average value can be used as the lexical difficulty coefficient of the first text.
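  • A minimal sketch of this computation is shown below. It stands in a plain word-frequency Counter for the pre-trained word embedding model (e.g. a fastText model) described above and uses a simple regular-expression tokenizer; both choices are assumptions made for the example, not the disclosure's exact implementation.

```python
import re
from collections import Counter

def build_frequency_dictionary(corpus_sentences):
    # Count how often each participle (word token) appears in the corpus sample.
    # A plain Counter stands in for the frequency output of the pre-trained
    # word embedding model; the tokenizer is a simple regex.
    counts = Counter()
    for sentence in corpus_sentences:
        counts.update(re.findall(r"[\w']+", sentence.lower()))
    return counts

def lexical_difficulty_coefficient(first_text, dictionary):
    # Look up the frequency of each word segmentation of the first text in the
    # dictionary and use the average frequency as the lexical difficulty coefficient.
    tokens = re.findall(r"[\w']+", first_text.lower())
    if not tokens:
        return 0.0
    return sum(dictionary.get(token, 0) for token in tokens) / len(tokens)

dictionary = build_frequency_dictionary(["we saw the dog", "I saw a man in the park"])
print(lexical_difficulty_coefficient("the dog saw a man", dictionary))
```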
  • the syntactic difficulty factor of the first text may be determined as follows:
  • the number of levels included in the syntax tree is determined, and the number of levels is used as a syntax difficulty coefficient of the first text.
  • a syntax parsing tool can be used to parse the first text to obtain a syntax tree, and the difficulty of its syntax can be measured based on the height of the syntax tree.
  • the syntax parsing tool can use the open source CoreNLP tool.
  • CoreNLP is a set of natural language analysis tools written in Java that can give information such as the base form, part of speech, and dependency relations of the input text. Referring to FIG. 4, FIG. 4 exemplarily provides a schematic diagram of a syntax tree; the first text corresponding to the syntax tree is "the dog saw a man in the park".
  • In FIG. 4, S stands for the sentence (the first text), NP for noun phrase, VP for verb phrase, V for verb, P for preposition, N for noun, Det for determiner, and PP for prepositional phrase.
  • the number of levels of the syntax tree is 4, that is, the height is 4, and the corresponding syntax difficulty coefficient of the first text is 4. The higher the syntax tree, the harder the syntax of the text.
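  • The sketch below computes such a coefficient with nltk, using a hard-coded bracketed parse of the FIG. 4 sentence in place of a live CoreNLP call; the bracketing and the level-counting convention are assumptions, so the resulting number need not equal the 4 levels drawn in the figure.

```python
from nltk.tree import Tree

# Bracketed constituency parse of "the dog saw a man in the park", hard-coded
# here in place of the output of a parser such as CoreNLP.
parse = Tree.fromstring(
    "(S (NP (Det the) (N dog)) "
    "(VP (V saw) (NP (Det a) (N man)) (PP (P in) (NP (Det the) (N park)))))"
)

# Use the number of levels of the syntax tree as the syntactic difficulty
# coefficient. Note that nltk's Tree.height() also counts the leaf level, so
# the absolute value depends on the counting convention; what matters is that
# deeper trees yield larger coefficients.
syntactic_difficulty = parse.height()
print(syntactic_difficulty)
```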
  • the length difficulty factor of the first text can be determined as follows:
  • word segmentation can be performed on the first text, and then the number of word segmentations contained in the text can be counted to obtain the length difficulty coefficient of the first text. For example, if the first text contains 5 word segmentations, the length difficulty coefficient of the first text is 5. When counting the number of word segmentations contained in the first text, punctuation can be ignored. Usually, the more word segments the text contains, the longer it is and the harder it is to understand.
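  • A corresponding sketch for the length difficulty coefficient is given below; the regular expression keeps word tokens (including numbers) and drops punctuation, which is one simple way to realise the counting rule above.

```python
import re

def length_difficulty_coefficient(text):
    # Count the word segmentations in the text, ignoring punctuation, and use
    # the count as the length difficulty coefficient.
    return len(re.findall(r"[\w']+", text))

# 14 word segmentations with this simple tokenization.
print(length_difficulty_coefficient(
    "History Landsberg prison, which is in the town's western outskirts, "
    "was completed in 1910"
))
```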
  • In response to determining that the target text contains a word to be deleted, a first mark is added to the word to be deleted; in response to determining that the target text contains a replacement word, a second mark is added to the replacement word; in response to determining that the target text contains an inserted word, a third mark is added to the inserted word.
  • During simplification, a certain word in the first text may or may not need to be deleted. A special mark, that is, the above-mentioned first mark, may be added to the word to be deleted. The situation for the target text obtained after simplification is similar: the actually obtained target text may contain replacement words and inserted words, and special marks, i.e., the above-mentioned second mark and third mark, can be added to them. Specially marking the words to be deleted, the replacement words and the inserted words makes it easy for users to see and understand the simplification process. The embodiments do not limit the ways of marking different words.
  • FIG. 5 exemplarily shows a schematic diagram of the display of the first text and the target text.
  • the figure shows the text before simplification, that is, the first text described in the above-mentioned embodiment, and the simplified text, that is, the target text described in the above-mentioned embodiment.
  • the original text before simplification is: History Landsberg prison, which is in the town's western outskirts, was completed in 1910
  • the simplified text is: History Landsberg prison is in the town's western part. It was completed in 1910.
  • Figure 5 shows the words ", which" and "outskirts" to be deleted in the first text in underlined form, the replacement word "part" in the target text in bold, and the inserted words ". It" in italics.
  • For example, the words to be deleted can be shown in red, the replacement words in green, and the inserted words in blue, so that users can easily and intuitively see the effect before and after the simplification.
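  • One way to derive such marks automatically is to diff the first text against the target text word by word, as in the sketch below. The bracketed [DEL]/[REP]/[INS] tags are stand-ins for the underline, bold and italics (or red, green and blue) rendering described above, and the word-level diff itself is an implementation choice, not something the disclosure prescribes.

```python
import difflib

def mark_edits(first_text, target_text):
    # Compare the first text with the target text word by word and attach a
    # first mark to deleted words, a second mark to replacement words and a
    # third mark to inserted words.
    src, tgt = first_text.split(), target_text.split()
    out = []
    for op, i1, i2, j1, j2 in difflib.SequenceMatcher(a=src, b=tgt).get_opcodes():
        if op == "equal":
            out.extend(tgt[j1:j2])
        elif op == "delete":
            out.extend(f"[DEL]{w}" for w in src[i1:i2])
        elif op == "replace":
            out.extend(f"[REP]{w}" for w in tgt[j1:j2])
        elif op == "insert":
            out.extend(f"[INS]{w}" for w in tgt[j1:j2])
    return " ".join(out)

print(mark_edits(
    "History Landsberg prison , which is in the town's western outskirts , was completed in 1910",
    "History Landsberg prison is in the town's western part . It was completed in 1910",
))
```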
  • the embodiment of the present disclosure provides a text simplification method, which introduces a text difficulty coefficient into text simplification, so that the corresponding text can be simplified according to the difficulty requirements of users, meeting the needs of different users, making the degree of text simplification controllable, and increasing the diversity of simplified text.
  • FIG. 6 is a flowchart of a text simplification method provided by another embodiment of the present disclosure. This embodiment is refined on the basis of the foregoing embodiment. Referring to FIG. 6 , the method may include the following steps:
  • the text difficulty control coefficient is used to control the degree of simplification of the first text, and may include at least one of a lexical difficulty coefficient ratio, a syntactic difficulty coefficient ratio, and a length difficulty coefficient ratio.
  • when the first text difficulty coefficient and the target text difficulty coefficient include the lexical difficulty coefficient, the corresponding text difficulty control coefficient may include the lexical difficulty coefficient ratio;
  • when the first text difficulty coefficient and the target text difficulty coefficient include the lexical difficulty coefficient and the syntactic difficulty coefficient, the corresponding text difficulty control coefficient may include the lexical difficulty coefficient ratio and the syntactic difficulty coefficient ratio;
  • when the first text difficulty coefficient and the target text difficulty coefficient include the lexical difficulty coefficient, the syntactic difficulty coefficient and the length difficulty coefficient, the corresponding text difficulty control coefficient may include the lexical difficulty coefficient ratio, the syntactic difficulty coefficient ratio and the length difficulty coefficient ratio.
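  • The disclosure does not pin down how each "difficulty coefficient ratio" is computed; the sketch below assumes the simple ratio of the target coefficient to the first-text coefficient per dimension, which is one plausible reading.

```python
def text_difficulty_control_coefficient(first_coeffs, target_coeffs):
    # first_coeffs / target_coeffs: dicts such as
    # {"lexical": 4500.0, "syntactic": 6, "length": 25}.
    # Assumed definition: ratio of the target coefficient to the first-text
    # coefficient for every dimension present in both.
    return {
        key: target_coeffs[key] / first_coeffs[key]
        for key in target_coeffs
        if key in first_coeffs
    }

control = text_difficulty_control_coefficient(
    {"lexical": 4500.0, "syntactic": 6, "length": 25},   # first text
    {"lexical": 6000.0, "syntactic": 4, "length": 15},   # target
)
print(control)  # e.g. {'lexical': 1.33..., 'syntactic': 0.66..., 'length': 0.6}
```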
  • the target text simplification model is a trained text simplification model, and the text simplification model in this embodiment takes an end-to-end neural network model as an example.
  • the target text simplification model can be obtained in the following ways:
  • training samples are acquired, where the training samples include complex samples and simple samples corresponding to the complex samples;
  • the sample difficulty control coefficient and the complex samples are input into an initial text simplification model for training, and the initial text simplification model for which the similarity between its output and the corresponding simple sample is greater than or equal to the set threshold is recorded as the target text simplification model.
  • Training samples can be obtained from the wikilarge data sets of Wikipedia and Simple Wikipedia; they can also be obtained by crawling data from graded reading websites such as newsela or readworks and then deriving complex samples and simple samples from the obtained data; or they can be obtained by manual rewriting, that is, the relevant personnel rewrite the complex text into simplified text to obtain training samples.
  • the text difficulty coefficient of the complex text corresponding to the complex samples and the text difficulty coefficient of the simple text corresponding to the simple samples are obtained, and the sample difficulty control coefficient is then derived from them.
  • the size of the set threshold can be set according to actual needs.
  • the initial text simplification model can be a Seq2Seq model composed of an encoder and a decoder. Both the encoder and the decoder can be composed of Long Short-Term Memory (LSTM), Transformer and other modules.
  • the Transformer module can be a pre-trained text-to-text conversion model (Transfer Text-to-Text Transformer, T5).
  • the target text simplification model obtained through the above training can provide low-difficulty simplified texts for elementary school students, who have low requirements on vocabulary, syntax and length, and higher-difficulty simplified texts for high school students, who have higher requirements on vocabulary, syntax and length. This effectively avoids the situation where a traditional Seq2Seq model cannot control the degree of simplification of the text, so that primary school and high school students get the same simplified text, which cannot meet the needs of primary school students.
  • the text difficulty control coefficient and the first text can be input into the target text simplification model, and the corresponding simplified text is output from the target text simplification model.
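  • The sketch below wires these pieces together with an off-the-shelf T5 checkpoint from the transformers library: the control coefficients are serialised as tokens prepended to the complex text, and the model generates the simplified text. The prefix format, the checkpoint name and the generation settings are all assumptions; a model actually conditioned on difficulty would have to be fine-tuned on (control coefficient, complex sample, simple sample) triples as described above.

```python
from transformers import T5ForConditionalGeneration, T5TokenizerFast

tokenizer = T5TokenizerFast.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

def simplify(first_text, control):
    # Serialise the text difficulty control coefficients as a textual prefix so
    # that a (fine-tuned) model can condition its output on the requested
    # degree of simplification.
    prefix = (
        f"<LEX_{control['lexical']:.2f}> "
        f"<SYN_{control['syntactic']:.2f}> "
        f"<LEN_{control['length']:.2f}> "
    )
    inputs = tokenizer(prefix + "simplify: " + first_text,
                       return_tensors="pt", truncation=True)
    output_ids = model.generate(**inputs, max_new_tokens=128, num_beams=4)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)

print(simplify(
    "History Landsberg prison, which is in the town's western outskirts, was completed in 1910.",
    {"lexical": 1.33, "syntactic": 0.67, "length": 0.6},
))
```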
  • the user can input the complex text to be simplified in the complex text input area of the user interface shown in FIG. 3, and the electronic device can determine the corresponding lexical difficulty coefficient, syntax difficulty coefficient and length difficulty coefficient based on the received complex text, And displayed in the text difficulty coefficient display area of the complex text shown in Figure 3.
  • the user can also input the desired lexical difficulty coefficient, syntax difficulty coefficient and length difficulty coefficient in the text difficulty coefficient display area of the target text shown in FIG. 3 .
  • When the electronic device detects that the user has confirmed the simplification, it can determine the text difficulty control coefficient corresponding to this simplification based on the lexical difficulty coefficient, syntactic difficulty coefficient and length difficulty coefficient input by the user, combined with the lexical difficulty coefficient, syntactic difficulty coefficient and length difficulty coefficient of the complex text, then input the text difficulty control coefficient and the complex text into the target text simplification model, output the corresponding simplified text from the target text simplification model, and display it in the area shown in Figure 5 to facilitate the user to view the simplification effect.
  • the user can click the confirmation button of the user interface to determine to simplify the current text, and the electronic device can perform subsequent simplification operations when detecting that the confirmation button is triggered.
  • the embodiments of the present disclosure provide a text simplification method.
  • an initial text simplification model can be trained by using text difficulty control parameters and complex samples, so that the target text simplification model obtained by training can control the degree of simplification of complex text according to the user's difficulty requirements, which improves the diversity and flexibility of the simplified text and meets the needs of different users.
  • When the text difficulty control coefficient includes only some of the above parameters, the simplification process is similar to that when all three parameters are included at the same time, and details are not repeated here.
  • FIG. 7 is a structural diagram of a text simplification apparatus provided by an embodiment of the present disclosure.
  • the apparatus can perform the text simplification method described in the above-mentioned embodiments.
  • the apparatus may include:
  • the obtaining module 41 is configured to obtain the target text difficulty coefficient and the first text to be simplified, and determine the first text difficulty coefficient of the first text, where the target text difficulty coefficient is the text difficulty coefficient of the simplified first text;
  • the simplification module 42 is configured to simplify the first text according to the first text difficulty coefficient and the target text difficulty coefficient to obtain a target text corresponding to the target text difficulty coefficient.
  • An embodiment of the present disclosure provides a text simplification device. The device acquires a target text difficulty coefficient and a first text to be simplified, and determines a first text difficulty coefficient of the first text, where the target text difficulty coefficient is the text difficulty coefficient of the simplified first text; according to the first text difficulty coefficient and the target text difficulty coefficient, the first text is simplified to obtain a target text corresponding to the target text difficulty coefficient.
  • the above solution introduces a text difficulty coefficient. According to the text difficulty coefficient of the text to be simplified and the text difficulty coefficient of the target text, the text to be simplified is simplified, so that the degree of simplification of the text is controllable and the simplification needs of different users are met.
  • the first text difficulty coefficient includes at least one of a lexical difficulty coefficient, a syntactic difficulty coefficient and a length difficulty coefficient of the first text;
  • the target text difficulty coefficient includes at least one of a lexical difficulty coefficient, a syntactic difficulty coefficient and a length difficulty coefficient of the target text.
  • obtaining the target text difficulty coefficient includes:
  • receiving at least one of the lexical difficulty coefficient, the syntactic difficulty coefficient and the length difficulty coefficient of the target text input by the user, to obtain the target text difficulty coefficient.
  • obtaining the target text difficulty coefficient includes:
  • acquiring the user's identity information, the identity information including the knowledge level of the user;
  • determining, based on the knowledge level, at least one of the lexical difficulty coefficient, the syntactic difficulty coefficient and the length difficulty coefficient of the target text, to obtain the target text difficulty coefficient.
  • determining the lexical difficulty coefficient of the first text includes:
  • a corpus sample is acquired, where the corpus sample includes at least one participle;
  • the corpus sample is input into a pre-trained word embedding model, the pre-trained word embedding model outputs the frequency with which each participle in the corpus sample appears in the corpus sample, and a dictionary is formed based on each participle and the corresponding frequency;
  • word segmentation is performed on the first text to obtain at least one text word segmentation, the frequency of each text word segmentation is looked up in the dictionary, and the average of the frequencies of the at least one text word segmentation is used as the lexical difficulty coefficient of the first text.
  • determining the syntactic difficulty coefficient of the first text includes:
  • the number of levels included in the syntax tree is determined, and the number of levels is used as a syntax difficulty coefficient of the first text.
  • determining the length difficulty coefficient of the first text includes:
  • performing word segmentation on the first text, and counting the number of word segmentations contained in the first text to obtain the length difficulty coefficient of the first text.
  • the simplified module 42 is set to:
  • the text difficulty control coefficient and the first text are input into the target text simplification model, and the target text simplification model outputs the target text corresponding to the target text difficulty coefficient.
  • the target text simplification model is obtained in the following manner:
  • the training samples include complex samples and simple samples corresponding to the complex samples;
  • the initial text simplification model whose similarity is greater than or equal to the set threshold is recorded as the target text simplification model.
  • the device may also include:
  • the display module is configured to, after the first text is simplified according to the first text difficulty coefficient and the target text difficulty coefficient to obtain the target text corresponding to the target text difficulty coefficient, display the first text and the target text.
  • in response to determining that the target text contains a word to be deleted, a first mark is added to the word to be deleted; in response to determining that the target text contains a replacement word, a second mark is added to the replacement word; in response to determining that the target text contains an inserted word, a third mark is added to the inserted word.
  • the text simplification device provided by the embodiment of the present disclosure and the text simplification method provided by the above-mentioned embodiments belong to the same concept. For technical details not described in detail in this embodiment, please refer to the above-mentioned embodiments; this embodiment has the same beneficial effects as performing the text simplification method.
  • FIG. 8 it shows a schematic structural diagram of an electronic device 500 suitable for implementing an embodiment of the present disclosure.
  • the electronic devices in the embodiments of the present disclosure may include, but are not limited to, such as mobile phones, notebook computers, digital broadcast receivers, PDAs (personal digital assistants), PADs (tablets), PMPs (portable multimedia players), vehicle-mounted terminals (eg, mobile terminals such as in-vehicle navigation terminals), etc., and stationary terminals such as digital TVs, desktop computers, and the like.
  • the electronic device shown in FIG. 8 is only an example, and should not impose any limitation on the function and scope of use of the embodiments of the present disclosure.
  • an electronic device 500 may include a processing device (e.g., a central processing unit, a graphics processor, etc.) 501, which may execute various appropriate actions and processes according to a program stored in a read-only memory (ROM) 502 or a program loaded from a storage device 508 into a random access memory (RAM) 503.
  • In the RAM 503, various programs and data required for the operation of the electronic device 500 are also stored.
  • the processing device 501, the ROM 502, and the RAM 503 are connected to each other through a bus 504.
  • An input/output (I/O) interface 505 is also connected to bus 504 .
  • The following devices may be connected to the I/O interface 505: input devices 506 including, for example, a touch screen, touchpad, keyboard, mouse, camera, microphone, accelerometer, gyroscope, etc.; output devices 507 including, for example, a liquid crystal display (LCD), speakers, a vibrator, etc.; and storage devices 508 including, for example, a magnetic tape, a hard disk, etc.
  • Communication means 509 may allow electronic device 500 to communicate wirelessly or by wire with other devices to exchange data.
  • Although FIG. 8 shows the electronic device 500 having various devices, it should be understood that not all of the illustrated devices are required to be implemented or provided. More or fewer devices may alternatively be implemented or provided.
  • embodiments of the present disclosure include a computer program product comprising a computer program carried on a non-transitory computer readable medium, the computer program containing program code for performing the method illustrated in the flowchart.
  • the computer program may be downloaded and installed from the network via the communication device 509, or from the storage device 508, or from the ROM 502.
  • the processing apparatus 501 When the computer program is executed by the processing apparatus 501, the above-mentioned functions defined in the methods of the embodiments of the present disclosure are executed.
  • the computer-readable medium described above in the present disclosure may be a computer-readable signal medium or a computer-readable storage medium, or any combination of the two.
  • the computer-readable storage medium can be, for example, but not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus or device, or a combination of any of the above. More specific examples of computer readable storage media may include, but are not limited to, electrical connections with one or more wires, portable computer disks, hard disks, random access memory (RAM), read only memory (ROM), erasable Programmable read only memory (EPROM or flash memory), fiber optics, portable compact disk read only memory (CD-ROM), optical storage devices, magnetic storage devices, or any suitable combination of the foregoing.
  • a computer-readable storage medium may be any tangible medium that contains or stores a program that can be used by or in conjunction with an instruction execution system, apparatus, or device.
  • a computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave with computer-readable program code embodied thereon. Such propagated data signals may take a variety of forms, including but not limited to electromagnetic signals, optical signals, or any suitable combination of the foregoing.
  • a computer-readable signal medium can also be any computer-readable medium other than a computer-readable storage medium that can transmit, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device .
  • Program code embodied on a computer readable medium may be transmitted using any suitable medium including, but not limited to, electrical wire, optical fiber cable, RF (radio frequency), etc., or any suitable combination of the foregoing.
  • the client and the server can communicate using any currently known or future developed network protocol such as HTTP (Hyper Text Transfer Protocol), and can be interconnected with digital data communication (e.g., a communication network) in any form or medium.
  • Examples of communication networks include local area networks (LANs), wide area networks (WANs), internetworks (e.g., the Internet), and peer-to-peer networks (e.g., ad hoc peer-to-peer networks), as well as any currently known or future developed networks.
  • the above-mentioned computer-readable medium may be included in the above-mentioned electronic device; or may exist alone without being assembled into the electronic device.
  • the above-mentioned computer-readable medium carries one or more programs, and when the above-mentioned one or more programs are executed by the electronic device, the electronic device is caused to: acquire the target text difficulty coefficient and the first text to be simplified, and determine the first text difficulty coefficient of the first text, where the target text difficulty coefficient is the text difficulty coefficient of the simplified first text; and simplify the first text according to the first text difficulty coefficient and the target text difficulty coefficient to obtain the target text corresponding to the target text difficulty coefficient.
  • Computer program code for performing the operations of the present disclosure may be written in one or more programming languages or combinations thereof, including but not limited to object-oriented programming languages such as Java, Smalltalk and C++, as well as conventional procedural programming languages such as the "C" language or similar programming languages.
  • the program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server.
  • the remote computer may be connected to the user's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computer (e.g., through an Internet connection using an Internet service provider).
  • each block in the flowchart or block diagrams may represent a module, segment, or portion of code that contains one or more executable instructions for implementing the specified logical functions.
  • the functions noted in the blocks may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved.
  • each block of the block diagrams and/or flowchart illustrations, and combinations of blocks in the block diagrams and/or flowchart illustrations, can be implemented in dedicated hardware-based systems that perform the specified functions or operations, or can be implemented in a combination of dedicated hardware and computer instructions.
  • the modules involved in the embodiments of the present disclosure may be implemented in software or hardware.
  • the name of the module does not constitute a limitation of the module itself under certain circumstances, for example, the acquisition module can also be described as "a module for acquiring the difficulty factor of the target text and the first text to be simplified".
  • exemplary types of hardware logic components include: Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), Systems on Chips (SOCs), Complex Programmable Logical Devices (CPLDs) and more.
  • a machine-readable medium may be a tangible medium that may contain or store a program for use by or in connection with the instruction execution system, apparatus or device.
  • the machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium.
  • Machine-readable media may include, but are not limited to, electronic, magnetic, optical, electromagnetic, infrared, or semiconductor systems, devices, or devices, or any suitable combination of the foregoing.
  • machine-readable storage media would include one or more wire-based electrical connections, portable computer disks, hard disks, random access memory (RAM), read only memory (ROM), erasable programmable read only memory (EPROM or flash memory), fiber optics, compact disk read only memory (CD-ROM), optical storage, magnetic storage, or any suitable combination of the foregoing.
  • the present disclosure provides a text simplification method, including:
  • acquiring a target text difficulty coefficient and a first text to be simplified, and determining a first text difficulty coefficient of the first text, where the target text difficulty coefficient is the text difficulty coefficient of the simplified first text; and simplifying the first text according to the first text difficulty coefficient and the target text difficulty coefficient to obtain a target text corresponding to the target text difficulty coefficient.
  • the first text difficulty coefficient includes at least one of a lexical difficulty coefficient, a syntactic difficulty coefficient, and a length difficulty coefficient of the first text;
  • the target text difficulty coefficient includes at least one of a lexical difficulty coefficient, a syntactic difficulty coefficient and a length difficulty coefficient of the target text.
  • obtaining the target text difficulty coefficient includes:
  • receiving at least one of the lexical difficulty coefficient, the syntactic difficulty coefficient and the length difficulty coefficient of the target text input by the user, to obtain the target text difficulty coefficient.
  • obtaining the target text difficulty coefficient includes:
  • acquiring the user's identity information, the identity information including the knowledge level of the user;
  • determining, based on the knowledge level, at least one of the lexical difficulty coefficient, the syntactic difficulty coefficient and the length difficulty coefficient of the target text, to obtain the target text difficulty coefficient.
  • determining the lexical difficulty coefficient of the first text includes:
  • a corpus sample is acquired, where the corpus sample includes at least one participle;
  • the corpus sample is input into a pre-trained word embedding model, the pre-trained word embedding model outputs the frequency with which each participle in the corpus sample appears in the corpus sample, and a dictionary is formed based on each participle and the corresponding frequency;
  • word segmentation is performed on the first text to obtain at least one text word segmentation, the frequency of each text word segmentation is looked up in the dictionary, and the average of the frequencies of the at least one text word segmentation is used as the lexical difficulty coefficient of the first text.
  • determining the syntactic difficulty coefficient of the first text includes:
  • the number of levels included in the syntax tree is determined, and the number of levels is used as a syntax difficulty coefficient of the first text.
  • determining the length difficulty coefficient of the first text includes:
  • performing word segmentation on the first text, and counting the number of word segmentations contained in the first text to obtain the length difficulty coefficient of the first text.
  • simplifying the first text according to the first text difficulty coefficient and the target text difficulty coefficient to obtain the target text corresponding to the target text difficulty coefficient includes:
  • the text difficulty control coefficient and the first text are input into the target text simplification model, and the target text simplification model outputs the target text corresponding to the target text difficulty coefficient.
  • the target text simplification model is obtained in the following manner:
  • training samples are acquired, where the training samples include complex samples and simple samples corresponding to the complex samples;
  • the sample difficulty control coefficient and the complex samples are input into an initial text simplification model for training, and the initial text simplification model for which the similarity between its output and the corresponding simple sample is greater than or equal to the set threshold is recorded as the target text simplification model.
  • In some embodiments, after the first text is simplified according to the first text difficulty coefficient and the target text difficulty coefficient to obtain the target text corresponding to the target text difficulty coefficient, the method also includes: displaying the first text and the target text;
  • in response to determining that the target text contains a word to be deleted, a first mark is added to the word to be deleted; in response to determining that the target text contains a replacement word, a second mark is added to the replacement word; in response to determining that the target text contains an inserted word, a third mark is added to the inserted word.
  • the present disclosure provides a text simplification apparatus, including:
  • an acquisition module configured to acquire the target text difficulty coefficient and the first text to be simplified, and determine the first text difficulty coefficient of the first text, where the target text difficulty coefficient is the simplified text difficulty coefficient of the first text;
  • the simplification module is configured to simplify the first text according to the first text difficulty coefficient and the target text difficulty coefficient to obtain a target text corresponding to the target text difficulty coefficient.
  • the present disclosure provides an electronic device, comprising:
  • one or more processors;
  • a memory arranged to store one or more programs;
  • when the one or more programs are executed by the one or more processors, the text simplification method according to any one of the present disclosure is implemented.
  • the present disclosure provides a computer-readable storage medium on which a computer program is stored, and when the program is executed by a processor, implements the text simplification method according to any one of the present disclosure.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Machine Translation (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present invention relates to a text simplification method and apparatus, a device, and a storage medium. The method comprises: acquiring a target text difficulty coefficient and a first text to be simplified, and determining a first text difficulty coefficient of the first text (S110), the target text difficulty coefficient being a text difficulty coefficient of a simplified first text; and simplifying the first text according to the first text difficulty coefficient and the target text difficulty coefficient, so as to obtain a target text corresponding to the target text difficulty coefficient (S120).
PCT/CN2022/076729 2021-02-20 2022-02-18 Procédé et appareil de simplification de texte, dispositif et support de stockage WO2022174804A1 (fr)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202110193908.9 2021-02-20
CN202110193908.9A CN112906372A (zh) 2021-02-20 2021-02-20 一种文本简化方法、装置、设备及存储介质

Publications (1)

Publication Number Publication Date
WO2022174804A1 true WO2022174804A1 (fr) 2022-08-25

Family

ID=76124144

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/076729 WO2022174804A1 (fr) 2021-02-20 2022-02-18 Procédé et appareil de simplification de texte, dispositif et support de stockage

Country Status (2)

Country Link
CN (1) CN112906372A (fr)
WO (1) WO2022174804A1 (fr)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112906372A (zh) * 2021-02-20 2021-06-04 北京有竹居网络技术有限公司 一种文本简化方法、装置、设备及存储介质

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180189262A1 (en) * 2017-01-02 2018-07-05 International Business Machines Corporation Enhancing QA System Cognition With Improved Lexical Simplification Using Multilingual Resources
CN110543639A (zh) * 2019-09-12 2019-12-06 扬州大学 一种基于预训练Transformer语言模型的英文句子简化算法
US20200042547A1 (en) * 2018-08-06 2020-02-06 Koninklijke Philips N.V. Unsupervised text simplification using autoencoders with a constrained decoder
CN110853422A (zh) * 2018-08-01 2020-02-28 世学(深圳)科技有限公司 一种沉浸式语言学习系统及其学习方法
CN111401032A (zh) * 2020-03-09 2020-07-10 腾讯科技(深圳)有限公司 文本处理方法、装置、计算机设备和存储介质
CN112906372A (zh) * 2021-02-20 2021-06-04 北京有竹居网络技术有限公司 一种文本简化方法、装置、设备及存储介质

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101814066A (zh) * 2009-02-23 2010-08-25 富士通株式会社 文本阅读难度判断设备及其方法
US10460032B2 (en) * 2017-03-17 2019-10-29 International Business Machines Corporation Cognitive lexicon learning and predictive text replacement
US20190114300A1 (en) * 2017-10-13 2019-04-18 Choosito! Inc. Reading Level Based Text Simplification
US11042712B2 (en) * 2018-06-05 2021-06-22 Koninklijke Philips N.V. Simplifying and/or paraphrasing complex textual content by jointly learning semantic alignment and simplicity
CN112364829B (zh) * 2020-11-30 2023-03-24 北京有竹居网络技术有限公司 一种人脸识别方法、装置、设备及存储介质

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180189262A1 (en) * 2017-01-02 2018-07-05 International Business Machines Corporation Enhancing QA System Cognition With Improved Lexical Simplification Using Multilingual Resources
CN110853422A (zh) * 2018-08-01 2020-02-28 世学(深圳)科技有限公司 一种沉浸式语言学习系统及其学习方法
US20200042547A1 (en) * 2018-08-06 2020-02-06 Koninklijke Philips N.V. Unsupervised text simplification using autoencoders with a constrained decoder
CN110543639A (zh) * 2019-09-12 2019-12-06 扬州大学 一种基于预训练Transformer语言模型的英文句子简化算法
CN111401032A (zh) * 2020-03-09 2020-07-10 腾讯科技(深圳)有限公司 文本处理方法、装置、计算机设备和存储介质
CN112906372A (zh) * 2021-02-20 2021-06-04 北京有竹居网络技术有限公司 一种文本简化方法、装置、设备及存储介质

Also Published As

Publication number Publication date
CN112906372A (zh) 2021-06-04

Similar Documents

Publication Publication Date Title
CN111027331B (zh) 用于评估翻译质量的方法和装置
JP7296419B2 (ja) 品質評価モデルを構築するための方法および装置、電子機器、記憶媒体並びにコンピュータプログラム
WO2017024553A1 (fr) Procédé et système d'analyse d'émotion informationnelle
CN111382261B (zh) 摘要生成方法、装置、电子设备及存储介质
US20220374617A1 (en) Document translation method and apparatus, storage medium, and electronic device
JP7335300B2 (ja) 知識事前訓練モデルの訓練方法、装置及び電子機器
CN111046677B (zh) 一种翻译模型的获取方法、装置、设备和存储介质
CN111159220B (zh) 用于输出结构化查询语句的方法和装置
WO2022111347A1 (fr) Procédé et appareil de traitement d'informations, dispositif électronique et support de stockage
WO2021135319A1 (fr) Procédé et appareil de production de texte à base d'apprentissage profond et dispositif électronique
CN113139391B (zh) 翻译模型的训练方法、装置、设备和存储介质
CN113688256B (zh) 临床知识库的构建方法、装置
WO2022166613A1 (fr) Procédé et appareil de reconnaissance de rôle dans un texte, ainsi que support lisible et dispositif électronique
CN114385780B (zh) 程序接口信息推荐方法、装置、电子设备和可读介质
WO2023273598A1 (fr) Procédé et appareil de recherche de texte, support lisible et dispositif électronique
CN111339789B (zh) 一种翻译模型训练方法、装置、电子设备及存储介质
WO2022174804A1 (fr) Procédé et appareil de simplification de texte, dispositif et support de stockage
CN111046168B (zh) 用于生成专利概述信息的方法、装置、电子设备和介质
WO2023138361A1 (fr) Procédé et appareil de traitement d'image, support de stockage lisible et dispositif électronique
CN111555960A (zh) 信息生成的方法
CN111815274A (zh) 信息处理方法、装置和电子设备
WO2022121859A1 (fr) Procédé et appareil de traitement d'informations en une langue parlée et dispositif électronique
CN111368553B (zh) 智能词云图数据处理方法、装置、设备及存储介质
CN112509581B (zh) 语音识别后文本的纠错方法、装置、可读介质和电子设备
CN110852043B (zh) 一种文本转写方法、装置、设备及存储介质

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22755577

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 22755577

Country of ref document: EP

Kind code of ref document: A1