CN112632955B - Text set generation method and device, electronic equipment and medium - Google Patents

Text set generation method and device, electronic equipment and medium Download PDF

Info

Publication number
CN112632955B
CN112632955B (granted from application CN202011603012.5A)
Authority
CN
China
Prior art keywords
text
target text
confusion
text set
target
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202011603012.5A
Other languages
Chinese (zh)
Other versions
CN112632955A (en)
Inventor
赵忠信
张瀚予
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Wuba Co Ltd
Original Assignee
Wuba Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Wuba Co Ltd
Priority claimed from CN202011603012.5A
Publication of application CN112632955A
Application granted
Publication of granted patent CN112632955B
Legal status: Active

Classifications

    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06F — ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 — Handling natural language data
    • G06F 40/10 — Text processing
    • G06F 40/194 — Calculation of difference between files
    • G06F 40/166 — Editing, e.g. inserting or deleting
    • G06F 40/169 — Annotation, e.g. comment data or footnotes

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Machine Translation (AREA)

Abstract

The embodiments of the disclosure disclose a text set generation method and apparatus, an electronic device, and a medium. One embodiment of the method comprises: performing error pre-labeling on a pre-acquired target text to be corrected to obtain a labeled target text; constructing a confusion text set related to the labeled target text, where each confusion text is obtained by applying an erroneous modification to the labeled target text; constructing a directed acyclic graph associated with the target text from the confusion text set, where each path in the graph represents a text obtained by word-level processing of the target text; determining index information for each path in the graph, where each item of index information characterizes attribute features of the text corresponding to that path; and filtering the confusion text set based on the index information to obtain a filtered text set serving as a correction set for the target text. This embodiment can accurately and efficiently generate a corrected text set for the target text.

Description

Text set generation method and device, electronic equipment and medium
Technical Field
The embodiment of the disclosure relates to the technical field of computers, in particular to a text set generation method, a text set generation device, electronic equipment and a computer readable medium.
Background
With the continued proliferation of electronic text publications such as electronic books, electronic newspapers, e-mails, and office documents, ensuring the correctness of text has become increasingly important, and automatic proofreading of Chinese text has become a pressing research problem. A common approach is the grammar-rule-based method, which relies on a manually built dictionary of visually and phonetically similar characters to perform string replacement on matched error patterns.
However, text correction performed in this manner often suffers from the following technical problems: high manual construction cost, poor coverage, and an inability to recognize semantic-level errors.
Disclosure of Invention
This summary is provided to introduce a selection of concepts in a simplified form that are further described below in the detailed description. This summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used to limit the scope of the claimed subject matter.
Some embodiments of the present disclosure propose a text set generation method, apparatus, electronic device, and computer readable medium to solve one or more of the technical problems set forth in the background section above.
In a first aspect, some embodiments of the present disclosure provide a text set generation method, including: performing error pre-labeling on a pre-acquired target text to be corrected to obtain a labeled target text; constructing a confusion text set related to the labeled target text, where each confusion text in the set is obtained by applying an erroneous modification to the labeled target text; constructing a directed acyclic graph associated with the target text from the confusion text set, where each path in the graph represents a text obtained by word-level processing of the target text; determining index information for each path in the graph, where each item of index information characterizes attribute features of the text corresponding to that path; and filtering the confusion text set based on the index information to obtain a filtered text set serving as a correction set for the target text.
In a second aspect, some embodiments of the present disclosure provide a text set generating apparatus, including: a labeling unit configured to perform error pre-labeling on a pre-acquired target text to be corrected to obtain a labeled target text; a first construction unit configured to construct a confusion text set related to the labeled target text, where each confusion text in the set is obtained by applying an erroneous modification to the labeled target text; a second construction unit configured to construct a directed acyclic graph associated with the target text from the confusion text set, where each path in the graph represents a text obtained by word-level processing of the target text; a determining unit configured to determine index information for each path in the graph, where each item of index information characterizes attribute features of the text corresponding to that path; and a filtering unit configured to filter the confusion text set based on the index information to obtain a filtered text set as a correction set for the target text.
In a third aspect, some embodiments of the present disclosure provide an electronic device, comprising: one or more processors; a storage device having one or more programs stored thereon, which when executed by one or more processors, cause the one or more processors to implement a method as described in any implementation of the first aspect.
In a fourth aspect, some embodiments of the disclosure provide a computer readable medium having a computer program stored thereon, where the program when executed by a processor implements a method as described in any implementation of the first aspect.
The above embodiments of the present disclosure have the following beneficial effects: the text set generation method of some embodiments of the present disclosure can accurately and efficiently generate a corrected text set for the target text. In particular, existing grammar-rule-based methods rely on a manually built dictionary of visually and phonetically similar characters to perform string replacement on matched error patterns. By contrast, the method of some embodiments first performs error pre-labeling on the pre-acquired target text to be corrected, obtaining a labeled target text; this pre-labeling identifies the characters or words in the target text that may be erroneous more comprehensively and greatly improves the extensibility of the subsequent correction set. A confusion text set related to the labeled target text is then constructed, where each confusion text is obtained by applying an erroneous modification to the labeled target text. From the confusion text set, a directed acyclic graph associated with the target text is constructed, in which each path represents a text obtained by word-level processing of the target text; the graph visualizes the confusion text set and is used subsequently to determine the index information of each confusion text. The index information of each path is then determined, each item characterizing attribute features of the text corresponding to that path. Finally, the confusion text set is filtered according to the index information to retain high-quality confusion texts, and the filtered set is obtained as the correction set of the target text.
The text set generation method can thus generate the corrected text set for the target text accurately and efficiently.
Drawings
The above and other features, advantages and aspects of various embodiments of the present disclosure will become more apparent by referring to the following detailed description when taken in conjunction with the accompanying drawings. Throughout the drawings, the same or similar reference numbers refer to the same or similar elements. It should be understood that the drawings are schematic and that elements and features are not necessarily drawn to scale.
FIG. 1 is a schematic diagram of an application scenario of a text set generation method according to some embodiments of the present disclosure;
FIG. 2 is a flow diagram of some embodiments of a text set generation method according to the present disclosure;
FIG. 3 is a flow diagram of further embodiments of a text set generation method according to the present disclosure;
FIGS. 4-6 are schematic diagrams of generating a confusion text set in a text set generation method according to some embodiments of the present disclosure;
FIG. 7 is a schematic block diagram of some embodiments of a text set generating apparatus according to the present disclosure;
FIG. 8 is a schematic structural diagram of an electronic device suitable for implementing some embodiments of the present disclosure.
Detailed Description
Embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While certain embodiments of the present disclosure are shown in the drawings, it is to be understood that the disclosure may be embodied in various forms and should not be construed as limited to the embodiments set forth herein. Rather, these embodiments are provided for a more thorough and complete understanding of the present disclosure. It should be understood that the drawings and embodiments of the disclosure are for illustration purposes only and are not intended to limit the scope of the disclosure.
It should be noted that, for convenience of description, only the portions related to the present invention are shown in the drawings. The embodiments and features of the embodiments in the present disclosure may be combined with each other without conflict.
It should be noted that the terms "first", "second", and the like in the present disclosure are only used for distinguishing different devices, modules or units, and are not used for limiting the order or interdependence of the functions performed by the devices, modules or units.
It is noted that references to "a", "an", and "the" modifications in this disclosure are intended to be illustrative rather than limiting, and that those skilled in the art will recognize that "one or more" may be used unless the context clearly dictates otherwise.
The names of messages or information exchanged between devices in the embodiments of the present disclosure are for illustrative purposes only, and are not intended to limit the scope of the messages or information.
The present disclosure will be described in detail below with reference to the accompanying drawings in conjunction with embodiments.
FIG. 1 is a schematic diagram of one application scenario of a text set generation method according to some embodiments of the present disclosure.
In the application scenario of FIG. 1, the electronic device 101 may first perform error pre-labeling on a pre-acquired target text 102 to be corrected, obtaining a labeled target text 103. In this scenario, the target text 102 may be "a large increase in carbon dioxide emissions", and the labeled target text may be "TTFTTTFT". A confusion text set 104 related to the labeled target text 103 is then constructed; each confusion text in the set 104 is obtained by applying an erroneous modification to the labeled target text 103. In this scenario, the confusion text set 104 includes: confusion text 1041, "a large increase in carbon dioxide emission amount"; confusion text 1042, "a huge increase in the amount of carbon dioxide emissions"; confusion text 1043, "the discharge amount of carbon dioxide is greatly increased"; confusion text 1044, "increase in carbon dioxide emission amount"; confusion text 1045, "a large increase in the amount of emission of dioxygen"; and confusion text 1046, "the increase in the amount of carbon dioxide emission". Next, a directed acyclic graph 105 associated with the target text 102 is constructed from the confusion text set 104, where each path in the graph 105 represents a text obtained by word-level processing of the target text 102. The index information of each path in the graph 105 is then determined, each item characterizing attribute features of the text corresponding to that path.
In this scenario, the index information of the path "huge increase in carbon dioxide water discharge" in the graph 105 may be "perplexity: 0.8, edit distance: 1, path probability: 0.82"; of the path "huge increase in amount of carbon dioxide emissions", "perplexity: 0.7, edit distance: 1, path probability: 0.72"; of the path "sharp increase in carbon dioxide emission", "perplexity: 0.5, edit distance: 1, path probability: 0.48"; of the path "increase in carbon dioxide emission", "perplexity: 0.6, edit distance: 1, path probability: 0.7"; of the path "large increase in the amount of carbon dioxide", "perplexity: 0.82, edit distance: 1, path probability: 0.79"; and of the path "the increase in carbon dioxide emissions", "perplexity: 0.3, edit distance: 1, path probability: 0.42". Finally, the confusion text set 104 is filtered according to the index information, and the filtered text set is obtained as the correction set 107 of the target text 102. In this scenario, the correction set 107 includes corrected texts 1071, 1072, 1073, and 1074, whose contents are the same as those of confusion texts 1041, 1042, 1044, and 1045, respectively.
The electronic device 101 may be hardware or software. When the electronic device is hardware, the electronic device may be implemented as a distributed cluster formed by a plurality of servers or terminal devices, or may be implemented as a single server or a single terminal device. When the electronic device is embodied as software, it can be installed in the hardware devices enumerated above. It may be implemented, for example, as multiple software or software modules for providing distributed services, or as a single software or software module. And is not particularly limited herein.
It should be understood that the number of electronic devices in fig. 1 is merely illustrative. There may be any number of electronic devices, as desired for implementation.
With continued reference to FIG. 2, a flow 200 of some embodiments of a text set generation method according to the present disclosure is shown. The text set generation method comprises the following steps:
step 201, performing error pre-labeling on a pre-acquired target text to be corrected to obtain a labeled target text.
In some embodiments, an execution subject of the text set generation method (e.g., the electronic device shown in FIG. 1) may perform error pre-labeling on a pre-acquired target text to be corrected, obtaining a labeled target text. The target text may be a pre-selected text that may contain a textual error. Error pre-labeling marks the characters or words of the target text that may be erroneous. As an example, the target text may be "a large increase in the amount of carbon dioxide emissions". First, the possibly erroneous characters of the target text are determined, e.g. the two characters "carbon" and "huge". The target text is then labeled: each possibly erroneous character or word is replaced with "F", and the remaining characters or words are replaced with "T". The labeled target text is thereby obtained as "TTFTTTFT".
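The "T"/"F" labeling scheme described above can be sketched in a few lines. This is an illustrative sketch, not the patent's implementation: the set of suspect positions would in practice come from a trained detector, but is supplied by hand here, and the function name is invented.

```python
# Hypothetical sketch of the "T"/"F" error pre-labeling described above.
# suspect_positions would come from a trained error detector; here it is
# supplied by hand for illustration.

def pre_label(text, suspect_positions):
    """Mark each character 'F' if flagged as a possible error, else 'T'."""
    return "".join("F" if i in suspect_positions else "T"
                   for i in range(len(text)))

# An 8-character text with suspect characters at positions 2 and 6
# reproduces the label string used in the example above.
labeled = pre_label("ABCDEFGH", {2, 6})
print(labeled)  # TTFTTTFT
```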
As an example, a technician may manually perform error pre-labeling on the target text against a collected text library or word library, and the execution subject may receive the resulting labeling information to complete error pre-labeling of the pre-acquired target text.
In some optional implementations of some embodiments, the target text is input to a pre-trained target text preprocessing model to obtain the labeled target text. The target text preprocessing model performs error prediction and labeling for each character or word in the target text. The target text preprocessing model may be the discriminator network of an ELECTRA (Efficiently Learning an Encoder that Classifies Token Replacements Accurately) pre-trained text-encoder model.
It should be noted that the target text preprocessing model can locate error positions quickly and accurately, and offers strong domain transferability and extensibility.
Step 202, constructing an obfuscated text set related to the labeled target text.
In some embodiments, the execution body may construct an obfuscated text set associated with the labeled target text. And each confusion text in the confusion text set is a text obtained by performing error modification on the labeled target text.
As an example, the execution subject may randomly replace a predetermined number of characters or words in the target text to obtain the confusion text set.
As yet another example, the confusion text set may be obtained by randomly adding and/or deleting a predetermined number of characters or words in the target text.
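The random-edit examples above can be combined into one sketch. This is an assumption-laden illustration, not the patent's procedure: the candidate character pool, the number of variants, and the fixed random seed are all invented for reproducibility.

```python
import random

# Illustrative sketch: build a confusion set by randomly replacing,
# inserting, or deleting a fixed number of characters in the target text.
# The candidate pool and seed are assumptions for this example.

def make_confusions(text, pool, n_variants=5, n_edits=1, seed=0):
    rng = random.Random(seed)
    variants = set()
    while len(variants) < n_variants:
        chars = list(text)
        for _ in range(n_edits):
            op = rng.choice(("replace", "insert", "delete"))
            i = rng.randrange(len(chars))
            if op == "replace":
                chars[i] = rng.choice(pool)
            elif op == "insert":
                chars.insert(i, rng.choice(pool))
            elif len(chars) > 1:  # delete, but keep the text non-empty
                del chars[i]
        variant = "".join(chars)
        if variant != text:  # a variant must differ from the target
            variants.add(variant)
    return variants

confusions = make_confusions("carbon dioxide", pool="abcdexyz")
print(len(confusions))  # 5
```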
And step 203, constructing a directed acyclic graph associated with the target text according to the confusion text set.
In some embodiments, the execution subject may construct a directed acyclic graph associated with the target text according to the obfuscated text set. And each path in the directed acyclic graph represents a text obtained after the target text is subjected to word processing.
As an example, the execution subject may first take each character or word in a confusion sample as a node. Then, according to the context, the negative logarithm of the probability that the character or word appears is taken as its weight, i.e., -ln(P(w | context(w))), and a directed acyclic graph with sequential logical associations is constructed from these weights. Here P(w | context(w)) is the probability that the character or word w appears given context(w), the contextual information of w in the target text, which captures the relationship between w and its surrounding context. The probability that a character or word appears in a given context may be determined, for example, with a Hidden Markov Model (HMM).
The directed acyclic graph is constructed so that, from the confusion text set, more word-processed variants of the target text can be generated as candidates for its correction set.
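The -ln(P(w | context(w))) weighting can be illustrated with a deliberately simplified lattice. This sketch assumes each sentence position is an independent slot (a degenerate DAG), so the lowest-cost path reduces to picking the lowest-weight node per slot; the probabilities are invented, not produced by any real model.

```python
import math

# Minimal sketch of the word lattice described above: at each position the
# target word plus its confusion candidates are nodes weighted by
# -ln P(w | context); the lowest-total-weight path is the most likely text.
# Probabilities are invented for illustration; slots are independent here.

candidates = [  # one slot per position: (word, assumed P(word | context))
    [("carbon", 0.9), ("cardon", 0.05)],
    [("dioxide", 0.8), ("dioxid", 0.1)],
    [("emissions", 0.7), ("emission", 0.25)],
]

def best_path(slots):
    """Per slot, pick the word minimizing -ln P; return path and total cost."""
    path, cost = [], 0.0
    for slot in slots:
        word, p = min(slot, key=lambda wp: -math.log(wp[1]))
        path.append(word)
        cost += -math.log(p)
    return path, cost

path, cost = best_path(candidates)
print(path)  # ['carbon', 'dioxide', 'emissions']
```

A full implementation would add edges only between adjacent positions and run a shortest-path search over the lattice; the weighting principle is the same.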
And step 204, determining each index information of each path in the directed acyclic graph.
In some embodiments, the execution subject may determine the index information of each path in the directed acyclic graph, where each item of index information characterizes attribute features of the text corresponding to that path. As an example, the attribute features of a text may be its fluency information.
As an example, a technician may use a script to determine the index information of each path in the directed acyclic graph, and the execution subject may then receive the index information so determined.
In some optional implementations of some embodiments, the perplexity, edit distance, and corresponding probability of each path in the directed acyclic graph are determined. The edit distance measures the number of single-character insertion, deletion, and substitution operations required to convert the erroneous text into the target text; the smaller the edit distance, the more similar the erroneous text is to the target text. The probability corresponding to a path may be the probability information of the shortest path, which represents a maximum-likelihood estimate of the probability of the text occurring. The perplexity measures how fluent the text is; the lower the perplexity, the more fluent the text. It is computed as follows:
$$\mathrm{PPL}(w) = \exp\!\left(-\frac{1}{N}\sum_{i=1}^{N}\ln P\big(w_i \mid \mathrm{context}_{bi}(w_i)\big)\right)$$

where N is the number of characters or words in the target text, w_i is the i-th character or word of sentence w, context_bi(w_i) is the bidirectional contextual information of w_i, P(w_i | context_bi(w_i)) is the probability of w_i occurring in that context, and PPL(w) is the perplexity of the target text.
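The perplexity computation described above is a short function once the per-token probabilities are available. The probabilities below are invented, as if produced by a bidirectional language model; the function itself is just the exponentiated average negative log-probability.

```python
import math

# Sketch of the perplexity metric: exponentiated average negative
# log-probability of the tokens. Lower perplexity = more fluent text.
# The token probabilities are invented for illustration.

def perplexity(token_probs):
    n = len(token_probs)
    return math.exp(-sum(math.log(p) for p in token_probs) / n)

fluent = perplexity([0.9, 0.8, 0.85, 0.9])  # all tokens likely in context
clunky = perplexity([0.9, 0.1, 0.85, 0.9])  # one token is very unlikely
print(fluent < clunky)  # True
```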
And step 205, screening the confusion text set based on the index information to obtain a screened text set as a correction set of the target text.
In some embodiments, the executing body may filter the obfuscated text set in various ways according to the index information, and obtain a filtered text set as a correction set of the target text.
In some optional implementations of some embodiments, confusion texts whose index information satisfies predetermined conditions are selected from the confusion text set to obtain the correction set of the target text.
In some optional implementations of some embodiments, the target text is corrected according to its correction set. As an example, the candidate whose index information indicates the highest quality may first be selected from the correction set, and the target text then adjusted to match that candidate.
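The filtering and correction described in this step can be sketched with two of the indices named earlier: edit distance and perplexity. The thresholds and per-candidate perplexity values below are invented; the Levenshtein routine is a standard single-character edit distance, not the patent's specific metric implementation.

```python
# Hedged sketch of step 205: keep only confusion texts whose indices meet
# fixed thresholds, then correct the target by taking the surviving
# candidate with the lowest perplexity. Thresholds and perplexities are
# invented for illustration.

def levenshtein(a, b):
    """Single-character insert/delete/substitute edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def build_correction_set(target, candidates, max_ppl=0.75, max_edits=1):
    # candidates: {text: perplexity}; keep fluent texts close to the target
    return {t: ppl for t, ppl in candidates.items()
            if ppl <= max_ppl and levenshtein(target, t) <= max_edits}

cands = {"carbin dioxide": 0.9, "carbon dioxide": 0.3, "carbon dioxid": 0.6}
kept = build_correction_set("carbom dioxide", cands)
best = min(kept, key=kept.get)  # lowest perplexity among survivors
print(best)  # carbon dioxide
```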
The above embodiments of the present disclosure have the following beneficial effects: the text set generation method of some embodiments of the present disclosure can accurately and efficiently generate a corrected text set for the target text. In particular, existing grammar-rule-based methods rely on a manually built dictionary of visually and phonetically similar characters to perform string replacement on matched error patterns. By contrast, the method of some embodiments first performs error pre-labeling on the pre-acquired target text to be corrected, obtaining a labeled target text; this pre-labeling identifies the characters or words in the target text that may be erroneous more comprehensively and greatly improves the extensibility of the subsequent correction set. A confusion text set related to the labeled target text is then constructed, where each confusion text is obtained by applying an erroneous modification to the labeled target text. From the confusion text set, a directed acyclic graph associated with the target text is constructed, in which each path represents a text obtained by word-level processing of the target text; the graph visualizes the confusion text set and is used subsequently to determine the index information of each confusion text. The index information of each path is then determined, each item characterizing attribute features of the text corresponding to that path. Finally, the confusion text set is filtered according to the index information to retain high-quality confusion texts, and the filtered set is obtained as the correction set of the target text.
The text set generation method can thus generate the corrected text set for the target text accurately and efficiently.
With further reference to FIG. 3, a flow 300 of further embodiments of a text set generation method according to the present disclosure is shown. The text set generation method comprises the following steps:
step 301, performing error pre-labeling on a pre-acquired target text to be corrected to obtain a labeled target text.
Step 302, adding a first number of masking characters at positions associated with each character or word labeled as erroneous in the labeled target text to generate added texts, obtaining an added text set.
In some embodiments, an execution subject (e.g., the electronic device shown in FIG. 1) may add a first number of masking characters at the positions associated with each character or word labeled as erroneous in the labeled target text to generate added texts, resulting in an added text set. The associated position of an erroneously labeled character or word may be the position immediately to its left or right in the target text.
As an example, as shown in FIG. 4, a masking character "[M]" may be added at the left and right positions of the erroneously labeled "home" in the labeled target text 402 to generate added texts, resulting in an added text set. The target text 401 is "my hometown is in China". The labeled target text 402 may be "TTFTTTT". The added text set includes a first added text 403, "my [M] hometown in China", and a second added text 404, "my hometown [M] in China".
Step 303, replacing each character or word labeled as erroneous with the first number of masking characters to generate replacement texts, obtaining a replacement text set.
In some embodiments, the execution subject may replace each character or word labeled as erroneous in the labeled target text with the first number of masking characters to generate a replacement text, resulting in a replacement text set.
As an example, as shown in FIG. 5, the erroneously labeled "home" and "country" in the labeled target text 502 may each be replaced with the masking character "[M]" to generate replacement texts, resulting in a replacement text set. The target text 501 is "my hometown is in China". The labeled target text 502 may be "TTFTTTF". The replacement text set includes a first replacement text 503, "my [M] country in China", and a second replacement text 504, "my hometown in [M]".
And 304, deleting each character or word marked with errors in the marked target text to generate a deleted text, so as to obtain a deleted text set.
In some embodiments, the execution subject may delete each word or term that is incorrectly labeled in the labeled target text to generate a deleted text, resulting in a deleted text set.
As an example, as shown in FIG. 6, the erroneously labeled "with" and "village" in the labeled target text 602 may be deleted to generate deleted texts, resulting in a deleted text set. The target text 601 is "my hometown is in China". The labeled target text 602 may be "TFTTT". The deleted text set includes a first deleted text 603, "I am hometown in China", and a second deleted text 604, "my home is in China".
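Steps 302-304 can be sketched together: for each position flagged "F" in the label string, produce add, replace, and delete variants around a mask token. This is an illustrative sketch with a first number of 1; the "[M]" token and the T/F labeling follow the examples above, and the function name is invented.

```python
# Hypothetical sketch of steps 302-304: generate added, replaced, and
# deleted variants for each character flagged 'F' in the label string.

MASK = "[M]"

def mask_variants(text, labels):
    flagged = [i for i, tag in enumerate(labels) if tag == "F"]
    added, replaced, deleted = [], [], []
    for i in flagged:
        chars = list(text)
        # step 302: insert a mask immediately left and right of the char
        added.append("".join(chars[:i] + [MASK] + chars[i:]))
        added.append("".join(chars[:i + 1] + [MASK] + chars[i + 1:]))
        # step 303: replace the flagged char with a mask
        replaced.append("".join(chars[:i] + [MASK] + chars[i + 1:]))
        # step 304: delete the flagged char outright
        deleted.append("".join(chars[:i] + chars[i + 1:]))
    return added, replaced, deleted

added, replaced, deleted = mask_variants("ABCDEFG", "TTFTTTT")
print(replaced)  # ['AB[M]DEFG']
```

The added and replaced sets would then be fed to a masked language model (step 305) to fill in the "[M]" slots, while the deleted set joins the confusion set directly (step 306).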
And 305, inputting the added text set and the replaced text set into a mask language model trained in advance to obtain a sub-confusion text set.
In some embodiments, the execution subject may input the added text set and the replacement text set to a pre-trained mask language model to obtain a sub-confusion text set. The mask language model predicts the character or word corresponding to each masked position in the texts of the added and replacement text sets. As an example, the mask language model may be the masked language model (MLM) of the BERT model.
Step 306, determining the sub-confusion text set and the deletion text set as the confusion text set.
In some embodiments, the execution subject may determine the set of sub-obfuscated texts and the set of deleted texts as the set of obfuscated texts.
Step 307: construct a directed acyclic graph associated with the target text according to the confusion text set.
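One plausible way to realize such a graph, assuming for simplicity that all confusion texts have equal length, is a character lattice: nodes are character positions, each edge between adjacent positions carries a character that some confusion text uses there, and every path through the lattice spells one candidate text. A minimal sketch (function names are illustrative, not from the disclosure):

```python
from collections import defaultdict

def build_lattice(texts):
    """Collect, for each character position, the set of characters
    observed across the confusion texts (the edge labels i -> i+1)."""
    edges = defaultdict(set)
    for text in texts:
        for i, ch in enumerate(text):
            edges[i].add(ch)
    return edges

def enumerate_paths(edges, length):
    """Spell out every path through the lattice as a candidate text."""
    paths = [""]
    for i in range(length):
        paths = [p + ch for p in paths for ch in sorted(edges[i])]
    return paths

edges = build_lattice(["我的家乡在中国", "我的故乡在中国"])
paths = enumerate_paths(edges, 7)  # both input texts reappear as paths
```

Exhaustive enumeration is shown only for clarity; a production system would score paths on the lattice directly rather than materialize all combinations.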
Step 308: determine each piece of index information of each path in the directed acyclic graph.
Step 309: screen the confusion text set based on the index information to obtain a screened text set as the correction set of the target text.
In some embodiments, for the specific implementation of steps 301 and 307 to 309 and their technical effects, reference may be made to steps 201 and 203 to 205 in the embodiment corresponding to fig. 2, which are not described herein again.
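As an illustration of the screening in step 309, the sketch below computes one of the index types the disclosure names, the edit distance between a candidate and the target text, and keeps candidates under a threshold. The perplexity and path-probability indices would require a language model and are omitted; the threshold value is an assumption for the example:

```python
def edit_distance(a, b):
    """Levenshtein distance via a single-row dynamic program."""
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1,          # deletion
                                     dp[j - 1] + 1,      # insertion
                                     prev + (ca != cb))  # substitution
    return dp[-1]

def screen(confusion_texts, target_text, max_distance=2):
    """Keep confusion texts whose edit distance to the target falls
    within the threshold; a full system would also threshold the
    perplexity and probability indices."""
    return [t for t in confusion_texts
            if edit_distance(t, target_text) <= max_distance]

corrections = screen(["我的故乡在中国", "今天天气很好呀"], "我的家乡在中国")
# -> ["我的故乡在中国"]
```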
The embodiments of the present disclosure improve the diversity and comprehensiveness of the confusion text set, so that the correction set of the target text is more accurate and reliable. The factors that make existing confusion text sets insufficiently diverse and comprehensive are often as follows: such sets are typically constructed manually by technicians, and manual construction cannot comprehensively and diversely account for the many factors involved. Therefore, in the embodiments of the present disclosure, each character or word marked as an error in the labeled target text is deleted, supplemented, or replaced to quickly construct a more diverse and comprehensive confusion text set. In addition, this diverse and comprehensive confusion text set supports the construction of the subsequent directed acyclic graph.
With further reference to fig. 7, as an implementation of the methods shown in the above figures, the present disclosure provides some embodiments of a text set generating apparatus, which correspond to the method embodiments shown in fig. 2, and which may be specifically applied in various electronic devices.
As shown in fig. 7, a text set generating apparatus 700 includes: an annotation unit 701, a first construction unit 702, a second construction unit 703, a determination unit 704 and a screening unit 705. Wherein the annotation unit 701 is configured to: and carrying out error pre-labeling on the pre-acquired target text to be corrected to obtain a labeled target text. The first building unit 702 is configured to: and constructing a confusion text set related to the labeled target text, wherein each confusion text in the confusion text set is a text obtained by performing error modification on the labeled target text. The second building unit 703 is configured to: and constructing a directed acyclic graph associated with the target text according to the confusion text set, wherein each path in the directed acyclic graph represents the text subjected to word processing on the target text. The determination unit 704 is configured to: and determining each index information of each path in the directed acyclic graph, wherein each index information in each index information represents attribute feature information of a text corresponding to each path. The screening unit 705 is configured to: and screening the confusion text set based on the index information to obtain a screened text set serving as a correction set of the target text.
In some optional implementations of some embodiments, the apparatus 700 may further include: a correction unit (not shown in the figure). Wherein the correction unit is configured to: and correcting the target text according to the correction set of the target text.
In some optional implementations of some embodiments, the labeling unit 701 in the apparatus 700 may be further configured to: and inputting the target text into a pre-trained target text preprocessing model to obtain the labeled target text, wherein the target text preprocessing model is used for carrying out error prediction and labeling on each word or word in the target text.
In some optional implementations of some embodiments, the first building unit 702 in the apparatus 700 may be further configured to: adding a first number of masking characters to the associated position of each character or word marked with errors in the marked target text to generate an added text, so as to obtain an added text set; replacing each character or word with the first number of masked characters to generate a replacement text, so as to obtain a replacement text set; deleting each character or word which is marked incorrectly in the marked target text to generate a deleted text, so as to obtain a deleted text set; inputting the added text set and the replacement text set into a mask language model trained in advance to obtain a sub-confusion text set, wherein the mask language model is used for predicting words or terms corresponding to mask words in each text in the added text set and the replacement text set; and determining the sub-confusion text set and the deletion text set as the confusion text set.
In some optional implementations of some embodiments, the determining unit 704 in the apparatus 700 may be further configured to: and determining the confusion degree and the editing distance of each path in the directed acyclic graph and the probability corresponding to the path.
In some optional implementations of some embodiments, the filtering unit 705 in the apparatus 700 may be further configured to: and selecting the confusion texts with the index information meeting the preset conditions from the confusion text set to obtain the correction set of the target text.
It will be understood that the units described in the apparatus 700 correspond to the respective steps in the method described with reference to fig. 2. Thus, the operations, features, and advantages described above with respect to the method are also applicable to the apparatus 700 and the units included therein, and are not described herein again.
Referring now to fig. 8, a schematic diagram of an electronic device (e.g., the electronic device of fig. 1) 800 suitable for use in implementing some embodiments of the present disclosure is shown. The electronic device shown in fig. 8 is only an example, and should not bring any limitation to the functions and the scope of use of the embodiments of the present disclosure.
As shown in fig. 8, the electronic device 800 may include a processing apparatus (e.g., a central processing unit, a graphics processor, etc.) 801 that may perform various appropriate actions and processes in accordance with a program stored in a Read-Only Memory (ROM) 802 or a program loaded from a storage apparatus 808 into a Random Access Memory (RAM) 803. In the RAM 803, various programs and data necessary for the operation of the electronic device 800 are also stored. The processing apparatus 801, the ROM 802, and the RAM 803 are connected to each other by a bus 804. An input/output (I/O) interface 805 is also connected to the bus 804.
Generally, the following devices may be connected to the I/O interface 805: input devices 806 including, for example, a touch screen, touch pad, keyboard, mouse, camera, microphone, accelerometer, gyroscope, or the like; output devices 807 including, for example, a Liquid Crystal Display (LCD), speakers, vibrators, and the like; storage 808 including, for example, magnetic tape, hard disk, etc.; and a communication device 809. The communication means 809 may allow the electronic device 800 to communicate wirelessly or by wire with other devices to exchange data. While fig. 8 illustrates an electronic device 800 having various means, it is to be understood that not all illustrated means are required to be implemented or provided. More or fewer devices may be alternatively implemented or provided. Each block shown in fig. 8 may represent one device or may represent multiple devices as desired.
In particular, according to some embodiments of the present disclosure, the processes described above with reference to the flow diagrams may be implemented as computer software programs. For example, some embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method illustrated in the flow chart. In some such embodiments, the computer program may be downloaded and installed from a network through communications device 809, or installed from storage device 808, or installed from ROM 802. The computer program, when executed by the processing apparatus 801, performs the above-described functions defined in the methods of some embodiments of the present disclosure.
It should be noted that the computer readable medium described above in some embodiments of the present disclosure may be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In some embodiments of the disclosure, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In some embodiments of the present disclosure, however, a computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. 
Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: electrical wires, optical cables, RF (radio frequency), etc., or any suitable combination of the foregoing.
In some embodiments, the clients and servers may communicate using any currently known or future-developed network protocol, such as HTTP (HyperText Transfer Protocol), and may be interconnected with any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include a local area network ("LAN"), a wide area network ("WAN"), an internetwork (e.g., the Internet), and peer-to-peer networks (e.g., ad hoc peer-to-peer networks), as well as any currently known or future-developed network.
The computer readable medium may be embodied in the electronic device; or may be separate and not incorporated into the electronic device. The computer readable medium carries one or more programs which, when executed by the electronic device, cause the electronic device to: carrying out error pre-labeling on a pre-acquired target text to be corrected to obtain a labeled target text; constructing a confusion text set related to the labeled target text, wherein each confusion text in the confusion text set is a text obtained by performing error modification on the labeled target text; constructing a directed acyclic graph associated with the target text according to the confusion text set, wherein each path in the directed acyclic graph represents a text subjected to word processing on the target text; determining each index information of each path in the directed acyclic graph, wherein each index information in each index information represents attribute feature information of a text corresponding to each path; and screening the confusion text set based on the index information to obtain a screened text set serving as a correction set of the target text.
Computer program code for carrying out operations for embodiments of the present disclosure may be written in any combination of one or more programming languages, including an object-oriented programming language such as Java, Smalltalk, or C++, and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider).
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The units described in some embodiments of the present disclosure may be implemented by software or hardware. The described units may also be provided in a processor, which may be described as: a processor includes an annotation unit, a first construction unit, a second construction unit, a determination unit, and a screening unit. The names of the units do not form a limitation on the units themselves in some cases, for example, the labeling unit may also be described as a unit for performing error pre-labeling on a pre-acquired target text to obtain a labeled target text.
The functions described herein above may be performed, at least in part, by one or more hardware logic components. For example, without limitation, exemplary types of hardware logic components that may be used include: Field-Programmable Gate Arrays (FPGAs), Application-Specific Integrated Circuits (ASICs), Application-Specific Standard Products (ASSPs), Systems on a Chip (SOCs), Complex Programmable Logic Devices (CPLDs), and the like.
The foregoing description is only exemplary of the preferred embodiments of the disclosure and is illustrative of the principles of the technology employed. It will be appreciated by those skilled in the art that the scope of the invention in the embodiments of the present disclosure is not limited to the specific combination of the above-mentioned features, but also encompasses other embodiments in which any combination of the above-mentioned features or their equivalents is made without departing from the inventive concept as defined above. For example, the above features and (but not limited to) the features with similar functions disclosed in the embodiments of the present disclosure are mutually replaced to form the technical solution.

Claims (8)

1. A text set generation method comprises the following steps:
carrying out error pre-labeling on a pre-acquired target text to be corrected to obtain a labeled target text;
adding a first number of masking characters to the associated position of each character or word marked with errors in the marked target text to generate an added text, so as to obtain an added text set;
replacing each character or word with the first number of masked characters to generate a replacement text, so as to obtain a replacement text set;
deleting each character or word which is marked incorrectly in the marked target text to generate a deleted text, so as to obtain a deleted text set;
inputting the added text set and the replacement text set into a mask language model trained in advance to obtain a sub-confusion text set, wherein the mask language model is used for predicting words or terms corresponding to mask words in each text in the added text set and the replacement text set;
determining the sub-confusion text set and the deleted text set as the confusion text set, wherein each confusion text in the confusion text set is a text which is obtained by performing error modification on a labeled target text;
constructing a directed acyclic graph associated with the target text according to the confusion text set, wherein each path in the directed acyclic graph represents a text subjected to word processing on the target text;
determining each index information of each path in the directed acyclic graph, wherein each index information in each index information represents attribute feature information of a text corresponding to each path;
and screening the confusion text set based on the index information to obtain a screened text set serving as a correction set of the target text.
2. The method of claim 1, wherein the method further comprises:
and correcting the target text according to the correction set of the target text.
3. The method of claim 1, wherein the performing the error pre-labeling on the pre-acquired target text to be corrected to obtain the labeled target text comprises:
and inputting the target text into a pre-trained target text preprocessing model to obtain the labeled target text, wherein the target text preprocessing model is used for carrying out error prediction and labeling on each word or word in the target text.
4. The method of claim 1, wherein the determining respective metric information for each path in the directed acyclic graph comprises:
and determining the confusion degree, the edit distance and the probability corresponding to each path in the directed acyclic graph.
5. The method according to claim 1, wherein the screening the confusing text set based on the index information to obtain a screened text set as the correction set of the target text includes:
and selecting the confusion texts of which the index information meets the preset conditions from the confusion text set to obtain the correction set of the target text.
6. A text set generating apparatus comprising:
the marking unit is configured to perform error pre-marking on a pre-acquired target text to be corrected to obtain a marked target text;
the first construction unit is configured to add a first number of masking characters to the associated position of each character or word marked with errors in the marked target text to generate an added text, so as to obtain an added text set; replacing each character or word marked wrongly in the marked target text with the first number of masking characters to generate a replacement text, so as to obtain a replacement text set; deleting each character or word marked wrongly in the marked target text to generate a deleted text, so as to obtain a deleted text set; inputting the added text set and the replacement text set into a mask language model trained in advance to obtain a sub-confusion text set, wherein the mask language model is used for predicting words or terms corresponding to mask words in each text in the added text set and the replacement text set; determining the sub-confusion text set and the deleted text set as the confusion text set, wherein each confusion text in the confusion text set is a text which is obtained by performing error modification on a labeled target text;
a second construction unit, configured to construct a directed acyclic graph associated with the target text according to the obfuscated text set, where each path in the directed acyclic graph represents a text after word processing is performed on the target text;
the determining unit is configured to determine each index information of each path in the directed acyclic graph, wherein each index information in each index information represents attribute feature information of a text corresponding to each path;
and the screening unit is configured to screen the confusion text set based on the index information to obtain a screened text set as a correction set of the target text.
7. An electronic device, comprising:
one or more processors;
a storage device having one or more programs stored thereon,
the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method recited in any of claims 1-5.
8. A computer-readable medium, on which a computer program is stored, wherein the program, when executed by a processor, implements the method of any one of claims 1-5.
CN202011603012.5A 2020-12-29 2020-12-29 Text set generation method and device, electronic equipment and medium Active CN112632955B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011603012.5A CN112632955B (en) 2020-12-29 2020-12-29 Text set generation method and device, electronic equipment and medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011603012.5A CN112632955B (en) 2020-12-29 2020-12-29 Text set generation method and device, electronic equipment and medium

Publications (2)

Publication Number Publication Date
CN112632955A CN112632955A (en) 2021-04-09
CN112632955B true CN112632955B (en) 2023-02-17

Family

ID=75286737

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011603012.5A Active CN112632955B (en) 2020-12-29 2020-12-29 Text set generation method and device, electronic equipment and medium

Country Status (1)

Country Link
CN (1) CN112632955B (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110457688A (en) * 2019-07-23 2019-11-15 广州视源电子科技股份有限公司 Correction processing method and device, storage medium and processor
CN110750959A (en) * 2019-10-28 2020-02-04 腾讯科技(深圳)有限公司 Text information processing method, model training method and related device
CN111950292A (en) * 2020-06-22 2020-11-17 北京百度网讯科技有限公司 Training method of text error correction model, and text error correction processing method and device
CN112016310A (en) * 2020-09-03 2020-12-01 平安科技(深圳)有限公司 Text error correction method, system, device and readable storage medium
CN112115706A (en) * 2020-08-31 2020-12-22 北京字节跳动网络技术有限公司 Text processing method and device, electronic equipment and medium


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Implementing an Error Correction Function in a Tax Document System Based on BERT; Yuan Ye et al.; Modern Information Technology; 2020-07-10; Vol. 4, No. 13; pp. 19-21 *

Also Published As

Publication number Publication date
CN112632955A (en) 2021-04-09

Similar Documents

Publication Publication Date Title
CN111274815B (en) Method and device for mining entity focus point in text
CN110969012B (en) Text error correction method and device, storage medium and electronic equipment
CN109933217B (en) Method and device for pushing sentences
CN111159220B (en) Method and apparatus for outputting structured query statement
CN111382261B (en) Abstract generation method and device, electronic equipment and storage medium
CN111563390B (en) Text generation method and device and electronic equipment
CN114385780B (en) Program interface information recommendation method and device, electronic equipment and readable medium
CN112417902A (en) Text translation method, device, equipment and storage medium
CN112712795B (en) Labeling data determining method, labeling data determining device, labeling data determining medium and electronic equipment
CN109558600B (en) Translation processing method and device
CN111325031B (en) Resume analysis method and device
CN111104796B (en) Method and device for translation
CN112380876A (en) Translation method, device, equipment and medium based on multi-language machine translation model
CN115270717A (en) Method, device, equipment and medium for detecting vertical position
CN115909386B (en) Method, equipment and storage medium for supplementing and correcting pipeline instrument flow chart
CN112632955B (en) Text set generation method and device, electronic equipment and medium
CN112507721A (en) Method, device and equipment for generating text theme and computer readable storage medium
CN115062119B (en) Government affair event handling recommendation method and device
CN114429629A (en) Image processing method and device, readable storage medium and electronic equipment
CN109710927B (en) Named entity identification method and device, readable storage medium and electronic equipment
CN113807056A (en) Method, device and equipment for correcting error of document name sequence number
CN112906372A (en) Text simplification method, device, equipment and storage medium
CN112509581A (en) Method and device for correcting text after speech recognition, readable medium and electronic equipment
CN111221424A (en) Method, apparatus, electronic device, and computer-readable medium for generating information
CN116561346B (en) Entity alignment method and device based on graph convolution network and information fusion

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant