WO2021212612A1 - Intelligent text error correction method, device, electronic equipment, and readable storage medium - Google Patents


Publication number
WO2021212612A1
WO2021212612A1 PCT/CN2020/093557 CN2020093557W WO2021212612A1 WO 2021212612 A1 WO2021212612 A1 WO 2021212612A1 CN 2020093557 W CN2020093557 W CN 2020093557W WO 2021212612 A1 WO2021212612 A1 WO 2021212612A1
Authority
WO
WIPO (PCT)
Prior art keywords
text
vector
error correction
correction model
predicted
Prior art date
Application number
PCT/CN2020/093557
Other languages
English (en)
French (fr)
Inventor
谢静文
阮晓雯
徐亮
Original Assignee
平安科技(深圳)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 平安科技(深圳)有限公司
Publication of WO2021212612A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/232 Orthographic correction, e.g. spell checking or vowelisation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/088 Non-supervised learning, e.g. competitive learning

Definitions

  • This application relates to the field of artificial intelligence technology, and in particular to a method, device, electronic equipment, and readable storage medium for intelligent text error correction.
  • Text error correction has broad application prospects, such as intelligently correcting and prompting for cumbersome text in the medical field, improving the work efficiency of prescription printers, correcting spelling in office chat text, and preventing low-level errors.
  • The inventor realized that there are two main existing technologies for text error correction: the first is the traditional text error correction model obtained using a distance calculation method; the second is the deep learning text error correction model obtained by training on a large corpus. Both can complete text error correction to a certain extent, but the deep learning model requires a large corpus in the training phase, and corpus collection, cleaning, and subsequent training all consume considerable labor and computer resources. The traditional text error correction model has poor robustness, and in some specific scenarios, especially text in the medical field, its error correction effect is not ideal.
  • This application provides an intelligent text error correction method, device, electronic equipment, and computer readable storage medium, the main purpose of which is to solve the problem of improving the effect of text error correction without excessively consuming manual and computer resources.
  • an intelligent text error correction method provided by this application includes:
  • Obtain the text to be corrected, perform a text masking operation on the text to be corrected to obtain one or more sets of masked text, and input the masked text into the standard text error correction model to obtain the predicted text and the predicted probability value of the predicted text;
  • this application also provides an intelligent text error correction device, which includes:
  • the unsupervised training module is used to perform label calculation on the current information set to obtain the current label set according to the correspondence between the historical information set and the historical label set;
  • a supervised training module configured to perform label adjustment on the current label set according to a preset adjustment factor to obtain a standard label set
  • the predictive text module is used to extract label features from the standard label set according to the convolutional neural network feature extraction technology to obtain a feature extraction set;
  • the text error correction module is used to perform classification prediction using the feature extraction set as the input value of the trained classification neural network to obtain an information classification result.
  • an electronic device which includes:
  • Memory storing at least one instruction
  • the processor executes the instructions stored in the memory to implement the intelligent text error correction method as described below:
  • Obtain the text to be corrected, perform a text masking operation on the text to be corrected to obtain one or more sets of masked text, and input the masked text into the standard text error correction model to obtain the predicted text and the predicted probability value of the predicted text;
  • the present application also provides a computer-readable storage medium having at least one instruction stored therein, and the at least one instruction is executed by a processor in an electronic device to implement the following intelligent text error correction method:
  • Obtain the text to be corrected, perform a text masking operation on the text to be corrected to obtain one or more sets of masked text, and input the masked text into the standard text error correction model to obtain the predicted text and the predicted probability value of the predicted text;
  • This application uses unlabeled text sets to perform unsupervised training and supervised training on the pre-built original text error correction model, and predicts text through the text masking operation and the trained model. Because unsupervised training does not require large amounts of labor and computer resources for labeling and cleaning, and because the pre-built original text error correction model is based on deep learning, the text error correction ability is strong in some specific scenarios. Therefore, the intelligent text error correction method, device, electronic device, and computer-readable storage medium proposed in this application can improve the effect of text error correction without excessively consuming manual and computer resources.
  • FIG. 1 is a schematic flowchart of an intelligent text error correction method provided by an embodiment of this application
  • FIG. 2 is a detailed flowchart of step S1 in the intelligent text error correction method provided by an embodiment of this application;
  • FIG. 3 is a detailed flowchart of step S2 in the intelligent text error correction method provided by an embodiment of the application;
  • FIG. 4 is a schematic diagram of the modules of an intelligent text error correction device provided by an embodiment of this application;
  • FIG. 5 is a schematic diagram of the internal structure of an electronic device of an intelligent text error correction method provided by an embodiment of the application;
  • This application provides an intelligent text error correction method.
  • Refer to FIG. 1, which is a schematic flowchart of an intelligent text error correction method provided by an embodiment of this application.
  • the method can be executed by a device, and the device can be implemented by software and/or hardware.
  • the intelligent text error correction method includes:
  • Text error correction has a wide range of application scenarios, especially in the medical field, where complex wording leads to word errors in many medical books and prescriptions. For example, the patient Zhang Qiang suffers from seborrheic dermatitis and the doctor prescribes compound ketoconazole hair lotion, but the prescription printer mistakenly prints "compound ketoconazole hair lotion" as "compound tonconazole hair lotion". The technical solution of this application can then be used for intelligent error correction.
  • the unmarked text set is a set of unlabeled texts.
  • For example, the above compound ketoconazole hair lotion and compound tonconazole hair lotion are both unmarked texts: compound ketoconazole hair lotion is the correct written form and compound tonconazole hair lotion is the wrong written form, but neither has been given a label saying so.
  • For details of S1, refer to the detailed flowchart of the steps shown in FIG. 2, which includes:
  • S12 Perform position labeling on the unmarked text set to obtain a text position set, and convert the text position set into a position vector set according to the text vector conversion method;
  • The text vector conversion method may be, for example, one-hot word vector conversion or Word2Vec word vector conversion.
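  • As a minimal sketch of the simpler of these two options, one-hot conversion maps each distinct character to a vector containing a single 1. The function name `one_hot_encode` and the character-level vocabulary are illustrative assumptions and are not taken from the application.

```python
def one_hot_encode(text):
    """Return (vocab, vectors): one one-hot vector per character of `text`."""
    # Build the vocabulary from distinct characters, in first-seen order.
    vocab = list(dict.fromkeys(text))
    index = {ch: i for i, ch in enumerate(vocab)}
    vectors = []
    for ch in text:
        vec = [0] * len(vocab)
        vec[index[ch]] = 1  # exactly one position is set per character
        vectors.append(vec)
    return vocab, vectors
```

  • One-hot vectors grow with the vocabulary size and carry no notion of similarity between characters, which is one common reason to prefer a learned embedding such as Word2Vec.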
  • the preferred embodiment of the present application adopts Word2Vec word vector conversion, and the Word2Vec word vector conversion includes:
  • where ω represents the decision tree path value on which the Word2Vec word vector conversion depends; j represents the index into the unmarked text set and is a positive integer; ζ(ω,j) represents, under the path ω, the text vector of the j-th unmarked text in the unmarked text set or the position vector of the j-th text position in the text position set; d_j^ω indicates the Huffman code corresponding to the j-th node in the path ω, where the path ω is a positive integer; θ is the iteration factor of the Word2Vec word vector conversion; σ is the sigmoid function; and X_ω is the unmarked text set or the text position set.
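  • The quantities defined here match the terms of the standard hierarchical-softmax log-likelihood used by Word2Vec. As an assumed reconstruction, which is not necessarily the application's exact expression, that term can be written with ω for the path, θ for the iteration factor, σ for the sigmoid function, d for the Huffman code bit, and X for the input set (all symbol names are assumptions):

```latex
\zeta(\omega, j) =
  \left(1 - d_j^{\omega}\right)\,\log \sigma\!\left(X_{\omega}^{\top}\,\theta_{j-1}^{\omega}\right)
  + d_j^{\omega}\,\log\!\left(1 - \sigma\!\left(X_{\omega}^{\top}\,\theta_{j-1}^{\omega}\right)\right)
```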
  • For example, after the above vector conversion, the correctly written compound ketoconazole hair lotion is transformed into a text vector and a position vector: the text vector is, for example, [1.6,1.23,6.91,9.4,12.7,0.3,17.03,2.81,1.04], and the position vector is [0.11,1.09,3.59,0.4,0.75,2.1,5.1,2.09,3.77].
  • The original text error correction model is an improvement based on the BERT model (Bidirectional Encoder Representations from Transformers, BERT for short).
  • Inputting the position vector set and the text vector set into the original text error correction model for unsupervised training includes: dividing the text vector set with the data in the vector set as the division unit to obtain multiple groups of word vector sets; dividing the text vector set with rows as the division unit to obtain multiple groups of paragraph vector sets; calculating the weight relationship among each group of word vector sets, each group of paragraph vector sets, and the position vector set; and updating the internal parameters of the original text error correction model according to the weight relationship.
  • For example, the text vector of the above compound ketoconazole hair lotion is [1.6,1.23,6.91,9.4,12.7,0.3,17.03,2.81,1.04] and the position vector is [0.11,1.09,3.59,0.4,0.75,2.1,5.1,2.09,3.77]. If the text vector is divided by data, multiple groups of word vectors such as [0.75,2.1], [1.6,2.81,1.04], and [0.3,17.03,2.81,1.04] can be obtained. If the text vector is multi-line, for example with a first row [1.6,1.23] and a second row [6.91,9.4], then dividing by row yields two groups of paragraph vectors, [1.6,1.23] and [6.91,9.4].
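  • The two kinds of division can be sketched as follows. This is a sketch under stated assumptions: the application does not say how the data-level groups are chosen, so consecutive fixed-size chunks are used here purely for illustration, and both function names are hypothetical.

```python
def split_into_word_vectors(vector, size):
    """Split a flat vector into consecutive chunks of at most `size` values.

    Assumption: fixed-size consecutive chunks; the application does not
    specify the grouping rule for data-level division.
    """
    if size < 1:
        raise ValueError("size must be at least 1")
    return [vector[i:i + size] for i in range(0, len(vector), size)]


def split_into_paragraph_vectors(rows):
    """Treat each row of a multi-line vector as one paragraph vector."""
    return [list(row) for row in rows]
```

  • With the multi-line example above, `split_into_paragraph_vectors([[1.6, 1.23], [6.91, 9.4]])` gives the two paragraph vectors [1.6, 1.23] and [6.91, 9.4].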
  • Calculating the weight relationship among each group of word vector sets, each group of paragraph vector sets, and the position vector set includes: randomly selecting a vector from any of the word vector sets, the paragraph vector sets, and the position vector set as the target vector; applying the text masking operation to the target vector to obtain a masking vector; calculating the weights between the masking vector and the vectors in each vector set to obtain a weight set; and performing weighted fusion on the weight set to obtain the weight relationship.
  • For example, the word vector [0.3,17.03,2.81,1.04] of the above compound ketoconazole hair lotion is selected as the target vector. The text masking operation blocks arbitrary data; for example, applying it to [0.3,17.03,2.81,1.04] gives [0.3,**,2.81,**]. The weight set is then obtained by calculating the weights between [0.3,**,2.81,**] and the other word vectors, paragraph vectors, and position vectors.
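  • A minimal sketch of masking arbitrary entries of a target vector is given below. Representing a masked entry as `None`, the `seed` parameter, and the function name `mask_vector` are assumptions made for illustration; the application only shows masked entries as `**`.

```python
import random


def mask_vector(vector, num_masked, seed=None):
    """Return a copy of `vector` with `num_masked` random positions set to None."""
    if not 0 <= num_masked <= len(vector):
        raise ValueError("num_masked must be between 0 and len(vector)")
    rng = random.Random(seed)  # local generator, so global state is untouched
    positions = set(rng.sample(range(len(vector)), num_masked))
    return [None if i in positions else v for i, v in enumerate(vector)]
```

  • Calling `mask_vector([0.3, 17.03, 2.81, 1.04], 2)` hides two of the four values, which corresponds to the [0.3,**,2.81,**] pattern when the second and fourth positions happen to be chosen.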
  • A similarity calculation method can be used to calculate the weight between the masking vector and the vectors in each vector set, for example a published method such as cosine similarity or Euclidean distance.
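  • The two named measures can be sketched for equal-length vectors as follows. How masked entries and vectors of different lengths are handled is not stated in the application, so this sketch assumes the inputs have already been reduced to plain numeric vectors of the same length.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length numeric vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_a * norm_b)


def euclidean_distance(a, b):
    """Straight-line distance between two equal-length numeric vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
```

  • Note that cosine similarity increases as vectors become more alike, while Euclidean distance decreases, so a distance would need to be inverted or otherwise transformed before being used as a weight.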
  • The weighted fusion can use a fusion method of Gaussian distribution form, a linear method (such as a linear function), or a nonlinear method (such as a quadratic function).
  • For example, if the weight set is [0.101,3.091,2.057,0.4,0.756,2.71,5.103, ], fusion with a linear function yields the k value and b value of the linear function, and the k value and b value are then used as internal parameters of the original text error correction model.
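  • One way to obtain a k value and b value from a weight set is an ordinary least-squares fit of a line. The application does not state what the weights are fitted against, so the sketch below assumes the position of each weight in the set is the independent variable; the function name `fit_linear` is hypothetical.

```python
def fit_linear(weights):
    """Least-squares fit of weight = k * index + b; returns (k, b).

    Assumption: the independent variable is the index of each weight.
    """
    n = len(weights)
    if n < 2:
        raise ValueError("need at least two weights to fit a line")
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(weights) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, weights))
    k = sxy / sxx
    b = mean_y - k * mean_x
    return k, b
```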
  • the marked text set corresponds to the unmarked text set, and the marked text set is a labeled text set.
  • For example, the compound ketoconazole hair lotion, the compound tonconazole hair lotion, and so on can be unmarked text. Although compound tonconazole hair lotion is the wrong written form, once the text has been marked, compound ketoconazole hair lotion carries the label for correct writing, and the wrongly written text is generally not used.
  • The supervised training has the same basic form as the unsupervised training.
  • For the supervised training of the primary text error correction model using the marked text set to obtain the standard text error correction model, refer to the detailed flowchart of step S2 shown in FIG. 3, which includes:
  • For example, the compound tonconazole hair lotion described above is the text to be corrected, and the text masking operation can produce masked texts such as "Compound **Zole Hair Lotion", "*Fang Tongkang* Hair Lotion", "Compound Tongconazole Hair Lotion", etc.
  • Inputting the masked text into the standard text error correction model to obtain the predicted text and the predicted probability value of the predicted text includes: converting the masked text into a masked vector according to the text vector conversion method, and inputting the masked vector into the standard text error correction model to obtain the predicted text and the predicted probability value of the predicted text.
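  • Generating one or more sets of masked text from the text to be corrected can be sketched as a sliding mask. The mask character, the `span` parameter, and the function name `generate_masked_texts` are illustrative assumptions; the application does not fix how many characters are masked or which positions are chosen.

```python
def generate_masked_texts(text, mask_char="*", span=1):
    """Return every variant of `text` with `span` consecutive characters masked."""
    if span < 1 or span > len(text):
        raise ValueError("span must be between 1 and len(text)")
    return [
        text[:i] + mask_char * span + text[i + span:]
        for i in range(len(text) - span + 1)
    ]
```

  • Each returned variant would then be converted to a masked vector and passed to the trained model, which predicts the hidden characters and a probability for the prediction.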
  • If the predicted text is not the same as the text to be corrected, determine whether the predicted probability value is greater than the preset probability value. If the predicted probability value is less than the preset probability value, there is no need to perform text error correction on the text to be corrected, and a text to be corrected is received again.
  • When the predicted probability value is less than the preset probability value, this application considers that the accuracy of the predicted text does not meet the requirements, and therefore does not perform text error correction on the text to be corrected.
  • the predicted text "compound ketoconazole hair lotion” is used to replace the to-be-corrected text "compound tonconazole hair lotion” to complete the text error correction.
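  • The decision rule described above can be sketched as a small function. The function name `apply_correction` and the parameter names are hypothetical. The application specifies the "greater than" and "less than" cases only, so treating a probability exactly equal to the preset value as "do not correct" is an assumption of this sketch.

```python
def apply_correction(text, predicted_text, predicted_prob, preset_prob):
    """Return the corrected text, or the original when no correction applies."""
    # Same text: nothing to correct.
    if predicted_text == text:
        return text
    # Different text and a confident prediction: replace with the prediction.
    if predicted_prob > preset_prob:
        return predicted_text
    # Different text but low confidence (or equal, by assumption): keep original.
    return text
```

  • For instance, `apply_correction("compound tonconazole hair lotion", "compound ketoconazole hair lotion", 0.97, 0.9)` returns the predicted text, while the same call with a predicted probability of 0.5 leaves the input unchanged; the values 0.97, 0.9, and 0.5 are made-up illustrations.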
  • the above-mentioned text and text collection can also be stored in a node of a blockchain.
  • This solution can be applied to sub-fields such as smart healthcare and smart education in the smart city field, thereby promoting the construction of smart cities.
  • Refer to FIG. 4, which is a functional block diagram of the intelligent text error correction device of the present application.
  • the intelligent text error correction device 100 described in this application can be installed in an electronic device.
  • the intelligent text error correction device may include an unsupervised training module 101, a supervised training module 102, a predictive text module 103, and a text error correction module 104.
  • the module described in the present invention can also be called a unit, which refers to a series of computer program segments that can be executed by the processor of an electronic device and can complete fixed functions, and are stored in the memory of the electronic device.
  • each module/unit is as follows:
  • the unsupervised training module 101 is configured to perform label calculation on the current information set to obtain the current label set according to the correspondence between the historical information set and the historical label set;
  • the supervised training module 102 is configured to perform label adjustment on the current label set according to a preset adjustment factor to obtain a standard label set;
  • the predictive text module 103 is configured to extract label features from the standard label set according to the convolutional neural network feature extraction technology to obtain a feature extraction set;
  • the text error correction module 104 is configured to use the feature extraction set as the input value of the trained classification neural network to perform classification prediction to obtain an information classification result.
  • Refer to FIG. 5, which is a schematic structural diagram of an electronic device implementing an intelligent text error correction method according to the present application.
  • the electronic device 1 may include a processor 10, a memory 11, and a bus, and may also include a computer program stored in the memory 11 and running on the processor 10, such as an intelligent text error correction program 12.
  • the memory 11 includes at least one type of readable storage medium, the readable storage medium includes flash memory, mobile hard disk, multimedia card, card-type memory (for example: SD or DX memory, etc.), magnetic memory, magnetic disk, CD etc.
  • the memory 11 may be an internal storage unit of the electronic device 1 in some embodiments, for example, a mobile hard disk of the electronic device 1.
  • the memory 11 may also be an external storage device of the electronic device 1, such as a plug-in mobile hard disk, a smart media card (SMC), a Secure Digital (SD) card, or a flash card equipped on the electronic device 1.
  • the memory 11 may also include both an internal storage unit of the electronic device 1 and an external storage device.
  • the memory 11 can be used not only to store application software and various data installed in the electronic device 1, such as intelligent text error correction codes, etc., but also to temporarily store data that has been output or will be output.
  • the readable storage medium may be non-volatile or volatile.
  • In some embodiments, the processor 10 may be composed of integrated circuits, for example a single packaged integrated circuit or multiple packaged integrated circuits with the same or different functions, including one or more central processing units (CPUs), microprocessors, digital processing chips, graphics processors, and combinations of various control chips.
  • The processor 10 is the control unit of the electronic device. It uses various interfaces and lines to connect the components of the entire electronic device, runs or executes programs or modules stored in the memory 11 (such as the intelligent text error correction program), and calls data stored in the memory 11 to execute various functions of the electronic device 1 and process data.
  • the bus may be a peripheral component interconnection standard (PCI) bus or an extended industry standard architecture (EISA) bus or the like.
  • the bus can be divided into address bus, data bus, control bus and so on.
  • the bus is configured to implement connection and communication between the memory 11 and at least one processor 10 and the like.
  • FIG. 5 only shows an electronic device with certain components. Those skilled in the art can understand that the structure shown in FIG. 5 does not constitute a limitation on the electronic device 1, which may include fewer or more components than shown in the figure, a combination of certain components, or a different arrangement of components.
  • the electronic device 1 may also include a power source (such as a battery) for supplying power to various components.
  • the power source may be logically connected to the at least one processor 10 through a power management device, so that functions such as charge management, discharge management, and power consumption management are implemented through the power management device.
  • the power supply may also include any components such as one or more DC or AC power supplies, recharging devices, power failure detection circuits, power converters or inverters, and power status indicators.
  • the electronic device 1 may also include a variety of sensors, Bluetooth modules, Wi-Fi modules, etc., which will not be repeated here.
  • the electronic device 1 may also include a network interface.
  • the network interface may include a wired interface and/or a wireless interface (such as a Wi-Fi interface or a Bluetooth interface), and is usually used to establish a communication connection between the electronic device 1 and other electronic devices.
  • the electronic device 1 may also include a user interface.
  • the user interface may be a display (Display) and an input unit (such as a keyboard (Keyboard)).
  • the user interface may also be a standard wired interface or a wireless interface.
  • the display may be an LED display, a liquid crystal display, a touch-sensitive liquid crystal display, an OLED (Organic Light-Emitting Diode, organic light-emitting diode) touch device, etc.
  • the display can also be appropriately called a display screen or a display unit, which is used to display the information processed in the electronic device 1 and to display a visualized user interface.
  • The intelligent text error correction program 12 stored in the memory 11 of the electronic device 1 is a combination of multiple instructions. When run by the processor 10, it can implement:
  • Unsupervised training is performed on the pre-built original text error correction model using the unlabeled text set, and the primary text error correction model is obtained.
  • the marked text set is used to supervise and train the primary text error correction model to obtain a standard text error correction model.
  • Obtain the text to be corrected, perform a text masking operation on the text to be corrected to obtain one or more sets of masked text, and input the masked text into the standard text error correction model to obtain the predicted text and the predicted probability value of the predicted text.
  • Step 1: Use the unlabeled text set to perform unsupervised training on the pre-built original text error correction model to obtain the primary text error correction model.
  • the first step includes:
  • the position vector set and the text vector set are input into the original text error correction model for unsupervised training until the number of training times of the unsupervised training meets a preset training requirement, and the training is exited to obtain a primary text error correction model.
  • Step 2: Use the marked text set to perform supervised training on the primary text error correction model to obtain a standard text error correction model.
  • The marked text set corresponds to the unmarked text set, and the marked text set is a labeled text set.
  • For example, compound ketoconazole hair lotion, compound tonconazole hair lotion, and so on can be unmarked text. Although compound tonconazole hair lotion is the wrong written form, once the text has been marked, compound ketoconazole hair lotion carries the label for correct writing, and the wrongly written text is generally not used.
  • The supervised training has the same basic form as the unsupervised training.
  • The supervised training of the primary text error correction model using the marked text set to obtain a standard text error correction model includes:
  • According to the text vector conversion method, converting the marked text set into a marked text vector set;
  • Step 3: Obtain the text to be corrected, perform a text masking operation on the text to be corrected to obtain one or more sets of masked text, and input the masked text into the standard text error correction model to obtain the predicted text and the predicted probability value of the predicted text.
  • For example, the compound tonconazole hair lotion described above is the text to be corrected, and the text masking operation can produce masked texts such as "Compound **Zole Hair Lotion", "*Fang Tongkang* Hair Lotion", "Compound Tongconazole Hair Lotion", etc.
  • Inputting the masked text into the standard text error correction model to obtain the predicted text and the predicted probability value of the predicted text includes: converting the masked text into a masked vector according to the text vector conversion method, and inputting the masked vector into the standard text error correction model to obtain the predicted text and the predicted probability value of the predicted text.
  • Step 4: Determine whether the predicted text is the same as the text to be corrected.
  • Step 5: If the predicted text is the same as the text to be corrected, there is no need to perform text error correction on the text to be corrected, and a text to be corrected is received again.
  • Step 6: If the predicted text is not the same as the text to be corrected, determine whether the predicted probability value is greater than the preset probability value. If the predicted probability value is less than the preset probability value, there is no need to perform text error correction on the text to be corrected, and a text to be corrected is received again.
  • When the predicted probability value is less than the preset probability value, this application considers that the accuracy of the predicted text does not meet the requirements, and therefore does not perform text error correction on the text to be corrected.
  • Step 7: If the predicted probability value is greater than the preset probability value, perform text error correction on the text to be corrected according to the predicted text.
  • For example, the predicted text "compound ketoconazole hair lotion" is used to replace the text to be corrected, "compound tonconazole hair lotion", to complete the text error correction.
  • if the integrated module/unit of the electronic device 1 is implemented in the form of a software functional unit and sold or used as an independent product, it can be stored in a non-volatile computer-readable storage medium.
  • the computer-readable medium may include: any entity or device capable of carrying the computer program code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disk, computer memory, or read-only memory (ROM, Read-Only Memory).
  • modules described as separate components may or may not be physically separated, and the components displayed as modules may or may not be physical units, that is, they may be located in one place, or they may be distributed on multiple network units. Some or all of the modules can be selected according to actual needs to achieve the objectives of the solutions of the embodiments.
  • the functional modules in the various embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units may be integrated into one unit.
  • the above-mentioned integrated unit may be implemented in the form of hardware, or may be implemented in the form of hardware plus software functional modules.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Machine Translation (AREA)

Abstract

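An intelligent text error correction method and apparatus, an electronic device, and a computer-readable storage medium, relating to artificial intelligence technology. The method includes: performing unsupervised training on an original text error correction model using an unlabeled text set to obtain a primary text error correction model; performing supervised training on the primary text error correction model using a labeled text set to obtain a standard text error correction model; performing a text masking operation on the text to be corrected to obtain masked text; inputting the masked text into the standard text error correction model to obtain predicted text and a predicted probability value of the predicted text; and, when the predicted text differs from the text to be corrected and the predicted probability value is greater than a preset probability value, correcting the text to be corrected according to the predicted text. This can solve the problem of improving text error correction performance without excessive consumption of human and computer resources. The application also relates to blockchain technology: the text and text sets may be stored in a blockchain.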

Description

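Intelligent text error correction method, apparatus, electronic device, and readable storage medium
This application claims priority to Chinese patent application No. 202010329725.0, filed with the China Patent Office on April 23, 2020 and entitled "Intelligent text error correction method, apparatus, electronic device, and readable storage medium", the entire contents of which are incorporated herein by reference.
Technical Field
This application relates to the field of artificial intelligence technology, and in particular to an intelligent text error correction method, apparatus, electronic device, and readable storage medium.
Background
Text error correction has broad application prospects, for example intelligently correcting and flagging complicated wording in the medical field to improve the efficiency of prescription printing clerks, or correcting typed text during office chat to prevent careless mistakes.
The inventors realized that existing text error correction techniques are mainly of two kinds: first, traditional text error correction models obtained with distance calculation methods; second, deep learning text error correction models trained on large corpora. Both can accomplish text error correction to some extent. However, a deep learning model needs a large corpus in the training stage, and corpus collection, cleaning, and the subsequent training all consume considerable human and computer resources. A traditional model is less robust, and in certain specific scenarios, especially medical text, its error correction capability is weak and its results are unsatisfactory.
Summary
This application provides an intelligent text error correction method, apparatus, electronic device, and computer-readable storage medium, whose main purpose is to improve text error correction performance without excessive consumption of human and computer resources.
To achieve the above purpose, an intelligent text error correction method provided by this application includes:
performing unsupervised training on a pre-built original text error correction model using an unlabeled text set to obtain a primary text error correction model;
performing supervised training on the primary text error correction model using a labeled text set to obtain a standard text error correction model;
obtaining text to be corrected, performing a text masking operation on the text to be corrected to obtain one or more sets of masked text, and inputting the masked text into the standard text error correction model to obtain predicted text and a predicted probability value of the predicted text;
when the predicted text differs from the text to be corrected and the predicted probability value is greater than the preset probability value, performing text error correction on the text to be corrected according to the predicted text.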
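To solve the above problem, this application further provides an intelligent text error correction apparatus, the apparatus including:
an unsupervised training module, configured to perform label calculation on a current information set according to the correspondence between a historical information set and a historical label set, to obtain a current label set;
a supervised training module, configured to perform label adjustment on the current label set according to a preset adjustment factor, to obtain a standard label set;
a predicted text module, configured to extract label features from the standard label set according to a convolutional neural network feature extraction technique, to obtain a feature extraction set;
a text error correction module, configured to use the feature extraction set as the input of a trained classification neural network for classification prediction, to obtain an information classification result.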
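To solve the above problem, this application further provides an electronic device, the electronic device including:
a memory storing at least one instruction; and
a processor that executes the instructions stored in the memory to implement the following intelligent text error correction method:
performing unsupervised training on a pre-built original text error correction model using an unlabeled text set to obtain a primary text error correction model;
performing supervised training on the primary text error correction model using a labeled text set to obtain a standard text error correction model;
obtaining text to be corrected, performing a text masking operation on the text to be corrected to obtain one or more sets of masked text, and inputting the masked text into the standard text error correction model to obtain predicted text and a predicted probability value of the predicted text;
when the predicted text differs from the text to be corrected and the predicted probability value is greater than the preset probability value, performing text error correction on the text to be corrected according to the predicted text.
To solve the above problem, this application further provides a computer-readable storage medium storing at least one instruction, the at least one instruction being executed by a processor in an electronic device to implement the following intelligent text error correction method:
performing unsupervised training on a pre-built original text error correction model using an unlabeled text set to obtain a primary text error correction model;
performing supervised training on the primary text error correction model using a labeled text set to obtain a standard text error correction model;
obtaining text to be corrected, performing a text masking operation on the text to be corrected to obtain one or more sets of masked text, and inputting the masked text into the standard text error correction model to obtain predicted text and a predicted probability value of the predicted text;
when the predicted text differs from the text to be corrected and the predicted probability value is greater than the preset probability value, performing text error correction on the text to be corrected according to the predicted text.
This application performs unsupervised training and supervised training on the pre-built original text error correction model, using the unlabeled text set and the labeled text set respectively, and predicts text through the text masking operation and the trained model. Unsupervised training does not require large amounts of human and computer resources for labeling and cleaning, and because the pre-built original text error correction model is based on deep learning, its error correction capability is strong in certain specific scenarios. The intelligent text error correction method, apparatus, electronic device, and computer-readable storage medium proposed in this application can therefore improve text error correction performance without excessive consumption of human and computer resources.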
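Brief Description of the Drawings
FIG. 1 is a schematic flowchart of the intelligent text error correction method provided by an embodiment of this application;
FIG. 2 is a detailed schematic flowchart of step S1 of the intelligent text error correction method provided by an embodiment of this application;
FIG. 3 is a detailed schematic flowchart of step S2 of the intelligent text error correction method provided by an embodiment of this application;
FIG. 4 is a schematic module diagram of the intelligent text error correction apparatus provided by an embodiment of this application;
FIG. 5 is a schematic diagram of the internal structure of an electronic device implementing the intelligent text error correction method provided by an embodiment of this application.
The realization of the purpose, the functional features, and the advantages of this application will be further described with reference to the embodiments and the accompanying drawings.
Detailed Description
It should be understood that the specific embodiments described here are only intended to explain this application and are not intended to limit it.
This application provides an intelligent text error correction method. FIG. 1 is a schematic flowchart of the intelligent text error correction method provided by an embodiment of this application. The method may be executed by an apparatus, and the apparatus may be implemented in software and/or hardware.
In this embodiment, the intelligent text error correction method includes: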
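S1. Perform unsupervised training on a pre-built original text error correction model using an unlabeled text set to obtain a primary text error correction model.
Text error correction has a wide range of application scenarios, especially in the medical field, where complex terminology leads to wording errors in many medical books and prescriptions. For example, patient Zhang Qiang has seborrheic dermatitis and the doctor prescribes compound ketoconazole hair lotion (复方酮康唑发用洗剂), but the prescription printing clerk mistakenly prints it as compound tonconazole hair lotion (复方桐康唑发用洗剂). In this case the technical solution of this application can perform intelligent error correction.
The unlabeled text set is a text set to which no labels have been added. For example, the above compound ketoconazole hair lotion and compound tonconazole hair lotion are both unlabeled text: compound ketoconazole hair lotion is the correct written form and compound tonconazole hair lotion is the incorrect one, but neither carries a label stating whether it is written correctly.
In detail, as shown in the detailed flowchart in FIG. 2, S1 includes:
S11. According to a pre-built text vector conversion method, convert the unlabeled text set into a text vector set;
S12. Perform position labeling on the unlabeled text set to obtain a text position set, and, according to the text vector conversion method, convert the text position set into a position vector set;
S13. Input the position vector set and the text vector set into the original text error correction model for unsupervised training until the number of training iterations meets a preset training requirement, then exit training to obtain the primary text error correction model.
Further, the text vector conversion method may be, for example, one-hot word vector conversion or Word2Vec word vector conversion.
The preferred embodiment of this application uses Word2Vec word vector conversion, which includes:
converting the unlabeled text set or the text position set into vectors as follows:
Figure PCTCN2020093557-appb-000001
where ω denotes the decision-tree path value on which the Word2Vec word vector conversion depends; j denotes the index into the unlabeled text set and is a positive integer; ζ(ω,j) denotes, under path ω, the text vector of the j-th unlabeled text in the unlabeled text set or the position vector of the j-th text position in the text position set;
Figure PCTCN2020093557-appb-000002
denotes the Huffman code corresponding to the j-th node within path ω; the path ω is a positive integer; θ is the iteration factor of the Word2Vec word vector conversion; σ denotes the sigmoid function; and X_ω is the unlabeled text set or the text position set.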
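The conversion formula itself survives here only as an image reference (Figure PCTCN2020093557-appb-000001), so its exact form cannot be recovered from this text. As a hedged reading only, the symbols defined around it (a tree path ω, a per-node Huffman code, a factor θ, and the sigmoid σ) match the node probability used in the hierarchical-softmax variant of Word2Vec. Assuming that is what the figure shows, and writing the Huffman code of the j-th node as d_j^ω (a symbol name assumed here, since the original symbol is also an image), the expression would be:

```latex
\zeta(\omega, j) =
  \left[\sigma\!\left(X_\omega^{\top}\theta\right)\right]^{\,1-d_j^{\omega}}
  \cdot
  \left[1-\sigma\!\left(X_\omega^{\top}\theta\right)\right]^{\,d_j^{\omega}},
\qquad
\sigma(z) = \frac{1}{1+e^{-z}}
```

With d_j^ω ∈ {0, 1}, exactly one of the two factors is active at each node, so the expression reduces to σ(X_ω^T θ) when the code is 0 and to 1 − σ(X_ω^T θ) when it is 1.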
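For example, the correctly written compound ketoconazole hair lotion (复方酮康唑发用洗剂) is converted by the above vector conversion into a text vector and a position vector, where the text vector is, for example, [1.6,1.23,6.91,9.4,12.7,0.3,17.03,2.81,1.04] and the position vector is [0.11,1.09,3.59,0.4,0.75,2.1,5.1,2.09,3.77].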
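To make the conversion step concrete, the following minimal sketch uses the simpler one-hot option that the text names as an alternative to Word2Vec. The helper names `one_hot_encode` and `position_labels` are hypothetical, and the character-level toy vocabulary is an assumption made for illustration; the patent does not specify either.

```python
def one_hot_encode(text, vocabulary):
    """Return one one-hot row per character of `text` (hypothetical helper)."""
    index = {ch: i for i, ch in enumerate(vocabulary)}
    rows = []
    for ch in text:
        row = [0] * len(vocabulary)
        row[index[ch]] = 1  # raises KeyError for out-of-vocabulary characters
        rows.append(row)
    return rows


def position_labels(text):
    """Label each character with its 0-based position (assumed labeling scheme)."""
    return list(range(len(text)))


drug_name = "复方酮康唑发用洗剂"
vocabulary = sorted(set(drug_name))  # toy vocabulary: the distinct characters
text_vectors = one_hot_encode(drug_name, vocabulary)
positions = position_labels(drug_name)
```

One-hot rows carry no similarity information between characters, which is one plausible reason the preferred embodiment uses Word2Vec instead.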
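Preferably, the original text error correction model is obtained by improving on the BERT model (Bidirectional Encoder Representations from Transformers, BERT for short).
In detail, inputting the position vector set and the text vector set into the original text error correction model for unsupervised training includes: dividing the text vector set into multiple word vector sets, using the data within the vector set as the unit of division; dividing the text vector set into multiple paragraph vector sets, using rows as the unit of division; calculating the weight relationship among each word vector set, each paragraph vector set, and the position vector set; and updating the internal parameters of the original text error correction model according to the weight relationship.
For example, the text vector of the above compound ketoconazole hair lotion is [1.6,1.23,6.91,9.4,12.7,0.3,17.03,2.81,1.04] and its position vector is [0.11,1.09,3.59,0.4,0.75,2.1,5.1,2.09,3.77]. Dividing the text vector with data as the unit yields multiple word vector sets of forms such as [0.75,2.1], [1.6,2.81,1.04], and [0.3,17.03,2.81,1.04]. If the text vector of the above compound ketoconazole hair lotion has multiple rows, for example in the form
Figure PCTCN2020093557-appb-000003
then dividing by row yields the two paragraph vector sets [1.6,1.23] and [6.91,9.4].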
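The two kinds of division can be sketched as follows. The text does not say how data items are grouped into word vectors, so the contiguous-window rule in `contiguous_groups` is an assumption, and the two-row matrix is inferred from the stated result rather than read from the missing figure. Both function names are hypothetical.

```python
def contiguous_groups(vector, size):
    """Word-level split: every contiguous run of `size` values (assumed rule)."""
    return [vector[start:start + size] for start in range(len(vector) - size + 1)]


def split_into_rows(matrix):
    """Paragraph-level split: one group per row of a multi-row text vector."""
    return [list(row) for row in matrix]


text_vector = [1.6, 1.23, 6.91, 9.4, 12.7, 0.3, 17.03, 2.81, 1.04]
word_groups = contiguous_groups(text_vector, 4)

# Two-row form assumed from the result quoted in the text.
paragraph_groups = split_into_rows([[1.6, 1.23], [6.91, 9.4]])
```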
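Further, calculating the weight relationship among each word vector set, each paragraph vector set, and the position vector set includes: randomly selecting a vector from any one of the word vector sets, the paragraph vector sets, and the position vector set as a target vector; applying the text masking operation to the target vector to obtain a masked vector; calculating the weights between the masked vector and the vectors in each vector set to obtain a weight set; and performing weighted fusion on the weight set to obtain the weight relationship.
For example, the word vector [0.3,17.03,2.81,1.04] of compound ketoconazole hair lotion is selected as the target vector. The text masking operation hides arbitrary data items; for example, masking [0.3,17.03,2.81,1.04] gives [0.3,**,2.81,**]. The weight set is then obtained by calculating the weights between [0.3,**,2.81,**] and the other word vectors, paragraph vectors, and position vectors.
In detail, the weight between the masked vector and the vectors in each vector set may be calculated with a similarity calculation method, such as the publicly known cosine method or the Euclidean distance method.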
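The two similarity measures named above can be sketched in plain Python. The text does not say how masked entries enter the calculation; the sketch assumes they are represented as `None` and simply skipped, which is an illustrative choice and not the patent's stated method.

```python
import math


def _visible_pairs(a, b):
    """Keep only positions where neither vector is masked (None)."""
    return [(x, y) for x, y in zip(a, b) if x is not None and y is not None]


def cosine_similarity(a, b):
    pairs = _visible_pairs(a, b)
    dot = sum(x * y for x, y in pairs)
    norm_a = math.sqrt(sum(x * x for x, _ in pairs))
    norm_b = math.sqrt(sum(y * y for _, y in pairs))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here for zero-length vectors
    return dot / (norm_a * norm_b)


def euclidean_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in _visible_pairs(a, b)))


masked = [0.3, None, 2.81, None]       # the masked target vector from the text
other = [0.3, 17.03, 2.81, 1.04]
weight = cosine_similarity(masked, other)
```

Note the two measures point in opposite directions: a larger cosine value means more similar, while a larger Euclidean distance means less similar, so a distance would need to be inverted or negated before being used as a weight.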
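The weighted fusion may use a Gaussian-distribution fusion method, a linear method (such as a first-degree function), or a nonlinear method (such as a quadratic function). For example, if the weight set is [0.101,3.091,2.057,0.4,0.756,2.71,5.103], fusing with a first-degree function yields its k value and b value, which are then used as internal parameters of the original text error correction model.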
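The text does not define how a first-degree function is fitted to the weight set. One plausible reading, sketched here as an assumption, is an ordinary least-squares fit of y = k·x + b with the index of each weight as x. The function name `fit_line` is hypothetical.

```python
def fit_line(weights):
    """Least-squares fit of y = k*x + b over (index, weight) pairs (assumed reading)."""
    n = len(weights)
    if n < 2:
        raise ValueError("need at least two weights to fit a line")
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(weights) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, weights))
    k = sxy / sxx
    b = mean_y - k * mean_x
    return k, b


weight_set = [0.101, 3.091, 2.057, 0.4, 0.756, 2.71, 5.103]
k, b = fit_line(weight_set)  # k and b would then serve as model parameters
```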
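S2. Perform supervised training on the primary text error correction model using a labeled text set to obtain a standard text error correction model.
The labeled text set corresponds to the unlabeled text set; it is a text set to which labels have been added. As described in S1, compound ketoconazole hair lotion, compound tonconazole hair lotion, etc. can be unlabeled text. Although compound tonconazole hair lotion is a misspelling, in the labeled text set compound ketoconazole hair lotion carries a label marking it as correctly written, and incorrectly written text is generally not used.
The basic form of the supervised training is the same as that of the unsupervised training. In detail, as shown in the detailed flowchart of step S2 in FIG. 3, performing supervised training on the primary text error correction model using the labeled text set to obtain the standard text error correction model includes:
S21. Extract the labels of the labeled text from the labeled text set to obtain a true label set;
S22. According to the text vector conversion method, convert the labeled text set into a labeled text vector set;
S23. Input the labeled text vector set into the primary text error correction model for supervised training to obtain a predicted label set;
S24. Determine whether the error between the predicted label set and the true label set is greater than a preset error. If it is greater than the preset error, continue the supervised training until the error between the predicted label set and the true label set is less than the preset error, then exit the supervised training to obtain the standard text error correction model.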
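The stopping rule in S24 can be sketched as a generic loop. The callables `train_step` and `evaluate_error`, the `max_rounds` safety limit, and the toy error that halves each round are all assumptions for illustration; the patent does not describe the optimizer or the error metric.

```python
def train_until_converged(train_step, evaluate_error, preset_error, max_rounds=1000):
    """Repeat training until the label error drops below `preset_error`."""
    for round_no in range(1, max_rounds + 1):
        train_step()                  # S23: one pass of supervised training
        error = evaluate_error()      # S24: compare predicted and true labels
        if error < preset_error:
            return round_no, error    # exit training: model is "standard"
    raise RuntimeError("error did not fall below the preset error")


# Toy stand-in for a model: the error halves on every training round.
state = {"error": 1.0}

def toy_step():
    state["error"] /= 2

def toy_error():
    return state["error"]

rounds, final_error = train_until_converged(toy_step, toy_error, preset_error=0.1)
```

The `max_rounds` guard is a design choice added here so the loop cannot run forever if the error never reaches the preset value.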
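S3. Obtain the text to be corrected, perform a text masking operation on the text to be corrected to obtain one or more sets of masked text, and input the masked text into the standard text error correction model to obtain predicted text and a predicted probability value of the predicted text.
As above, the prescription printing clerk mistakenly printed compound ketoconazole hair lotion as compound tonconazole hair lotion (复方桐康唑发用洗剂), so compound tonconazole hair lotion is the text to be corrected. Performing the text masking operation on “复方桐康唑发用洗剂” can yield masked texts such as “复方**唑发用洗剂”, “*方桐康*发用洗剂”, and “复方桐康唑发用**”, where each * hides one character.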
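The masking operation can be sketched at character level. The examples in the text hide both adjacent and non-adjacent characters, so `mask_text` accepts arbitrary positions; `sliding_masks` is one hypothetical way to enumerate candidates and is not prescribed by the patent.

```python
def mask_text(text, positions, mask_char="*"):
    """Replace the characters at the given 0-based positions with `mask_char`."""
    hidden = set(positions)
    return "".join(mask_char if i in hidden else ch for i, ch in enumerate(text))


def sliding_masks(text, width=2):
    """Enumerate masked variants by sliding a window of `width` characters."""
    return [mask_text(text, range(start, start + width))
            for start in range(len(text) - width + 1)]


to_correct = "复方桐康唑发用洗剂"
examples = [
    mask_text(to_correct, [2, 3]),   # hides the suspect characters 桐康
    mask_text(to_correct, [0, 4]),   # hides two non-adjacent characters
    mask_text(to_correct, [7, 8]),   # hides the final two characters
]
candidates = sliding_masks(to_correct, width=2)
```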
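In detail, inputting the masked text into the standard text error correction model to obtain the predicted text and the predicted probability value of the predicted text includes: according to the text vector conversion method, converting the masked text into a masked vector, and inputting the masked vector into the standard text error correction model to obtain the predicted text and the predicted probability value of the predicted text.
S4. Determine whether the predicted text is the same as the text to be corrected.
For example, when prediction is performed on “复方**唑发用洗剂” and the resulting predicted text is "compound ketoconazole hair lotion" (复方酮康唑发用洗剂), it is determined whether the predicted text "compound ketoconazole hair lotion" is the same as the text to be corrected, "compound tonconazole hair lotion" (复方桐康唑发用洗剂).
S5. If the predicted text is the same as the text to be corrected, no text error correction is needed for the text to be corrected, and new text to be corrected is received.
If the predicted text "compound tonconazole hair lotion" is the same as the text to be corrected, "compound tonconazole hair lotion", this shows that no mistake by the prescription printing clerk was found.
S6. If the predicted text is not the same as the text to be corrected, determine whether the predicted probability value is greater than a preset probability value. If the predicted probability value is less than the preset probability value, no text error correction is performed on the text to be corrected, and new text to be corrected is received.
Suppose the predicted text "compound ketoconazole hair lotion" differs from the text to be corrected, "compound tonconazole hair lotion", and the predicted probability of the predicted text "compound ketoconazole hair lotion" is 97%.
If the preset probability value is 99%, this application considers that the accuracy of the predicted text does not meet the requirement, and therefore does not perform text error correction on the text to be corrected.
S7. If the predicted probability value is greater than the preset probability value, perform text error correction on the text to be corrected according to the predicted text.
If the preset probability value is 96%, the predicted text "compound ketoconazole hair lotion" replaces the text to be corrected, "compound tonconazole hair lotion", completing the text error correction.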
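The decision in S4 to S7 reduces to two comparisons, sketched below with the 97% example from the text. The function name `apply_correction` is hypothetical, and treating a probability exactly equal to the threshold as "do not correct" is an assumption, since the text only covers the greater-than and less-than cases.

```python
def apply_correction(original, predicted, probability, threshold):
    """Return (resulting text, whether a correction was applied)."""
    if predicted == original:      # S4/S5: nothing to correct
        return original, False
    if probability > threshold:    # S7: prediction is trusted
        return predicted, True
    return original, False         # S6: prediction not confident enough


original = "复方桐康唑发用洗剂"
predicted = "复方酮康唑发用洗剂"

strict = apply_correction(original, predicted, probability=0.97, threshold=0.99)
lenient = apply_correction(original, predicted, probability=0.97, threshold=0.96)
```

With the 99% threshold the text is left unchanged, and with the 96% threshold the predicted text replaces it, mirroring the two cases in the description.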
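It should be emphasized that, to further ensure the privacy and security of the above data, the above text and text sets may also be stored in a node of a blockchain.
This solution can be applied to subfields of the smart city domain such as smart healthcare and smart education, thereby promoting the construction of smart cities.
FIG. 4 is a functional module diagram of the intelligent text error correction apparatus of this application.
The intelligent text error correction apparatus 100 of this application may be installed in an electronic device. According to the functions implemented, the intelligent text error correction apparatus may include an unsupervised training module 101, a supervised training module 102, a predicted text module 103, and a text error correction module 104. A module in this application, which may also be called a unit, refers to a series of computer program segments that can be executed by the processor of the electronic device and can complete a fixed function, and that are stored in the memory of the electronic device.
In this embodiment, the functions of the modules/units are as follows:
the unsupervised training module 101 is configured to perform label calculation on a current information set according to the correspondence between a historical information set and a historical label set, to obtain a current label set;
the supervised training module 102 is configured to perform label adjustment on the current label set according to a preset adjustment factor, to obtain a standard label set;
the predicted text module 103 is configured to extract label features from the standard label set according to a convolutional neural network feature extraction technique, to obtain a feature extraction set;
the text error correction module 104 is configured to use the feature extraction set as the input of a trained classification neural network for classification prediction, to obtain an information classification result.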
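FIG. 5 is a schematic structural diagram of an electronic device implementing the intelligent text error correction method of this application.
The electronic device 1 may include a processor 10, a memory 11, and a bus, and may further include a computer program stored in the memory 11 and executable on the processor 10, such as an intelligent text error correction program 12.
The memory 11 includes at least one type of readable storage medium, including flash memory, removable hard disk, multimedia card, card-type memory (for example, SD or DX memory), magnetic memory, magnetic disk, optical disk, and so on. In some embodiments the memory 11 may be an internal storage unit of the electronic device 1, for example a removable hard disk of the electronic device 1. In other embodiments the memory 11 may be an external storage device of the electronic device 1, for example a plug-in removable hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, or a flash card provided on the electronic device 1. Further, the memory 11 may include both an internal storage unit and an external storage device of the electronic device 1. The memory 11 can be used not only to store application software installed on the electronic device 1 and various data, such as the code for intelligent text error correction, but also to temporarily store data that has been output or is to be output. The readable storage medium may be non-volatile or volatile.
In some embodiments the processor 10 may be composed of integrated circuits, for example a single packaged integrated circuit or multiple packaged integrated circuits with the same or different functions, including combinations of one or more central processing units (CPU), microprocessors, digital processing chips, graphics processors, and various control chips. The processor 10 is the control unit of the electronic device. It connects the components of the entire electronic device through various interfaces and lines, and executes the functions of the electronic device 1 and processes data by running or executing the programs or modules stored in the memory 11 (for example, performing intelligent text error correction) and by calling the data stored in the memory 11.
The bus may be a peripheral component interconnect (PCI) bus, an extended industry standard architecture (EISA) bus, or the like. The bus may be divided into an address bus, a data bus, a control bus, and so on. The bus is configured to provide connection and communication between the memory 11, the at least one processor 10, and other components.
FIG. 5 shows only an electronic device with certain components. Those skilled in the art will understand that the structure shown in FIG. 5 does not limit the electronic device 1, which may include fewer or more components than shown, a combination of certain components, or a different arrangement of components.
For example, although not shown, the electronic device 1 may further include a power supply (such as a battery) for powering the components. Preferably, the power supply is logically connected to the at least one processor 10 through a power management apparatus, so that functions such as charge management, discharge management, and power consumption management are implemented through the power management apparatus. The power supply may further include any of one or more DC or AC power sources, a recharging apparatus, a power failure detection circuit, a power converter or inverter, a power status indicator, and similar components. The electronic device 1 may also include various sensors, a Bluetooth module, a Wi-Fi module, and so on, which are not described further here.
Further, the electronic device 1 may include a network interface. Optionally, the network interface may include a wired interface and/or a wireless interface (such as a Wi-Fi interface or a Bluetooth interface), typically used to establish a communication connection between the electronic device 1 and other electronic devices.
Optionally, the electronic device 1 may further include a user interface, which may be a display or an input unit (such as a keyboard); optionally, the user interface may also be a standard wired or wireless interface. Optionally, in some embodiments, the display may be an LED display, a liquid crystal display, a touch-sensitive liquid crystal display, an OLED (Organic Light-Emitting Diode) touch device, or the like. The display may also be called a display screen or display unit, and is used to display information processed in the electronic device 1 and to display a visual user interface.
It should be understood that the embodiments are for illustration only, and the scope of the patent application is not limited by this structure.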
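The intelligent text error correction program 12 stored in the memory 11 of the electronic device 1 is a combination of multiple instructions which, when run on the processor 10, can implement:
performing unsupervised training on a pre-built original text error correction model using an unlabeled text set to obtain a primary text error correction model;
performing supervised training on the primary text error correction model using a labeled text set to obtain a standard text error correction model;
obtaining text to be corrected, performing a text masking operation on the text to be corrected to obtain one or more sets of masked text, and inputting the masked text into the standard text error correction model to obtain predicted text and a predicted probability value of the predicted text;
when the predicted text differs from the text to be corrected and the predicted probability value is greater than the preset probability value, performing text error correction on the text to be corrected according to the predicted text.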
Specifically, the processor 10 implements the above instructions through Steps 1 to 7. These are carried out in the same way as steps S1 to S7 of the method embodiment described above, including the text vector conversion, the unsupervised and supervised training, the text masking operation, and the comparison of the predicted probability value with the preset probability value, and the details are not repeated here.
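Further, if the integrated module/unit of the electronic device 1 is implemented in the form of a software functional unit and sold or used as an independent product, it can be stored in a non-volatile computer-readable storage medium. The computer-readable medium may include: any entity or device capable of carrying the computer program code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disk, computer memory, or read-only memory (ROM, Read-Only Memory).
In the several embodiments provided in this application, it should be understood that the disclosed device, apparatus, and method may be implemented in other ways. For example, the apparatus embodiments described above are only illustrative; the division into modules is only a division by logical function, and other divisions are possible in actual implementation.
The modules described as separate components may or may not be physically separate, and the components shown as modules may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, the functional modules in the embodiments of this application may be integrated into one processing unit, each unit may exist physically on its own, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware, or in the form of hardware plus software functional modules.
It is obvious to those skilled in the art that this application is not limited to the details of the above exemplary embodiments, and that this application can be implemented in other specific forms without departing from its spirit or essential characteristics.
Therefore, the embodiments should in every respect be regarded as exemplary and non-limiting. The scope of this application is defined by the appended claims rather than by the above description, and all changes that fall within the meaning and scope of equivalents of the claims are therefore intended to be included in this application. Any reference sign in a claim should not be regarded as limiting the claim concerned.
In addition, the word "include" obviously does not exclude other units or steps, and the singular does not exclude the plural. Multiple units or apparatuses stated in a system claim may also be implemented by one unit or apparatus through software or hardware. Words such as "second" are used to denote names and do not denote any particular order.
Finally, it should be noted that the above embodiments are only intended to illustrate the technical solutions of this application and not to limit them. Although this application has been described in detail with reference to the preferred embodiments, those of ordinary skill in the art should understand that the technical solutions of this application may be modified or equivalently replaced without departing from the spirit and scope of the technical solutions of this application.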

Claims (20)

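  1. An intelligent text error correction method, wherein the method comprises:
    performing unsupervised training on a pre-built original text error correction model using an unlabeled text set to obtain a primary text error correction model;
    performing supervised training on the primary text error correction model using a labeled text set to obtain a standard text error correction model;
    obtaining text to be corrected, performing a text masking operation on the text to be corrected to obtain one or more sets of masked text, and inputting the masked text into the standard text error correction model to obtain predicted text and a predicted probability value of the predicted text;
    when the predicted text differs from the text to be corrected and the predicted probability value is greater than the preset probability value, performing text error correction on the text to be corrected according to the predicted text.
  2. The intelligent text error correction method according to claim 1, wherein performing unsupervised training on the pre-built original text error correction model using the unlabeled text set to obtain the primary text error correction model comprises:
    converting the unlabeled text set into a text vector set according to a pre-built text vector conversion method;
    performing position labeling on the unlabeled text set to obtain a text position set;
    converting the text position set into a position vector set according to the text vector conversion method;
    inputting the position vector set and the text vector set into the original text error correction model for unsupervised training until the number of training iterations of the unsupervised training meets a preset training requirement, and exiting training to obtain the primary text error correction model.
  3. The intelligent text error correction method according to claim 2, wherein converting the unlabeled text set into the text vector set according to the pre-built text vector conversion method comprises:
    converting the unlabeled text set into the text vector set using the following conversion method:
    Figure PCTCN2020093557-appb-100001
    where ω denotes the path value of the text decision tree based on the text vector conversion method; j denotes the index into the unlabeled text set and is a positive integer; ζ(ω,j) denotes, under path ω, the text vector of the j-th unlabeled text in the unlabeled text set;
    Figure PCTCN2020093557-appb-100002
    denotes the Huffman code corresponding to the j-th node within path ω; the path ω is a positive integer; θ is the iteration factor of the text vector conversion method; σ denotes the sigmoid function; and X_ω is the unlabeled text set.
  4. The intelligent text error correction method according to claim 2, wherein inputting the position vector set and the text vector set into the original text error correction model for unsupervised training comprises:
    dividing the text vector set into multiple word vector sets, using vector data as the unit of division;
    dividing the text vector set into multiple paragraph vector sets, using vector rows as the unit of division;
    calculating the weight relationship among each word vector set, each paragraph vector set, and the position vector set, and updating the internal parameters of the original text error correction model according to the weight relationship.
  5. The intelligent text error correction method according to claim 4, wherein calculating the weight relationship among each word vector set, each paragraph vector set, and the position vector set comprises:
    selecting in turn any one vector from the word vector sets, the paragraph vector sets, and the position vector set as a target vector;
    performing the text masking operation on the target vector to obtain a masked vector;
    calculating the weights between the masked vector and the other vectors in the word vector sets, the paragraph vector sets, and the position vector set to obtain a weight set, and performing weighted fusion on the weight set to obtain the weight relationship.
  6. The intelligent text error correction method according to any one of claims 2 to 5, wherein performing supervised training on the primary text error correction model using the labeled text set to obtain the standard text error correction model comprises:
    extracting the labels of the labeled text from the labeled text set to obtain a true label set;
    converting the labeled text set into a labeled text vector set according to the text vector conversion method;
    inputting the labeled text vector set into the primary text error correction model for supervised training to obtain a predicted label set;
    if the error between the predicted label set and the true label set is greater than a preset error, continuing the supervised training until the error between the predicted label set and the true label set is less than the preset error, and exiting the supervised training to obtain the standard text error correction model.
  7. The intelligent text error correction method according to any one of claims 1 to 5, wherein the method further comprises:
    when the predicted text is the same as the text to be corrected, receiving new text to be corrected; or
    when the predicted text differs from the text to be corrected and the predicted probability value is less than the preset probability value, receiving new text to be corrected.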
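  8. An intelligent text error correction apparatus, wherein the apparatus comprises:
    an unsupervised training module, configured to perform label calculation on a current information set according to the correspondence between a historical information set and a historical label set, to obtain a current label set;
    a supervised training module, configured to perform label adjustment on the current label set according to a preset adjustment factor, to obtain a standard label set;
    a predicted text module, configured to extract label features from the standard label set according to a convolutional neural network feature extraction technique, to obtain a feature extraction set;
    a text error correction module, configured to use the feature extraction set as the input of a trained classification neural network for classification prediction, to obtain an information classification result.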
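  9. An electronic device, wherein the electronic device comprises:
    at least one processor; and
    a memory communicatively connected to the at least one processor; wherein
    the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor to enable the at least one processor to perform the following intelligent text error correction method:
    performing unsupervised training on a pre-built original text error correction model using an unlabeled text set to obtain a primary text error correction model;
    performing supervised training on the primary text error correction model using a labeled text set to obtain a standard text error correction model;
    obtaining text to be corrected, performing a text masking operation on the text to be corrected to obtain one or more sets of masked text, and inputting the masked text into the standard text error correction model to obtain predicted text and a predicted probability value of the predicted text;
    when the predicted text differs from the text to be corrected and the predicted probability value is greater than the preset probability value, performing text error correction on the text to be corrected according to the predicted text.
  10. The electronic device according to claim 9, wherein performing unsupervised training on the pre-built original text error correction model using the unlabeled text set to obtain the primary text error correction model comprises:
    converting the unlabeled text set into a text vector set according to a pre-built text vector conversion method;
    performing position labeling on the unlabeled text set to obtain a text position set;
    converting the text position set into a position vector set according to the text vector conversion method;
    inputting the position vector set and the text vector set into the original text error correction model for unsupervised training until the number of training iterations of the unsupervised training meets a preset training requirement, and exiting training to obtain the primary text error correction model.
  11. The electronic device according to claim 10, wherein converting the unlabeled text set into the text vector set according to the pre-built text vector conversion method comprises:
    converting the unlabeled text set into the text vector set using the following conversion method:
    Figure PCTCN2020093557-appb-100003
    where ω denotes the path value of the text decision tree based on the text vector conversion method; j denotes the index into the unlabeled text set and is a positive integer; ζ(ω,j) denotes, under path ω, the text vector of the j-th unlabeled text in the unlabeled text set;
    Figure PCTCN2020093557-appb-100004
    denotes the Huffman code corresponding to the j-th node within path ω; the path ω is a positive integer; θ is the iteration factor of the text vector conversion method; σ denotes the sigmoid function; and X_ω is the unlabeled text set.
  12. The electronic device according to claim 10, wherein inputting the position vector set and the text vector set into the original text error correction model for unsupervised training comprises:
    dividing the text vector set into multiple word vector sets, using vector data as the unit of division;
    dividing the text vector set into multiple paragraph vector sets, using vector rows as the unit of division;
    calculating the weight relationship among each word vector set, each paragraph vector set, and the position vector set, and updating the internal parameters of the original text error correction model according to the weight relationship.
  13. The electronic device according to claim 12, wherein calculating the weight relationship among each word vector set, each paragraph vector set, and the position vector set comprises:
    selecting in turn any one vector from the word vector sets, the paragraph vector sets, and the position vector set as a target vector;
    performing the text masking operation on the target vector to obtain a masked vector;
    calculating the weights between the masked vector and the other vectors in the word vector sets, the paragraph vector sets, and the position vector set to obtain a weight set, and performing weighted fusion on the weight set to obtain the weight relationship.
  14. The electronic device according to any one of claims 10 to 13, wherein performing supervised training on the primary text error correction model using the labeled text set to obtain the standard text error correction model comprises:
    extracting the labels of the labeled text from the labeled text set to obtain a true label set;
    converting the labeled text set into a labeled text vector set according to the text vector conversion method;
    inputting the labeled text vector set into the primary text error correction model for supervised training to obtain a predicted label set;
    if the error between the predicted label set and the true label set is greater than a preset error, continuing the supervised training until the error between the predicted label set and the true label set is less than the preset error, and exiting the supervised training to obtain the standard text error correction model.
  15. The electronic device according to any one of claims 9 to 13, wherein the method further comprises:
    when the predicted text is the same as the text to be corrected, receiving new text to be corrected; or
    when the predicted text differs from the text to be corrected and the predicted probability value is less than the preset probability value, receiving new text to be corrected.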
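  16. A computer-readable storage medium storing a computer program, wherein the computer program, when executed by a processor, implements the following intelligent text error correction method:
    performing unsupervised training on a pre-built original text error correction model using an unlabeled text set to obtain a primary text error correction model;
    performing supervised training on the primary text error correction model using a labeled text set to obtain a standard text error correction model;
    obtaining text to be corrected, performing a text masking operation on the text to be corrected to obtain one or more sets of masked text, and inputting the masked text into the standard text error correction model to obtain predicted text and a predicted probability value of the predicted text;
    when the predicted text differs from the text to be corrected and the predicted probability value is greater than the preset probability value, performing text error correction on the text to be corrected according to the predicted text.
  17. The computer-readable storage medium according to claim 16, wherein performing unsupervised training on the pre-built original text error correction model using the unlabeled text set to obtain the primary text error correction model comprises:
    converting the unlabeled text set into a text vector set according to a pre-built text vector conversion method;
    performing position labeling on the unlabeled text set to obtain a text position set;
    converting the text position set into a position vector set according to the text vector conversion method;
    inputting the position vector set and the text vector set into the original text error correction model for unsupervised training until the number of training iterations of the unsupervised training meets a preset training requirement, and exiting training to obtain the primary text error correction model.
  18. The computer-readable storage medium according to claim 17, wherein converting the unlabeled text set into the text vector set according to the pre-built text vector conversion method comprises:
    converting the unlabeled text set into the text vector set using the following conversion method:
    Figure PCTCN2020093557-appb-100005
    where ω denotes the path value of the text decision tree based on the text vector conversion method; j denotes the index into the unlabeled text set and is a positive integer; ζ(ω,j) denotes, under path ω, the text vector of the j-th unlabeled text in the unlabeled text set;
    Figure PCTCN2020093557-appb-100006
    denotes the Huffman code corresponding to the j-th node within path ω; the path ω is a positive integer; θ is the iteration factor of the text vector conversion method; σ denotes the sigmoid function; and X_ω is the unlabeled text set.
  19. The computer-readable storage medium according to claim 17, wherein inputting the position vector set and the text vector set into the original text error correction model for unsupervised training comprises:
    dividing the text vector set into multiple word vector sets, using vector data as the unit of division;
    dividing the text vector set into multiple paragraph vector sets, using vector rows as the unit of division;
    calculating the weight relationship among each word vector set, each paragraph vector set, and the position vector set, and updating the internal parameters of the original text error correction model according to the weight relationship.
  20. The computer-readable storage medium according to claim 19, wherein calculating the weight relationship among each word vector set, each paragraph vector set, and the position vector set comprises:
    selecting in turn any one vector from the word vector sets, the paragraph vector sets, and the position vector set as a target vector;
    performing the text masking operation on the target vector to obtain a masked vector;
    calculating the weights between the masked vector and the other vectors in the word vector sets, the paragraph vector sets, and the position vector set to obtain a weight set, and performing weighted fusion on the weight set to obtain the weight relationship.
PCT/CN2020/093557 2020-04-23 2020-05-30 Intelligent text error correction method, apparatus, electronic device, and readable storage medium WO2021212612A1 (zh)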

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010329725.0 2020-04-23
CN202010329725.0A CN111626047A (zh) 2020-04-23 2020-04-23 Intelligent text error correction method, apparatus, electronic device, and readable storage medium

Publications (1)

Publication Number Publication Date
WO2021212612A1 true WO2021212612A1 (zh) 2021-10-28

Family

ID=72271736

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/093557 WO2021212612A1 (zh) 2020-04-23 2020-05-30 Intelligent text error correction method, apparatus, electronic device, and readable storage medium

Country Status (2)

Country Link
CN (1) CN111626047A (zh)
WO (1) WO2021212612A1 (zh)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
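CN114398876A (zh) * 2022-03-24 2022-04-26 北京沃丰时代数据科技有限公司 Text error correction method and apparatus based on a finite-state transducer
CN114792085A (zh) * 2022-06-22 2022-07-26 中科雨辰科技有限公司 Data processing system for error correction of annotated text
CN117875313A (zh) * 2024-03-12 2024-04-12 长沙市智为信息技术有限公司 Chinese grammatical error correction method and system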

Families Citing this family (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
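CN112214602B (zh) * 2020-10-23 2023-11-10 中国平安人寿保险股份有限公司 Humor-based text classification method and apparatus, electronic device, and storage medium
CN112269875B (zh) * 2020-10-23 2023-07-25 中国平安人寿保险股份有限公司 Text classification method and apparatus, electronic device, and storage medium
CN112650843A (zh) * 2020-12-23 2021-04-13 平安银行股份有限公司 Method, apparatus, device, and storage medium for constructing a question-answer pair knowledge base
CN112905737B (zh) * 2021-01-28 2023-07-28 平安科技(深圳)有限公司 Text error correction method, apparatus, device, and storage medium
CN113010635B (zh) * 2021-02-19 2023-05-26 网易(杭州)网络有限公司 Text error correction method and apparatus
CN113515934A (zh) * 2021-04-28 2021-10-19 新东方教育科技集团有限公司 Text error correction method and apparatus, storage medium, and electronic device
CN113807973B (zh) * 2021-09-16 2023-07-25 平安科技(深圳)有限公司 Text error correction method and apparatus, electronic device, and computer-readable storage medium
CN115169330B (zh) * 2022-07-13 2023-05-02 平安科技(深圳)有限公司 Chinese text error correction and verification method, apparatus, device, and storage medium
CN115630634B (zh) * 2022-12-08 2023-03-14 深圳依时货拉拉科技有限公司 Text error correction method and apparatus, electronic device, and storage medium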

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180075844A1 (en) * 2016-09-09 2018-03-15 Electronics And Telecommunications Research Institute Speech recognition system and method
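CN108021931A (zh) * 2017-11-20 2018-05-11 阿里巴巴集团控股有限公司 Data sample label processing method and apparatus
CN108959252A (zh) * 2018-06-28 2018-12-07 中国人民解放军国防科技大学 Semi-supervised Chinese named entity recognition method based on deep learning
CN110619119A (zh) * 2019-07-23 2019-12-27 平安科技(深圳)有限公司 Intelligent text editing method and apparatus, and computer-readable storage medium
CN111046652A (zh) * 2019-12-10 2020-04-21 拉扎斯网络科技(上海)有限公司 Text error correction method, text error correction apparatus, storage medium, and electronic device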

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
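CN114398876A (zh) * 2022-03-24 2022-04-26 北京沃丰时代数据科技有限公司 Text error correction method and apparatus based on a finite-state transducer
CN114398876B (zh) * 2022-03-24 2022-06-14 北京沃丰时代数据科技有限公司 Text error correction method and apparatus based on a finite-state transducer
CN114792085A (zh) * 2022-06-22 2022-07-26 中科雨辰科技有限公司 Data processing system for error correction of annotated text
CN114792085B (zh) * 2022-06-22 2022-09-16 中科雨辰科技有限公司 Data processing system for error correction of annotated text
CN117875313A (zh) * 2024-03-12 2024-04-12 长沙市智为信息技术有限公司 Chinese grammatical error correction method and system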

Also Published As

Publication number Publication date
CN111626047A (zh) 2020-09-04

Similar Documents

Publication Publication Date Title
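WO2021212612A1 (zh) Intelligent text error correction method, apparatus, electronic device, and readable storage medium
WO2021212682A1 (zh) Knowledge extraction method and apparatus, electronic device, and storage medium
WO2022142593A1 (zh) Text classification method and apparatus, electronic device, and readable storage medium
WO2021151345A1 (zh) Method and apparatus for obtaining parameters of a recognition model, electronic device, and storage medium
CN110795938B (zh) Text sequence word segmentation method, apparatus, and storage medium
WO2021073390A1 (zh) Data screening method, apparatus, device, and computer-readable storage medium
WO2021208703A1 (zh) Question parsing method and apparatus, electronic device, and storage medium
WO2022222300A1 (zh) Open relation extraction method and apparatus, electronic device, and storage medium
CN113157927B (zh) Text classification method and apparatus, electronic device, and readable storage medium
CN111539211A (zh) Entity and semantic relation recognition method and apparatus, electronic device, and storage medium
CN113704429A (zh) Intent recognition method, apparatus, device, and medium based on semi-supervised learning
WO2023159755A1 (zh) Fake news detection method, apparatus, device, and storage medium
CN113360654B (zh) Text classification method and apparatus, electronic device, and readable storage medium
WO2022194062A1 (zh) Disease label detection method and apparatus, electronic device, and storage medium
CN113807973B (zh) Text error correction method and apparatus, electronic device, and computer-readable storage medium
CN112988963A (zh) User intent prediction method, apparatus, device, and medium based on multiple process nodes
CN111475645B (zh) Knowledge point annotation method and apparatus, and computer-readable storage medium
WO2023137906A1 (zh) Document title generation method, apparatus, device, and storage medium
CN113627160B (zh) Text error correction method and apparatus, electronic device, and storage medium
CN113344125B (zh) Long text matching and recognition method and apparatus, electronic device, and storage medium
CN113204698A (zh) News topic word generation method, apparatus, device, and medium
CN116341646A (zh) Pre-training method and apparatus for a BERT model, electronic device, and storage medium
WO2022141867A1 (zh) Speech recognition method and apparatus, electronic device, and readable storage medium
CN115146064A (zh) Intent recognition model optimization method, apparatus, device, and storage medium
CN112434157B (zh) Multi-label document classification method and apparatus, electronic device, and storage medium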

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20932581

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20932581

Country of ref document: EP

Kind code of ref document: A1