CN110619119B - Intelligent text editing method and device and computer readable storage medium - Google Patents

Info

Publication number: CN110619119B
Authority: CN (China)
Prior art keywords: text, error, intelligent, training, editing
Legal status: Active (the status listed is an assumption by Google Patents, not a legal conclusion)
Application number: CN201910668831.9A
Other languages: Chinese (zh)
Other versions: CN110619119A
Inventor: 乔佳
Current and original assignee: Ping An Technology Shenzhen Co Ltd
Application filed by Ping An Technology Shenzhen Co Ltd; priority to CN201910668831.9A; publication of application CN110619119A; application granted; publication of grant CN110619119B

Classifications

    • G06F16/35 Information retrieval of unstructured textual data: clustering; classification (G Physics; G06 Computing; G06F Electric digital data processing)
    • G06F18/241 Pattern recognition: classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2413 Classification techniques based on distances to training or reference patterns
    • G06F18/24147 Distances to closest patterns, e.g. nearest neighbour classification
    • G06F18/2415 Classification techniques based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G06F18/24155 Bayesian classification

Abstract

The invention relates to artificial intelligence technology and discloses an intelligent text editing method comprising the following steps: acquiring a correct text set and an error text set, preprocessing the error text set to obtain a standard error text set, and establishing a corresponding label set for the correct text set and the standard error text set; converting the correct text set and the standard error text set into word vectors through a bag-of-words model, and storing the word vectors as a training set in a corpus; training a pre-constructed intelligent text editing model with the training set and the label set to obtain a trained intelligent text editing model; and receiving text data input by a user, intelligently editing the text data input by the user with the trained intelligent text editing model, and outputting the corresponding correct text data. The invention also provides an intelligent text editing device and a computer readable storage medium. The invention realizes intelligent editing of text.

Description

Intelligent text editing method and device and computer readable storage medium
Technical Field
The invention relates to the technical field of artificial intelligence, and in particular to an intelligent text editing method and device for text error correction, and a computer readable storage medium.
Background
With the increasing informatization of society, people increasingly want to interact with computers in natural language. Natural language processing is an attractive and challenging topic in computer science. From the point of view of computer science, and of artificial intelligence in particular, the task of natural language processing is to build a computer model that understands, analyzes, and answers natural language (the everyday languages people use) with human-like results.
Natural language processing studies how to make a computer understand and generate the languages people use every day (such as Chinese and English), so that the computer can grasp the meaning of natural language and answer, through dialogue, the questions people put to it in natural language. In text editing, natural language processing also has great potential for correcting text errors. The existing correction approach is a language proofreading method in which a speech synthesis system reads an input sentence aloud while the typist or a proofreader checks it against the original manuscript. This method can find differences between the input manuscript and the original and reduce the proofreading workload, but it cannot detect homophone errors, provides no error prompting function, and cannot find punctuation errors in the original manuscript.
Disclosure of Invention
The invention provides an intelligent text editing method and device and a computer readable storage medium, with the main aim of presenting intelligently corrected text to the user when the user edits text.
In order to achieve the above object, the present invention provides an intelligent text editing method, which includes:
receiving a correct text set and an error text set, performing a preprocessing operation on the error text set to obtain a standard error text set, and establishing a corresponding label set for the correct text set and the standard error text set;
converting the correct text set and the standard error text set into word vectors through a bag-of-words model, and storing the word vectors as a training set in a corpus;
inputting the training set and the label set into a pre-constructed intelligent text editing model, training the intelligent text editing model by using the training set to obtain a training value, inputting the training value and the label set into a loss function of the intelligent text editing model to obtain a loss function value, and quitting training of the intelligent text editing model when the loss function value is smaller than a preset threshold value;
and receiving text data input by a user, intelligently editing the text data input by the user by using the intelligent text editing model, and outputting corresponding correct text data.
Optionally, the preprocessing operation comprises:
performing word segmentation processing on the error text set to obtain a word segmentation result, and performing punctuation proofreading on the error text set using the word segmentation result and punctuation proofreading rules, to obtain the set of erroneous punctuation marks in the error text set;
and checking the word continuation relations near the target word strings of the error text set by establishing an N-gram model using the binary word continuation relation, to obtain the erroneous word strings of the error text set.
Optionally, the word segmentation processing includes:
segmenting the error text set by using a full segmentation method to obtain a plurality of word segmentation modes;
and calculating the probability of each word segmentation mode according to the Markov property, and selecting the word segmentation result of the mode with the highest probability as the word segmentation result of the error text set.
Optionally, converting the correct text set and the standard error text set into word vectors through a bag-of-words model includes:
calculating the distance between the data objects of the correct text set and the standard error text set with the Euclidean formula, and presetting n clusters according to a clustering algorithm, where the cluster center of the kth cluster is Center_k; calculating the distance from each data object of the correct text set and the standard error text set to each of the n cluster centers, and obtaining the features of each data object at each cluster center;
and training the features with a classifier and calculating the probability of each data object at each cluster center, thereby converting the correct text set and the standard error text set into word vectors.
Optionally, the training the intelligent text editing model by using the training set to obtain a training value includes:
inputting the training set into the input layer of the convolutional neural network of the intelligent text editing model, and performing a convolution operation on the training set through a group of filters preset in the convolutional layer of the network to extract feature vectors;
and performing a pooling operation on the feature vectors with the pooling layer of the convolutional neural network, inputting the pooled feature vectors to the fully connected layer, and normalizing and computing them through an activation function to obtain a training value.
In addition, in order to achieve the above object, the present invention further provides an intelligent text editing apparatus, which includes a memory and a processor, where the memory stores an intelligent text editing program that can run on the processor, and when the program is executed by the processor, the following steps are implemented:
receiving a correct text set and an error text set, performing a preprocessing operation on the error text set to obtain a standard error text set, and establishing a corresponding label set for the correct text set and the standard error text set;
converting the correct text set and the standard error text set into word vectors through a bag-of-words model, and storing the word vectors as a training set in a corpus;
inputting the training set and the label set into a pre-constructed intelligent text editing model, training the intelligent text editing model by using the training set to obtain a training value, inputting the training value and the label set into a loss function of the intelligent text editing model to obtain a loss function value, and quitting training of the intelligent text editing model when the loss function value is smaller than a preset threshold value;
and receiving text data input by a user, intelligently editing the text data input by the user by using the intelligent text editing model, and outputting corresponding correct text data.
Optionally, the preprocessing operation comprises:
performing word segmentation processing on the error text set to obtain a word segmentation result, and performing punctuation proofreading on the error text set using the word segmentation result and punctuation proofreading rules, to obtain the set of erroneous punctuation marks in the error text set;
and checking the word continuation relations near the target word strings of the error text set by establishing an N-gram model using the binary word continuation relation, to obtain an erroneous word string set of the error text set.
Optionally, the word segmentation processing includes:
segmenting the error text set by using a full segmentation method to obtain a plurality of word segmentation modes;
and calculating the probability of each word segmentation mode according to the Markov property, and selecting the word segmentation result of the mode with the highest probability as the word segmentation result of the error text set.
Optionally, converting the correct text set and the standard error text set into word vectors through a bag-of-words model includes:
calculating the distance between the data objects of the correct text set and the standard error text set with the Euclidean formula, and presetting n clusters according to a clustering algorithm, where the cluster center of the kth cluster is Center_k; calculating the distance from each data object of the correct text set and the standard error text set to each of the n cluster centers, and obtaining the features of each data object at each cluster center;
and training the features with a classifier and calculating the probability of each data object at each cluster center, thereby converting the correct text set and the standard error text set into word vectors.
In addition, to achieve the above object, the present invention also provides a computer readable storage medium storing an intelligent text editing program, where the program can be executed by one or more processors to implement the steps of the intelligent text editing method described above.
According to the intelligent text editing method and device and the computer readable storage medium of the invention, when a user edits a text containing errors, the received correct text set and error text set, together with the established label set, are used to train a pre-constructed intelligent text editing model; the text containing errors is then input into the trained model, so that an accurate editing result can be presented to the user.
Drawings
Fig. 1 is a schematic flowchart of an intelligent text editing method according to an embodiment of the present invention;
fig. 2 is a schematic diagram of an internal structure of an intelligent text editing apparatus according to an embodiment of the present invention;
fig. 3 is a schematic block diagram of the intelligent text editing program in the intelligent text editing apparatus according to an embodiment of the present invention.
The implementation, functional features and advantages of the objects of the present invention will be further described with reference to the accompanying drawings.
Detailed Description
It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
The invention provides an intelligent text editing method. Fig. 1 is a schematic flow chart of the intelligent text editing method according to an embodiment of the present invention. The method may be performed by an apparatus, which may be implemented by software and/or hardware.
In this embodiment, the intelligent text editing method includes:
s1, receiving a correct text set and an error text set, preprocessing the error text set to obtain a standard error text set, and establishing a corresponding label set for the correct text set and the standard error text set.
In a preferred embodiment of the present invention, the correct text set and the error text set contain the same text data, but the error text set contains errors such as wrong words or grammatical ambiguities, while the correct text set contains no such errors.
Further, in a preferred embodiment of the present invention, the preprocessing operation includes: performing word segmentation processing on the error text set to obtain word segmentation results; performing punctuation proofreading on the error text set according to the punctuation proofreading rules, to obtain and label the set of erroneous punctuation marks in the text; and checking the word continuation relations near the target word strings by establishing an N-gram model using the binary word continuation relation, to obtain and label the erroneous word string set of the error text set. The specific implementation steps of the preprocessing are as follows:
a. Performing word segmentation processing on the error text set to obtain a word segmentation result.
In the preferred embodiment of the invention, word segmentation processing is performed on the error text set through a Markov model to obtain a word segmentation result.
The Markov model is a statistical model widely applied in natural language processing, in fields such as speech recognition, automatic part-of-speech tagging, phonetic-to-character conversion, and probabilistic grammar. In the preferred embodiment of the present invention, a sentence in the error text set is preset as S; the sentence S is segmented with a full segmentation method to obtain all possible Chinese word segmentation modes; the probability of each segmentation mode is calculated according to the Markov property; and the segmentation result of the mode with the highest probability is selected as the final text word segmentation result.
The Markov property means that the probability of the ith word appearing in the text is related only to the n−1 words appearing before it, not to the words that follow it. Thus, for the sentence S formed by the word sequence $\{W_1, W_2, \ldots, W_m\}$, the probability that the ith word $W_i$ appears, given that the preceding words have appeared, is:
$$P(W_i \mid W_1, \ldots, W_{i-1}) = P(W_i \mid W_{i-n+1}, \ldots, W_{i-1})$$
Therefore, the probability of the sentence S arranged in this word order is:
$$P(S) = P(W_1 W_2 \ldots W_m) = P(W_1)\,P(W_2 \mid W_1) \cdots P(W_m \mid W_{m-n+1}, \ldots, W_{m-1})$$
where the conditional probability $P(W_m \mid W_{m-n+1}, \ldots, W_{m-1})$ represents the probability that $W_m$ appears given that the string $W_{m-n+1}, \ldots, W_{m-1}$ has appeared; it is determined with a binary (bigram) language model trained on a large-scale corpus, so the probability model of the sentence S is:
$$P(S) = P(W_1) \prod_{i=2}^{m} P(W_i \mid W_{i-1})$$
The invention selects, among all the calculated values of P(S), the word segmentation corresponding to the maximum of P(S) as the word segmentation result of the scheme:
$$\hat{W} = \arg\max_{W_1 \ldots W_m} P(W_1 W_2 \ldots W_m)$$
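As an illustration of the procedure above, the following is a minimal Python sketch of full segmentation scored by a bigram Markov model. The toy lexicon and the floor value for unseen bigrams are assumptions made for this example; a real system would estimate the unigram and bigram probabilities from a large-scale corpus.

```python
# Sketch: enumerate all segmentations of a sentence over a lexicon (full
# segmentation), then pick the one maximizing the bigram Markov probability
# P(S) = P(W1) * prod_i P(Wi | Wi-1). Lexicon and probabilities are toy data.
LEXICON = {"南京", "市长", "南京市", "长江", "大桥", "长江大桥"}

def full_segmentations(sent, lexicon=LEXICON, max_word_len=4):
    """Return every way of splitting `sent` into lexicon words."""
    if not sent:
        return [[]]
    results = []
    for i in range(1, min(max_word_len, len(sent)) + 1):
        word = sent[:i]
        if word in lexicon:
            for rest in full_segmentations(sent[i:], lexicon, max_word_len):
                results.append([word] + rest)
    return results

def sentence_prob(words, unigram, bigram, floor=1e-8):
    """P(S) under the bigram model; `floor` stands in for unseen events."""
    p = unigram.get(words[0], floor)
    for prev, cur in zip(words, words[1:]):
        p *= bigram.get((prev, cur), floor)
    return p

def best_segmentation(sent, unigram, bigram):
    """Select the segmentation with the maximum P(S), as in the formula above."""
    return max(full_segmentations(sent),
               key=lambda ws: sentence_prob(ws, unigram, bigram))
```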
b. Performing punctuation proofreading on the error text set according to the punctuation proofreading rules, to obtain and label the set of erroneous punctuation marks in the error text set.
In the preferred embodiment of the invention, punctuation marks in the error text set are proofread with a punctuation-driven method that targets specific error types, applies preset rules, scans in multiple passes, and combines context.
In detail, the invention proofreads the error text set sentence by sentence, paragraph by paragraph, and over the full text by constructing a local analyzer. Preferably, the principle of the local analyzer is as follows: the error text set is divided into single sentences according to the punctuation marks, and these sentences are input into the local analyzer in text order. If a sentence conforms to the language rules within the local range, it passes normally; if a local anomaly is found, the analyzer refuses to accept it and judges the text erroneous; this continues until the whole error text set has been input into the local analyzer. For each punctuation mark appearing in the text, the local analyzer determines which type the mark belongs to, judges with the corresponding proofreading rule whether the mark is erroneous, and stores an error correction suggestion in the error correction suggestion buffer. The proofreading rules are as follows:
When the punctuation mark being proofread is a comma: if a punctuation mark other than a quotation mark stands at the position immediately before the comma, or immediately after it, the mark is shown in italics in the text to indicate an error, and an error correction suggestion ("redundant punctuation; delete this mark") is stored in the buffer. Proofreading then continues sequentially downward through the punctuation marks.
When the punctuation mark being proofread is a pause mark (the Chinese enumeration comma), the judgment uses automatic word segmentation and part-of-speech tagging combined with context information. If the words immediately before and after the pause mark are both numerals, the mark is shown in italics in the text to indicate an error, and an error correction suggestion ("redundant punctuation; delete the pause mark") is stored in the buffer. Proofreading then continues sequentially downward through the punctuation marks.
When the punctuation mark being proofread is an ellipsis, the following three cases are considered:
(1) if the ellipsis is immediately preceded by a punctuation mark other than "。", "!", or "?", that mark is shown in italics in the text to indicate an error, and the error correction suggestion ("redundant punctuation; delete the preceding mark") is stored in the buffer;
(2) if a punctuation mark immediately follows the ellipsis, that mark is shown in italics in the text to indicate an error, and the error correction suggestion ("redundant punctuation; delete the following mark") is stored in the buffer;
(3) if the ellipsis is followed by one of the expressions meaning "etc.", "and so on", or "the like", the ellipsis is shown in italics in the text to indicate an error, and the error correction suggestion ("redundant punctuation; delete the ellipsis") is stored in the buffer. Proofreading then continues sequentially downward through the punctuation marks.
For the error correction suggestions stored in the error correction suggestion buffer under these punctuation proofreading rules, the preferred embodiment of the present invention, once punctuation proofreading is completed, displays the corresponding erroneous punctuation marks in the interface in the order in which the errors occur in the sentences, thereby obtaining the set of erroneous punctuation marks in the error text set.
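To make the buffered-suggestion mechanism concrete, here is a simplified Python sketch of just the comma rule described above. The punctuation sets and the suggestion text are illustrative assumptions; a full analyzer would add the pause-mark and ellipsis rules in the same pattern.

```python
# Sketch of the comma proofreading rule: flag a comma immediately preceded
# or followed by another punctuation mark (quotation marks excepted) and
# store an error correction suggestion in a buffer.
PUNCT = set("，。、！？；：…,.!?;:")
QUOTES = set("\"'“”‘’")

def check_commas(sentence):
    suggestions = []  # plays the role of the error correction suggestion buffer
    for i, ch in enumerate(sentence):
        if ch not in "，,":
            continue
        prev_ch = sentence[i - 1] if i > 0 else ""
        next_ch = sentence[i + 1] if i + 1 < len(sentence) else ""
        if (prev_ch in PUNCT - QUOTES) or (next_ch in PUNCT - QUOTES):
            suggestions.append((i, "redundant punctuation: delete this mark"))
    return suggestions
```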
c. Checking the word continuation relations near the target word strings of the error text set by establishing an N-gram model using the binary word continuation relation, to obtain and label the erroneous word string set of the error text set.
The continuation relation refers to the adjacency relation between words. The binary continuation relation refers to examining, in the character string $z_1 z_2 z_3 \ldots z_{i-1} z_i \ldots z_n$, the adjacency relations of $z_i$ with its neighboring words: according to the N-gram model of corpus linguistics, the bigram model obtained when N = 2 only needs to consider the relation between $z_{i-1}$ and $z_i$ and the relation between $z_i$ and $z_{i+1}$. The invention analyzes and processes a large-scale corpus; when $p(z_i \mid z_{i-1})$ satisfies a certain threshold, $z_{i-1}$ and $z_i$ are judged continuous, and from the result of this continuation judgment it is identified whether the character string $z_i$ is erroneous. The preferred embodiment of the present invention first checks the continuation relation between $z_{i-1}$ and $z_i$; if they are not continuous, it then checks the relation between $z_i$ and $z_{i+1}$; if that relation is also not continuous, the character string $z_i$ is judged erroneous.
In detail, the preferred embodiment of the present invention presets a sentence in the error text set as $S = z_1 z_2 z_3 \ldots z_{i-1} z_i \ldots z_n$, where $z_i$ and $z_{i+1}$ are two adjacent character strings, the capacity of the Chinese corpus is N, the number of times $z_i$ and $z_{i+1}$ appear adjacent is $r(z_i, z_{i+1})$, and the numbers of independent occurrences of $z_i$ and $z_{i+1}$ are $r(z_i)$ and $r(z_{i+1})$ respectively. The probabilities of $z_i$ and $z_{i+1}$ occurring independently are then:
$$p(z_i) = r(z_i)/N, \qquad p(z_{i+1}) = r(z_{i+1})/N;$$
and the co-occurrence probability of $z_i$ and $z_{i+1}$ as neighbors is:
$$p(z_i, z_{i+1}) = r(z_i, z_{i+1})/N.$$
When $r(z_i, z_{i+1}) = N \cdot p(z_i, z_{i+1}) \geq \tau$, $z_i$ and $z_{i+1}$ have a high co-occurrence frequency and are judged continuous, indicating that the word string $z_i$ is correct; conversely, when $r(z_i, z_{i+1}) = N \cdot p(z_i, z_{i+1}) < \tau$, the word string $z_i$ is erroneous. Here $\tau$ is a threshold, preset to $\tau = 0.8$. Preferably, the invention obtains the erroneous word string set of the error text set by a traversal check over the error text set.
Further, in the preferred embodiment of the present invention, a standard error text set is obtained according to the error punctuation mark set and the error string set obtained by the preprocessing.
And S2, converting the correct text set and the standard error text set into word vectors through a bag-of-words model, and storing the word vectors as a training set into a corpus.
The bag-of-words model represents text as feature vectors; its basic idea is that, for a given text, word order, grammar, and syntax are ignored, and the text is treated only as a collection of words.
In detail, converting the correct text set and the standard error text set into word vectors through the bag-of-words model in the preferred embodiment of the present invention includes:
A. Calculating the distance between the data objects of the correct text set and the standard error text set with the Euclidean formula.
Preset $x_i$ and $x_j$ as data objects of the correct text set and the standard error text set respectively, and D as the number of attributes of those data objects. The Euclidean formula is:
$$d(x_i, x_j) = \sqrt{\sum_{k=1}^{D} (x_{ik} - x_{jk})^2}$$
B. Presetting n clusters according to a clustering algorithm, where the cluster center of the kth cluster is $Center_k$, a vector containing the attributes of the data objects. The formula for $Center_k$ is:
$$Center_k = \frac{1}{|C_k|} \sum_{x_i \in C_k} x_i$$
where $|C_k|$ denotes the number of data objects in the kth cluster.
Further, using the Euclidean formula and the $Center_k$ update formula, the invention calculates the distance from each data object of the correct text set and the standard error text set to each of the n cluster centers, and obtains the features of each data object at each cluster center.
C. Training the features with a classifier, and calculating the probability of each data object of the correct text set and the standard error text set at each cluster center, thereby converting the correct text set and the standard error text set into word vectors.
The classifier is a naive Bayes classifier: a family of simple probabilistic classifiers that apply Bayes' theorem under a strong (naive) assumption of independence between the features.
In a preferred embodiment of the present invention, the probability of each data object of the correct text set and the standard error text set at a cluster center is calculated as follows:
Assume independence between the features, with a preset data sample $x = (x_1, x_2, \ldots, x_d)^{T}$. The probability of the data belonging to the cluster center $w_i$ is:
$$P(w_i \mid x) \propto P(w_i) \prod_{k=1}^{d} P(x_k \mid w_i)$$
where d is the feature dimension of the data in the preset data sample and $x_k$ is the value of the sample on the kth feature.
The data in the preset data sample are smoothed with the following formula to avoid data sparseness:
$$P(x_k \mid w_i) = \frac{|D_{i,x_k}| + \alpha}{|D_i| + \alpha c_k}$$
where $c_k$ represents the number of possible values of the kth feature and $\alpha$ is a coefficient.
Maximum likelihood estimation gives:
$$P(x_k \mid w_i) = \frac{|D_{i,x_k}|}{|D_i|}$$
where the numerator $|D_{i,x_k}|$ represents the number of samples in the set $D_i$ of cluster center $w_i$ whose kth feature takes the value $x_k$.
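The following numpy sketch ties the three pieces together: cluster centers, Euclidean distance features, and the smoothed naive Bayes estimate. All array shapes and the Laplace coefficient α = 1 are assumptions for illustration.

```python
# Sketch of the word-vector construction: k-means-style centers, distance
# features per cluster, and Laplace-smoothed likelihoods for the naive
# Bayes probability of a text at each cluster center.
import numpy as np

def cluster_centers(X, assign, n_clusters):
    """Center_k: mean of the data objects in cluster k (assumes none empty)."""
    return np.array([X[assign == k].mean(axis=0) for k in range(n_clusters)])

def distance_features(X, centers):
    """Euclidean distance from every data object to every cluster center."""
    return np.sqrt(((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2))

def smoothed_likelihood(count, total, c_k, alpha=1.0):
    """P(x_k | w_i) = (count + alpha) / (total + alpha * c_k), as above."""
    return (count + alpha) / (total + alpha * c_k)
```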
S3, inputting the training set and the label set into a pre-constructed intelligent text editing model, training the intelligent text editing model by using the training set to obtain a training value, inputting the training value and the label set into a loss function of the intelligent text editing model to obtain a loss function value, and quitting training of the intelligent text editing model when the loss function value is smaller than a preset threshold value.
In a preferred embodiment of the present invention, the intelligent text editing model includes a convolutional neural network. A convolutional neural network is a feedforward neural network whose artificial neurons respond to surrounding units within a limited coverage range. Its basic structure comprises two kinds of layers. One is the feature extraction layer: the input of each neuron is connected to a local receptive field of the previous layer, and the local features are extracted; once a local feature is extracted, its positional relation to the other features is also determined. The other is the feature mapping layer: each computation layer of the network is composed of multiple feature maps, each feature map is a plane, and all neurons in a plane have equal weights.
In a preferred embodiment of the present invention, the convolutional neural network comprises an input layer, a convolutional layer, a pooling layer, and an output layer. The input layer of the convolutional neural network model receives the training set and the label set, and a convolution operation is performed on the training set through a group of filters preset in the convolutional layer to extract feature vectors; the filters may be $\{filter_0, filter_1\}$, generating feature sets on similar channels and dissimilar channels respectively. The pooling layer then performs a pooling operation on the feature vectors; the pooled feature vectors are input to the fully connected layer and are normalized and computed through an activation function to obtain a training value; the computation result is input to the output layer, which outputs correct text data. The normalization "compresses" a K-dimensional vector containing arbitrary real numbers into another K-dimensional real vector in which every element lies in (0, 1) and all elements sum to 1.
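A minimal PyTorch sketch of such a network is shown below. The embedding dimension, the two filter widths standing in for {filter_0, filter_1}, and the class count are assumptions; the patent does not fix these values.

```python
# Sketch: input -> two parallel convolution filter groups -> max pooling ->
# fully connected layer -> softmax normalization, mirroring the layer
# sequence described above.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TextEditCNN(nn.Module):
    def __init__(self, embed_dim=128, num_filters=64, num_classes=2):
        super().__init__()
        self.conv0 = nn.Conv1d(embed_dim, num_filters, kernel_size=3, padding=1)
        self.conv1 = nn.Conv1d(embed_dim, num_filters, kernel_size=5, padding=2)
        self.fc = nn.Linear(2 * num_filters, num_classes)

    def forward(self, x):                 # x: (batch, seq_len, embed_dim)
        x = x.transpose(1, 2)             # Conv1d expects (batch, channels, seq_len)
        f0 = F.relu(self.conv0(x))        # filter group 0
        f1 = F.relu(self.conv1(x))        # filter group 1
        p0 = F.max_pool1d(f0, f0.size(2)).squeeze(2)   # pooling layer
        p1 = F.max_pool1d(f1, f1.size(2)).squeeze(2)
        feats = torch.cat([p0, p1], dim=1)
        # Softmax "compresses" the K-dimensional output into (0, 1), summing to 1.
        return F.softmax(self.fc(feats), dim=1)
```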
In the embodiment of the present invention, the activation function is the softmax function, calculated as:
$$O_j = \frac{e^{I_j}}{\sum_{k=1}^{t} e^{I_k}}$$
where $O_j$ represents the correct-text-data output value of the jth neuron of the convolutional neural network output layer, $I_j$ represents the input value of the jth neuron of the output layer, t represents the total number of neurons in the output layer, and e is the base of the natural logarithm.
In a preferred embodiment of the present invention, the preset threshold for the loss function value is 0.01, and the loss function is the least squares method:
$$s = \sum_{i=1}^{k} (y_i - y_i')^2$$
where s is the error value between the output correct text data and the erroneous text data, k is the number of text sets, $y_i$ is the erroneous text data, and $y_i'$ is the correct text data.
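Putting the loss and the exit condition together, a hedged training-loop sketch might look as follows; the optimizer, learning rate, and one-hot float labels are assumptions, with only the sum-of-squares loss and the 0.01 threshold taken from the text.

```python
# Sketch: train until the least-squares loss s = sum_i (y_i - y'_i)^2
# between training values and labels drops below the preset threshold 0.01.
import torch

def train_until_converged(model, train_x, labels, threshold=0.01, max_epochs=1000):
    # `labels` is assumed to be a one-hot float tensor matching the model output.
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    loss_fn = torch.nn.MSELoss(reduction="sum")  # sum of squared errors
    for _ in range(max_epochs):
        optimizer.zero_grad()
        training_values = model(train_x)         # forward pass: the training value
        loss = loss_fn(training_values, labels)
        if loss.item() < threshold:              # exit training when s < 0.01
            break
        loss.backward()
        optimizer.step()
    return model
```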
And S4, receiving text data input by a user, intelligently editing the text data input by the user by using the intelligent text editing model, and outputting corresponding correct text data.
The preferred embodiment of the invention uses the intelligent text editing model to automatically correct and edit the text data input by the user, obtaining corrected text data; it can output both the text data with correction marks and the correct text data.
The invention also provides an intelligent text editing device. Fig. 2 is a schematic diagram of an internal structure of an intelligent text editing apparatus according to an embodiment of the present invention.
In the present embodiment, the intelligent text editing apparatus 1 may be a PC (Personal Computer), a terminal device such as a smartphone, a tablet computer, or a mobile computer, or a server. The intelligent text editing apparatus 1 comprises at least a memory 11, a processor 12, a communication bus 13, and a network interface 14.
The memory 11 includes at least one type of readable storage medium, including flash memory, a hard disk, a multimedia card, a card-type memory (e.g., SD or DX memory), a magnetic memory, a magnetic disk, an optical disk, and the like. In some embodiments, the memory 11 may be an internal storage unit of the intelligent text editing apparatus 1, for example, a hard disk of the apparatus. In other embodiments, the memory 11 may also be an external storage device of the intelligent text editing apparatus 1, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, or a flash memory card (Flash Card) provided on the apparatus. Further, the memory 11 may also include both an internal storage unit and an external storage device of the apparatus 1. The memory 11 can be used not only for storing the application software installed in the intelligent text editing apparatus 1 and various data, such as the code of the intelligent text editing program 01, but also for temporarily storing data that has been output or is to be output.
The processor 12, which in some embodiments may be a Central Processing Unit (CPU), controller, microcontroller, microprocessor, or other data processing chip, is configured to execute program code or process data stored in the memory 11, such as executing the intelligent text editing program 01.
The communication bus 13 is used to realize connection communication between these components.
The network interface 14 may optionally include a standard wired interface, a wireless interface (e.g., a WI-FI interface), and is typically used to establish a communication link between the apparatus 1 and other electronic devices.
Optionally, the apparatus 1 may further comprise a user interface, which may comprise a Display (Display), an input unit such as a Keyboard (Keyboard), and optionally a standard wired interface, a wireless interface. Alternatively, in some embodiments, the display may be an LED display, a liquid crystal display, a touch-sensitive liquid crystal display, an OLED (Organic Light-Emitting Diode) touch device, or the like. The display may also be referred to as a display screen or a display unit, where appropriate, for displaying information processed in the intelligent text editing apparatus 1 and for displaying a visual user interface.
Fig. 2 shows only the intelligent text editing apparatus 1 with the components 11 to 14 and the intelligent text editing program 01; those skilled in the art will understand that the structure shown in fig. 2 does not constitute a limitation of the apparatus 1, which may include fewer or more components than shown, combine certain components, or arrange the components differently.
In the embodiment of the apparatus 1 shown in fig. 2, an intelligent text editing program 01 is stored in the memory 11; the following steps are implemented when the processor 12 executes the intelligent text editing program 01 stored in the memory 11:
step one, receiving a correct text set and an error text set, carrying out preprocessing operation on the error text set to obtain a standard error text set, and establishing a corresponding label set for the correct text set and the standard error text set.
In a preferred embodiment of the present invention, the correct text set and the error text set contain the same text data, but the error text set contains errors such as wrong words or grammatical ambiguities, while the correct text set contains no such errors.
Further, in a preferred embodiment of the present invention, the preprocessing operation includes: performing word segmentation processing on the error text set to obtain word segmentation results; performing punctuation proofreading on the error text set according to the punctuation proofreading rules, to obtain and label the set of erroneous punctuation marks in the text; and checking the word continuation relations near the target word strings by establishing an N-gram model using the binary word continuation relation, to obtain and label the erroneous word string set of the error text set. The specific implementation steps are as follows:
a. Performing word segmentation processing on the error text set to obtain a word segmentation result.
In the preferred embodiment of the invention, word segmentation processing is performed on the error text set through a Markov model to obtain a word segmentation result.
The Markov model is a statistical model widely applied in natural language processing, in fields such as speech recognition, automatic part-of-speech tagging, phonetic-to-character conversion, and probabilistic grammar. In the preferred embodiment of the present invention, a sentence in the error text set is preset as S; the sentence S is segmented with a full segmentation method to obtain all possible Chinese word segmentation modes; the probability of each segmentation mode is calculated according to the Markov property; and the segmentation result of the mode with the highest probability is selected as the final text word segmentation result.
The Markov property means that the probability of the ith word appearing in the text is related only to the n−1 words appearing before it, not to the words that follow it. Thus, for the sentence S formed by the word sequence $\{W_1, W_2, \ldots, W_m\}$, the probability that the ith word $W_i$ appears, given that the preceding words have appeared, is:
$$P(W_i \mid W_1, \ldots, W_{i-1}) = P(W_i \mid W_{i-n+1}, \ldots, W_{i-1})$$
Therefore, the probability of the sentence S arranged in this word order is:
$$P(S) = P(W_1 W_2 \ldots W_m) = P(W_1)\,P(W_2 \mid W_1) \cdots P(W_m \mid W_{m-n+1}, \ldots, W_{m-1})$$
where the conditional probability $P(W_m \mid W_{m-n+1}, \ldots, W_{m-1})$ represents the probability that $W_m$ appears given that the string $W_{m-n+1}, \ldots, W_{m-1}$ has appeared; it is determined with a binary (bigram) language model trained on a large-scale corpus, so the probability model of the sentence S is:
$$P(S) = P(W_1) \prod_{i=2}^{m} P(W_i \mid W_{i-1})$$
The invention selects, among all the calculated values of P(S), the word segmentation corresponding to the maximum of P(S) as the word segmentation result of the scheme:
$$\hat{W} = \arg\max_{W_1 \ldots W_m} P(W_1 W_2 \ldots W_m)$$
b. Performing punctuation proofreading on the error text set according to the punctuation proofreading rules, to obtain and label the set of erroneous punctuation marks in the error text set.
In the preferred embodiment of the invention, punctuation marks in the error text set are proofread with a punctuation-driven method that targets specific error types, applies preset rules, scans in multiple passes, and combines context.
In detail, the invention proofreads the error text set sentence by sentence, paragraph by paragraph, and over the full text by constructing a local analyzer. Preferably, the principle of the local analyzer is as follows: the error text set is divided into single sentences according to the punctuation marks, and these sentences are input into the local analyzer in text order. If a sentence conforms to the language rules within the local range, it passes normally; if a local anomaly is found, the analyzer refuses to accept it and judges the text erroneous; this continues until the whole error text set has been input into the local analyzer. For each punctuation mark appearing in the text, the local analyzer determines which type the mark belongs to, judges with the corresponding proofreading rule whether the mark is erroneous, and stores an error correction suggestion in the error correction suggestion buffer. The proofreading rules are as follows:
When the punctuation mark being proofread is a comma: if a punctuation mark other than a quotation mark stands at the position immediately before the comma, or immediately after it, the mark is shown in italics in the text to indicate an error, and an error correction suggestion ("redundant punctuation; delete this mark") is stored in the buffer. Proofreading then continues sequentially downward through the punctuation marks.
When the punctuation mark being proofread is a pause mark (the Chinese enumeration comma), the judgment uses automatic word segmentation and part-of-speech tagging combined with context information. If the words immediately before and after the pause mark are both numerals, the mark is shown in italics in the text to indicate an error, and an error correction suggestion ("redundant punctuation; delete the pause mark") is stored in the buffer. Proofreading then continues sequentially downward through the punctuation marks.
When the punctuation mark being proofread is an ellipsis, the following three cases are considered:
(1) if the ellipsis is immediately preceded by a punctuation mark other than "。", "!", or "?", that mark is shown in italics in the text to indicate an error, and the error correction suggestion ("redundant punctuation; delete the preceding mark") is stored in the buffer;
(2) if a punctuation mark immediately follows the ellipsis, that mark is shown in italics in the text to indicate an error, and the error correction suggestion ("redundant punctuation; delete the following mark") is stored in the buffer;
(3) if the ellipsis is followed by one of the expressions meaning "etc.", "and so on", or "the like", the ellipsis is shown in italics in the text to indicate an error, and the error correction suggestion ("redundant punctuation; delete the ellipsis") is stored in the buffer. Proofreading then continues sequentially downward through the punctuation marks.
For the error correction suggestions stored in the error correction suggestion buffer under these punctuation proofreading rules, the preferred embodiment of the present invention, once punctuation proofreading is completed, displays the corresponding erroneous punctuation marks in the interface in the order in which the errors occur in the sentences, thereby obtaining the set of erroneous punctuation marks in the error text set.
c. Checking the word continuation relations near the target word strings of the error text set by establishing an N-gram model using the binary word continuation relation, to obtain and label the erroneous word string set of the error text set.
The continuation relation refers to the adjacency relation between words. The binary continuation relation refers to examining, in the character string $z_1 z_2 z_3 \ldots z_{i-1} z_i \ldots z_n$, the adjacency relations of $z_i$ with its neighboring words: according to the N-gram model of corpus linguistics, the bigram model obtained when N = 2 only needs to consider the relation between $z_{i-1}$ and $z_i$ and the relation between $z_i$ and $z_{i+1}$. The invention analyzes and processes a large-scale corpus; when $p(z_i \mid z_{i-1})$ satisfies a certain threshold, $z_{i-1}$ and $z_i$ are judged continuous, and from the result of this continuation judgment it is identified whether the character string $z_i$ is erroneous. The preferred embodiment of the present invention first checks the continuation relation between $z_{i-1}$ and $z_i$; if they are not continuous, it then checks the relation between $z_i$ and $z_{i+1}$; if that relation is also not continuous, the character string $z_i$ is judged erroneous.
In detail, the preferred embodiment of the present invention presets a sentence in the error text set as $S = z_1 z_2 z_3 \ldots z_{i-1} z_i \ldots z_n$, where $z_i$ and $z_{i+1}$ are two adjacent character strings, the capacity of the Chinese corpus is N, the number of times $z_i$ and $z_{i+1}$ appear adjacent is $r(z_i, z_{i+1})$, and the numbers of independent occurrences of $z_i$ and $z_{i+1}$ are $r(z_i)$ and $r(z_{i+1})$ respectively. The probabilities of $z_i$ and $z_{i+1}$ occurring independently are then:
$$p(z_i) = r(z_i)/N, \qquad p(z_{i+1}) = r(z_{i+1})/N;$$
and the co-occurrence probability of $z_i$ and $z_{i+1}$ as neighbors is:
$$p(z_i, z_{i+1}) = r(z_i, z_{i+1})/N.$$
When $r(z_i, z_{i+1}) = N \cdot p(z_i, z_{i+1}) \geq \tau$, $z_i$ and $z_{i+1}$ have a high co-occurrence frequency and are judged continuous, indicating that the word string $z_i$ is correct; conversely, when $r(z_i, z_{i+1}) = N \cdot p(z_i, z_{i+1}) < \tau$, the word string $z_i$ is erroneous. Here $\tau$ is a threshold, preset to $\tau = 0.8$. Preferably, the invention obtains the erroneous word string set of the error text set by a traversal check over the error text set.
Further, in the preferred embodiment of the present invention, a standard error text set is obtained according to the error punctuation mark set and the error string set obtained by the preprocessing.
And step two, converting the correct text set and the standard error text set into word vectors through a bag-of-words model, and storing the word vectors as a training set in a corpus.
The bag-of-words model represents text as feature vectors; its basic idea is that, for a given text, word order, grammar, and syntax are ignored, and the text is treated only as a collection of words.
In detail, converting the correct text set and the standard error text set into word vectors through the bag-of-words model in the preferred embodiment of the present invention includes:
A. Calculating the distance between the data objects of the correct text set and the standard error text set with the Euclidean formula.
Preset $x_i$ and $x_j$ as data objects of the correct text set and the standard error text set respectively, and D as the number of attributes of those data objects. The Euclidean formula is:
$$d(x_i, x_j) = \sqrt{\sum_{k=1}^{D} (x_{ik} - x_{jk})^2}$$
B. Presetting n clusters according to a clustering algorithm, where the cluster center of the kth cluster is $Center_k$, a vector containing the attributes of the data objects. The formula for $Center_k$ is:
$$Center_k = \frac{1}{|C_k|} \sum_{x_i \in C_k} x_i$$
where $|C_k|$ denotes the number of data objects in the kth cluster.
Further, using the Euclidean formula and the $Center_k$ update formula, the invention calculates the distance from each data object of the correct text set and the standard error text set to each of the n cluster centers, and obtains the features of each data object at each cluster center.
C. Training the features with a classifier, and calculating the probability of each data object of the correct text set and the standard error text set at each cluster center, thereby converting the correct text set and the standard error text set into word vectors.
The classifier is a naive Bayes classifier: a family of simple probabilistic classifiers that apply Bayes' theorem under a strong (naive) assumption of independence between the features.
In a preferred embodiment of the present invention, the probability of each data object of the correct text set and the standard error text set at a cluster center is calculated as follows:
Assume independence between the features, with a preset data sample $x = (x_1, x_2, \ldots, x_d)^{T}$. The probability of the data belonging to the cluster center $w_i$ is:
$$P(w_i \mid x) \propto P(w_i) \prod_{k=1}^{d} P(x_k \mid w_i)$$
where d is the feature dimension of the data in the preset data sample and $x_k$ is the value of the sample on the kth feature.
The data in the preset data sample are smoothed with the following formula to avoid data sparseness:
$$P(x_k \mid w_i) = \frac{|D_{i,x_k}| + \alpha}{|D_i| + \alpha c_k}$$
where $c_k$ represents the number of possible values of the kth feature and $\alpha$ is a coefficient.
Maximum likelihood estimation gives:
$$P(x_k \mid w_i) = \frac{|D_{i,x_k}|}{|D_i|}$$
where the numerator $|D_{i,x_k}|$ represents the number of samples in the set $D_i$ of cluster center $w_i$ whose kth feature takes the value $x_k$.
And step three, inputting the training set and the label set into a pre-constructed intelligent text editing model, training the intelligent text editing model with the training set to obtain a training value, inputting the training value and the label set into the loss function of the intelligent text editing model to obtain a loss function value, and exiting the training of the intelligent text editing model when the loss function value is smaller than a preset threshold value.
In a preferred embodiment of the present invention, the intelligent text editing model includes a convolutional neural network. A convolutional neural network is a feedforward neural network whose artificial neurons respond to surrounding units within a limited coverage range. Its basic structure comprises two kinds of layers. One is the feature extraction layer: the input of each neuron is connected to a local receptive field of the previous layer, and the local features are extracted; once a local feature is extracted, its positional relation to the other features is also determined. The other is the feature mapping layer: each computation layer of the network is composed of multiple feature maps, each feature map is a plane, and all neurons in a plane have equal weights.
In a preferred embodiment of the present invention, the convolutional neural network comprises an input layer, a convolutional layer, a pooling layer, and an output layer. The input layer of the convolutional neural network model receives the training set and the label set, and a convolution operation is performed on the training set through a group of filters preset in the convolutional layer to extract feature vectors; the filters may be $\{filter_0, filter_1\}$, generating feature sets on similar channels and dissimilar channels respectively. The pooling layer then performs a pooling operation on the feature vectors; the pooled feature vectors are input to the fully connected layer and are normalized and computed through an activation function to obtain a training value; the computation result is input to the output layer, which outputs correct text data. The normalization "compresses" a K-dimensional vector containing arbitrary real numbers into another K-dimensional real vector in which every element lies in (0, 1) and all elements sum to 1.
In the embodiment of the present invention, the activation function is the softmax function, calculated as:
$$O_j = \frac{e^{I_j}}{\sum_{k=1}^{t} e^{I_k}}$$
where $O_j$ represents the correct-text-data output value of the jth neuron of the convolutional neural network output layer, $I_j$ represents the input value of the jth neuron of the output layer, t represents the total number of neurons in the output layer, and e is the base of the natural logarithm.
In a preferred embodiment of the present invention, the preset threshold for the loss function value is 0.01, and the loss function is the least squares method:
$$s = \sum_{i=1}^{k} (y_i - y_i')^2$$
where s is the error value between the output correct text data and the erroneous text data, k is the number of text sets, $y_i$ is the erroneous text data, and $y_i'$ is the correct text data.
And step four, receiving text data input by a user, intelligently editing the text data input by the user with the intelligent text editing model, and outputting the corresponding correct text data.
The preferred embodiment of the invention uses the intelligent text editing model to automatically correct and edit the text data input by the user, obtaining corrected text data; it can output both the text data with correction marks and the correct text data.
Alternatively, in other embodiments, the intelligent text editing program may be divided into one or more modules, which are stored in the memory 11 and executed by one or more processors (in this embodiment, the processor 12) to implement the present invention.
For example, referring to fig. 3, which shows a schematic diagram of the program modules of the intelligent text editing program in an embodiment of the intelligent text editing apparatus of the present invention, the intelligent text editing program may be divided into a text preprocessing module 10, a model training module 20, and an intelligent text editing module 30, exemplarily:
the keyword received text preprocessing module 10 is configured to: receiving a correct text set and an error text set, preprocessing the error text set to obtain a standard error text set, establishing a corresponding label set for the correct text set and the standard error text set, converting the correct text set and the standard error text set into word vectors through a word bag model, and storing the word vectors as a training set in a corpus.
The model training module 20 is configured to: input the training set and the label set into a pre-constructed intelligent text editing model, train the model with the training set to obtain a training value, input the training value and the label set into the loss function of the model to obtain a loss function value, and exit training of the model when the loss function value is smaller than a preset threshold value.
The intelligent text editing module 30 is configured to: receive text data input by a user, intelligently edit the text data input by the user with the intelligent text editing model, and output the corresponding correct text data.
The functions or operation steps implemented by the program modules such as the text preprocessing module 10, the model training module 20, and the intelligent text editing module 30 when executed are substantially the same as those of the above embodiments and are not repeated here.
Furthermore, an embodiment of the present invention provides a computer-readable storage medium storing an intelligent text editing program, where the program is executable by one or more processors to implement the following operations:
receiving text data input by a user, intelligently editing the text data input by the user with the intelligent text editing model, and outputting the corresponding correct text data.
The embodiment of the computer-readable storage medium of the present invention is substantially the same as the embodiments of the intelligent text editing apparatus and method described above and will not be repeated here.
It should be noted that the above numbering of the embodiments of the present invention is merely for description and does not represent the merits of the embodiments. The terms "comprises", "comprising", and any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, apparatus, article, or method that comprises a list of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, apparatus, article, or method.
Through the above description of the embodiments, those skilled in the art will clearly understand that the method of the above embodiments can be implemented by software plus a necessary general-purpose hardware platform, and certainly also by hardware alone, although in many cases the former is the preferred implementation. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product stored in a storage medium as described above (e.g., ROM/RAM, magnetic disk, optical disk), including instructions for enabling a terminal device (e.g., a mobile phone, a computer, a server, or a network device) to execute the method according to the embodiments of the present invention.
The above description is only a preferred embodiment of the present invention and is not intended to limit the scope of the present invention. All equivalent structural or process modifications made using the contents of the present specification and drawings, whether applied directly or indirectly in other related fields, are likewise included within the scope of the present invention.

Claims (6)

1. An intelligent text editing method, characterized by comprising the following steps:
receiving a correct text set and an error text set, performing a preprocessing operation on the error text set to obtain a standard error text set, and establishing a corresponding label set for the correct text set and the standard error text set, wherein the preprocessing operation comprises: performing word segmentation processing on the error text set to obtain a word segmentation result; performing punctuation correction on the error text set according to punctuation correction rules by using the word segmentation result, to obtain an error punctuation set of the error text set; and performing a word continuation relation check on the error text set by establishing an N-gram model and using word bigram continuation relations, to obtain an error string set of the error text set;
calculating the distance between the data objects of the correct text set and the standard error text set by the Euclidean distance formula, and presetting n clusters according to a clustering algorithm, wherein the cluster center of the kth cluster is Center_k; calculating the distance from each data item of the correct text set and the standard error text set to each of the n cluster centers, and obtaining the feature of each data item with respect to each cluster center;
training a classifier on the features, calculating the probability of each data item with respect to each cluster center, converting the correct text set and the standard error text set into word vectors, and storing the word vectors as a training set in a corpus;
inputting the training set and the label set into a pre-constructed intelligent text editing model, training the intelligent text editing model with the training set to obtain a training value, inputting the training value and the label set into a loss function of the intelligent text editing model to obtain a loss function value, and stopping training of the intelligent text editing model when the loss function value is smaller than a preset threshold;
and receiving text data input by a user, intelligently editing the received text data by using the intelligent text editing model, and outputting the corresponding correct text data.
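By way of illustration only, the word bigram continuation check recited in claim 1 could be sketched as follows; the tiny corpus, the count threshold of 1, and all names are assumptions of this sketch rather than the claimed implementation:

```python
# Sketch of the bigram continuation check: adjacent word pairs whose
# continuation is unattested in the reference corpus mark candidate
# error strings.
from collections import Counter

def build_bigram_counts(corpus_sentences):
    counts = Counter()
    for tokens in corpus_sentences:
        counts.update(zip(tokens, tokens[1:]))
    return counts

def find_error_strings(tokens, bigram_counts, min_count=1):
    # Flag adjacent word pairs seen fewer than min_count times.
    return [(left, right)
            for left, right in zip(tokens, tokens[1:])
            if bigram_counts[(left, right)] < min_count]

corpus = [["the", "cat", "sat"], ["the", "dog", "sat"]]
counts = build_bigram_counts(corpus)
print(find_error_strings(["the", "sat", "cat"], counts))
# [('the', 'sat'), ('sat', 'cat')]
```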
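Likewise, the distance-to-cluster-center features of claim 1 admit a k-means-style sketch; scikit-learn's KMeans and the choice of n = 3 clusters are assumptions, as the claim requires only Euclidean distances from each data item to n preset cluster centers:

```python
# Sketch of the clustering feature step: the Euclidean distance from
# each sample to each of the n cluster centers Center_1 .. Center_n
# serves as that sample's per-center feature.
import numpy as np
from sklearn.cluster import KMeans

data = np.random.rand(100, 20)  # stand-in for the combined text vectors
n = 3                           # preset number of clusters (assumed)
kmeans = KMeans(n_clusters=n, n_init=10, random_state=0).fit(data)

# transform() returns the distance of every sample to each center.
features = kmeans.transform(data)
print(features.shape)  # (100, 3)
```

A classifier trained on these per-center features can then produce the per-center probabilities that claim 1 goes on to recite.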
2. The intelligent text editing method according to claim 1, wherein the word segmentation process comprises:
segmenting the error text set by using a full segmentation method to obtain a plurality of word segmentation modes;
and calculating the probability of each word segmentation mode according to a Markov model, and selecting the word segmentation mode with the highest probability as the word segmentation result of the error text set.
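An illustrative sketch of claim 2 follows: full segmentation enumerates every split of the input permitted by a vocabulary, and a first-order Markov chain scores each candidate so that the most probable one is kept. The toy vocabulary and transition probabilities are invented for the example:

```python
# Sketch of full segmentation plus Markov scoring. VOCAB and TRANS are
# toy stand-ins for a real dictionary and a corpus-estimated model.
VOCAB = {"研究", "研究生", "生命", "命", "起源"}
TRANS = {("<s>", "研究"): 0.5, ("研究", "生命"): 0.3, ("生命", "起源"): 0.4,
         ("<s>", "研究生"): 0.4, ("研究生", "命"): 0.01, ("命", "起源"): 0.1}

def segmentations(text):
    # Full segmentation: recursively split off every vocabulary prefix.
    if not text:
        yield []
    for i in range(1, len(text) + 1):
        if text[:i] in VOCAB:
            for rest in segmentations(text[i:]):
                yield [text[:i]] + rest

def markov_score(words, default=1e-6):
    score, prev = 1.0, "<s>"
    for word in words:
        score *= TRANS.get((prev, word), default)  # P(word | previous word)
        prev = word
    return score

best = max(segmentations("研究生命起源"), key=markov_score)
print(best)  # ['研究', '生命', '起源'] beats ['研究生', '命', '起源']
```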
3. The intelligent text editing method of claim 1, wherein training the intelligent text editing model with the training set to obtain a training value comprises:
inputting the training set into an input layer of a convolutional neural network of the intelligent text editing model, and performing a convolution operation on the training set through a group of filters preset in a convolutional layer of the convolutional neural network to extract feature vectors;
and performing a pooling operation on the feature vectors by using a pooling layer of the convolutional neural network, inputting the pooled feature vectors to a fully connected layer, and performing normalization and calculation on the pooled feature vectors through an activation function to obtain the training value.
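The layer sequence of claim 3 can be sketched as a small text CNN; the layer sizes, the kernel width, and the use of softmax as the normalizing activation are assumptions, since the claim fixes only the order of convolution, pooling, fully connected layer, and activation:

```python
# Sketch of the claimed structure: filter bank -> pooling -> fully
# connected layer -> normalizing activation producing the training value.
import torch
import torch.nn as nn

class TextEditCNN(nn.Module):
    def __init__(self, embed_dim=128, n_filters=64, n_classes=2):
        super().__init__()
        self.conv = nn.Conv1d(embed_dim, n_filters, kernel_size=3, padding=1)
        self.pool = nn.AdaptiveMaxPool1d(1)         # pooling layer
        self.fc = nn.Linear(n_filters, n_classes)   # fully connected layer

    def forward(self, x):                           # x: (batch, embed_dim, seq_len)
        features = torch.relu(self.conv(x))         # preset filters extract features
        pooled = self.pool(features).squeeze(-1)    # pooled feature vectors
        # softmax plays the role of the normalizing activation function
        return torch.softmax(self.fc(pooled), dim=1)

out = TextEditCNN()(torch.randn(4, 128, 10))
print(out.shape)  # torch.Size([4, 2]); each row sums to 1
```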
4. An intelligent text editing apparatus, comprising a memory and a processor, wherein the memory stores an intelligent text editing program operable on the processor, and the intelligent text editing program, when executed by the processor, implements the following steps:
receiving a correct text set and an error text set, performing a preprocessing operation on the error text set to obtain a standard error text set, and establishing a corresponding label set for the correct text set and the standard error text set, wherein the preprocessing operation comprises: performing word segmentation processing on the error text set to obtain a word segmentation result; performing punctuation correction on the error text set according to punctuation correction rules by using the word segmentation result, to obtain an error punctuation set of the error text set; and performing a word continuation relation check on the error text set by establishing an N-gram model and using word bigram continuation relations, to obtain an error string set of the error text set, wherein the word continuation relation check is performed on words near a target string;
calculating the distance between the data objects of the correct text set and the standard error text set by the Euclidean distance formula, and presetting n clusters according to a clustering algorithm, wherein the cluster center of the kth cluster is Center_k; calculating the distance from each data item of the correct text set and the standard error text set to each of the n cluster centers, and obtaining the feature of each data item with respect to each cluster center;
training a classifier on the features, calculating the probability of each data item with respect to each cluster center, converting the correct text set and the standard error text set into word vectors, and storing the word vectors as a training set in a corpus;
inputting the training set and the label set into a pre-constructed intelligent text editing model, training the intelligent text editing model with the training set to obtain a training value, inputting the training value and the label set into a loss function of the intelligent text editing model to obtain a loss function value, and stopping training of the intelligent text editing model when the loss function value is smaller than a preset threshold;
and receiving text data input by a user, intelligently editing the received text data by using the intelligent text editing model, and outputting the corresponding correct text data.
5. The intelligent text editing apparatus according to claim 4, wherein the word segmentation process comprises:
segmenting the error text set by using a full segmentation method to obtain a plurality of word segmentation modes;
and calculating the probability of each word segmentation mode according to a Markov model, and selecting the word segmentation mode with the highest probability as the word segmentation result of the error text set.
6. A computer-readable storage medium having stored thereon an intelligent text editing program executable by one or more processors to perform the steps of the intelligent text editing method of any one of claims 1 to 3.
CN201910668831.9A 2019-07-23 2019-07-23 Intelligent text editing method and device and computer readable storage medium Active CN110619119B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910668831.9A CN110619119B (en) 2019-07-23 2019-07-23 Intelligent text editing method and device and computer readable storage medium

Publications (2)

Publication Number Publication Date
CN110619119A CN110619119A (en) 2019-12-27
CN110619119B (en) 2022-06-10

Family

ID=68921735

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910668831.9A Active CN110619119B (en) 2019-07-23 2019-07-23 Intelligent text editing method and device and computer readable storage medium

Country Status (1)

Country Link
CN (1) CN110619119B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111626047A (en) * 2020-04-23 2020-09-04 平安科技(深圳)有限公司 Intelligent text error correction method and device, electronic equipment and readable storage medium
CN111985491A (en) * 2020-09-03 2020-11-24 深圳壹账通智能科技有限公司 Similar information merging method, device, equipment and medium based on deep learning

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105045778A (en) * 2015-06-24 2015-11-11 江苏科技大学 Chinese homonym error auto-proofreading method
KR20160054751A (en) * 2014-11-07 2016-05-17 한국전자통신연구원 System for editing a text and method thereof
CN107357775A (en) * 2017-06-05 2017-11-17 百度在线网络技术(北京)有限公司 The text error correction method and device of Recognition with Recurrent Neural Network based on artificial intelligence
CN109766538A (en) * 2018-11-21 2019-05-17 北京捷通华声科技股份有限公司 A kind of text error correction method, device, electronic equipment and storage medium

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20140214401A1 (en) * 2013-01-29 2014-07-31 Tencent Technology (Shenzhen) Company Limited Method and device for error correction model training and text error correction


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant