CN115688685A - Text processing method and device, electronic equipment and storage medium - Google Patents


Info

Publication number
CN115688685A
Authority
CN
China
Prior art keywords
text
sentence
sample
target
modified
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110865873.9A
Other languages
Chinese (zh)
Inventor
姜博然
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
BOE Technology Group Co Ltd
Beijing BOE Technology Development Co Ltd
Original Assignee
BOE Technology Group Co Ltd
Beijing BOE Technology Development Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by BOE Technology Group Co Ltd, Beijing BOE Technology Development Co Ltd filed Critical BOE Technology Group Co Ltd
Priority to CN202110865873.9A priority Critical patent/CN115688685A/en
Publication of CN115688685A publication Critical patent/CN115688685A/en
Pending legal-status Critical Current

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The disclosure provides a text processing method and device, electronic equipment, and a storage medium, and belongs to the technical field of computers. In the embodiment of the disclosure, an input text to be processed can be acquired; a reference text that matches the text to be processed and meets a preset requirement is selected from a preset text library; a target reference sentence similar to a sentence to be modified in the text to be processed is determined in the reference text; and finally, the sentence to be modified is converted according to the target reference sentence by a target sentence conversion model, so that a target recommended sentence corresponding to the sentence to be modified is obtained. Because the sentence to be modified is converted according to the target reference sentence by the sentence conversion model, a target recommended sentence with more accurate wording and an expression that better conforms to reviewers' habits can be obtained. Both the content and the manner of expression of the text can thus be adjusted, a high-quality written text can be obtained without the user manually modifying it, and text processing efficiency is improved.

Description

Text processing method and device, electronic equipment and storage medium
Technical Field
The present disclosure relates to the field of computer technologies, and in particular, to a text processing method and apparatus, an electronic device, and a storage medium.
Background
With the rapid development of computer technologies such as machine learning in recent years, machine learning models are often used to solve various problems. As the demand for text processing grows, Natural Language Processing (NLP) technology is developing ever faster. For example, in work or study, users often need to write high-quality text, i.e., text with a higher level of wording and expression; for people writing in a language other than their native one in particular, such writing is difficult and consumes a lot of time.
In the related art, natural language processing methods mainly focus on the field of grammatical error correction. Therefore, a text processing method capable of adjusting the content and the manner of text expression is urgently needed.
Disclosure of Invention
To overcome the problems in the related art, the present disclosure provides a text processing method, apparatus, electronic device, and storage medium.
According to a first aspect of the present disclosure, there is provided a text processing method, the method including:
acquiring an input text to be processed;
selecting a reference text which is matched with the text to be processed and meets preset requirements from a preset text library;
in the reference text, determining a target reference sentence similar to the sentence to be modified in the text to be processed;
and converting the sentence to be modified according to the target reference sentence according to a target sentence conversion model to obtain a target recommended sentence corresponding to the sentence to be modified.
Optionally, a plurality of sample texts are stored in the preset text library;
selecting a reference text which is matched with the text to be processed and meets a preset requirement from a preset text library, wherein the method comprises the following steps:
determining a target field to which the text to be processed belongs, and determining a value label of each sample text;
classifying the value labels, and taking the sample text of which the processing result meets the preset requirement as a first type text;
screening sample texts belonging to the target field to serve as second type texts;
and taking sample texts which belong to the first type text and the second type text at the same time as the reference text.
Optionally, the determining a target domain to which the text to be processed belongs includes:
acquiring keywords in the text to be processed;
and taking the field matched with the keyword as a target field to which the text to be processed belongs.
Optionally, the method further includes:
splitting the sample text to obtain text fragments corresponding to different content attributes;
and respectively storing the text segments corresponding to the content attributes in the preset text library according to the content attributes.
Optionally, the determining, in the reference text, a target reference sentence similar to the sentence to be modified in the text to be processed includes:
determining a sentence to be modified in the text to be processed;
and screening the reference text according to the sentence to be modified and a preset text screening algorithm to determine a target reference sentence similar to the sentence to be modified.
Optionally, the screening the reference text according to the sentence to be modified and a preset text screening algorithm to determine a target reference sentence similar to the sentence to be modified includes:
screening sentences contained in the reference text by using a first screening algorithm, and determining a first reference sentence similar to the sentence to be modified;
and screening the first reference sentence by using a second screening algorithm, and taking the first reference sentence with the similarity meeting a preset threshold value as the target reference sentence.
Optionally, the method further includes:
obtaining a plurality of sample sentences;
translating the sample sentence according to a preset translation method to obtain a sample translation sentence;
taking the sample sentence and the sample translation sentence as a training sample pair;
and training an initial sentence conversion model by utilizing the training sample pair to obtain the target sentence conversion model.
Optionally, the method further includes:
setting a position embedding parameter in the initial sentence conversion model to a trainable value so as to carry out sample training on the position embedding parameter.
Optionally, the initial sentence conversion model is a Transformer model.
According to a second aspect of the present disclosure, there is provided a text processing apparatus including:
the first acquisition module is used for acquiring an input text to be processed;
the selection module is used for selecting a reference text which is matched with the text to be processed and meets the preset requirement from a preset text library;
the first determining module is used for determining a target reference sentence similar to a sentence to be modified in the text to be processed in the reference text;
and the conversion module is used for converting the sentence to be modified according to the target reference sentence according to a target sentence conversion model to obtain a target recommended sentence corresponding to the sentence to be modified.
Optionally, a plurality of sample texts are stored in the preset text library;
the selecting module is further configured to:
determining a target field to which the text to be processed belongs, and determining a value label of each sample text;
classifying the value labels, and taking the sample text of which the processing result meets the preset requirement as a first type text;
screening sample texts belonging to the target field to serve as second type texts;
and taking sample texts which belong to the first type text and the second type text at the same time as the reference text.
Optionally, the selecting module is further configured to:
acquiring keywords in the text to be processed;
and taking the field matched with the keyword as a target field to which the text to be processed belongs.
Optionally, the apparatus further comprises:
the splitting module is used for splitting the sample text to obtain text fragments corresponding to different content attributes;
and the storage module is used for respectively storing the text segments corresponding to the content attributes into the preset text library according to the content attributes.
Optionally, the first determining module is further configured to:
determining a sentence to be modified in the text to be processed;
and screening the reference text according to the sentence to be modified and a preset text screening algorithm to determine a target reference sentence similar to the sentence to be modified.
Optionally, the first determining module is further configured to:
screening sentences contained in the reference text by using a first screening algorithm, and determining a first reference sentence similar to the sentence to be modified;
and screening the first reference sentence by using a second screening algorithm, and taking the first reference sentence with the similarity meeting a preset threshold value as the target reference sentence.
Optionally, the apparatus further comprises:
the second acquisition module is used for acquiring a plurality of sample sentences;
the translation module is used for translating the sample sentence according to a preset translation method to obtain a sample translation sentence;
a second determining module, configured to use the sample sentence and the sample translation sentence as a training sample pair;
and the training module is used for training the initial sentence conversion model by utilizing the training sample pair so as to obtain the target sentence conversion model.
Optionally, the apparatus further comprises:
and the setting module is used for setting the position embedding parameter in the initial sentence conversion model as a trainable value so as to carry out sample training on the position embedding parameter.
Optionally, the initial sentence conversion model is a Transformer model.
In accordance with a third aspect of the present disclosure, there is provided an electronic device comprising:
a processor;
a memory for storing processor-executable instructions;
wherein the processor is configured to execute the instructions to implement the text processing method of any one of the first aspects.
According to a fourth aspect of the present disclosure, there is provided a non-transitory computer readable storage medium having instructions therein, which when executed by a processor of a mobile terminal, enable the mobile terminal to perform the text processing method according to any one of the first aspect.
According to a fifth aspect of the present disclosure, a computer program product is provided, comprising readable program instructions which, when executed by a processor of a mobile terminal, enable the mobile terminal to perform the steps of the text processing method as in any one of the embodiments described above.
Compared with the related art, the method has the following advantages and positive effects:
the text processing method provided by the embodiment of the disclosure can acquire an input text to be processed, select a reference text which is matched with the text to be processed and meets a preset requirement from a preset text library, determine a target reference sentence which is similar to a sentence to be modified in the text to be processed in the reference text, and finally convert the sentence to be modified according to the target reference sentence according to a target sentence conversion model to obtain a target recommended sentence corresponding to the sentence to be modified. Therefore, the sentence to be modified is converted according to the target reference sentence through the sentence conversion model, and the target recommended sentence with more accurate words and more conforming to the habit of auditors in the expression mode can be obtained, so that the text expression content and the expression mode can be adjusted, a user can obtain a high-quality written text without manually modifying the text, and the text processing efficiency is improved.
The foregoing description is only an overview of the technical solutions of the present disclosure, and the embodiments of the present disclosure are described below in order that the technical means of the present disclosure may be clearly understood, and the foregoing and other objects, features, and advantages of the present disclosure may be more clearly understood.
Drawings
Various other advantages and benefits will become apparent to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings are only for purposes of illustrating the preferred embodiments and are not to be construed as limiting the disclosure. Also, like reference numerals are used to refer to like parts throughout the drawings. In the drawings:
FIG. 1 is a flowchart illustrating steps of a method for processing text according to an embodiment of the present disclosure;
FIG. 2 is a schematic diagram of a text processing method provided by an embodiment of the present disclosure;
fig. 3 is a block diagram of a text processing apparatus provided in an embodiment of the present disclosure;
FIG. 4 is a block diagram illustrating an apparatus for text processing in accordance with an exemplary embodiment;
FIG. 5 is a block diagram illustrating another apparatus for text processing in accordance with an example embodiment.
Detailed Description
Exemplary embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
Fig. 1 is a flowchart of steps of a text processing method provided in an embodiment of the present disclosure, and as shown in fig. 1, the method may include:
step 101, an input text to be processed is obtained.
In the embodiment of the present disclosure, the text to be processed may be a text whose expression needs to be adjusted; acquiring the input text to be processed may mean taking a text that the user has selected for adjustment as the text to be processed. For example, an "upload paper" button is displayed on the display interface; by clicking the button, the user can select and upload the text that needs to be adjusted, and accordingly that text is determined to be the text to be processed.
And 102, selecting a reference text which is matched with the text to be processed and meets preset requirements from a preset text library.
In the embodiment of the present disclosure, a text matched with the text to be processed may be a text that shares an attribute with the text to be processed; for example, the attribute may be the technical field to which the text belongs, a keyword contained in the text, or the journal in which the text was published. The preset requirement may be that the texts are sorted along a certain dimension and the top N texts are taken as the texts meeting the preset requirement, where N is a positive integer. For example, if the texts are sorted by citation count, the texts meeting the preset requirement may be the N most-cited texts; if they are sorted by the rating of the journal in which they were published, the texts meeting the preset requirement may be the N texts whose journals have the highest ratings. The disclosure is not limited in this respect.
In the embodiment of the disclosure, selecting a reference text that matches the text to be processed and meets the preset requirement may be done in any of three ways: first select the texts that match the text to be processed and then, among them, select the texts meeting the preset requirement as the reference texts; first select the texts meeting the preset requirement and then, among them, select the matching texts as the reference texts; or select both sets simultaneously and take the texts that both match and meet the preset requirement as the reference texts.
It should be noted that the preset text library may store pre-collected texts; such a text may be an article published in a specified journal, a document belonging to a certain technical field, a paper written by a certain author, and the like, which the disclosure does not limit. Compared with the text to be processed, the texts stored in the preset text library have usually been reviewed by professionals, so their wording is more accurate and their manner of expression conforms better to reviewers' habits.
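As an illustration of this selection step, the sketch below assumes each library entry carries hypothetical `domain` and `citations` fields (both names are assumptions, not part of the disclosure) and keeps the N most-cited texts in the target field:

```python
def select_reference_texts(library, target_domain, n):
    """Keep texts matching the target domain, then the n most-cited among them."""
    matched = [t for t in library if t["domain"] == target_domain]
    matched.sort(key=lambda t: t["citations"], reverse=True)
    return matched[:n]

# Toy library entries for illustration.
library = [
    {"title": "A", "domain": "nlp", "citations": 120},
    {"title": "B", "domain": "cv",  "citations": 300},
    {"title": "C", "domain": "nlp", "citations": 45},
]
reference_texts = select_reference_texts(library, "nlp", n=2)
```

The two filters compose in either order, matching the three selection orders described above.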
And 103, determining a target reference sentence similar to the sentence to be modified in the text to be processed in the reference text.
In the embodiment of the disclosure, the similarity between each sentence in the reference text and the sentence to be modified may be calculated respectively, and a sentence whose similarity is larger than a preset threshold may be used as a target reference sentence similar to the sentence to be modified. The similarity may be determined according to the similarity of expression content, or according to the similarity of the manner of expression, which the disclosure does not limit.
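A minimal sketch of this thresholded similarity screening, assuming sentences have already been mapped to embedding vectors by some model (the vectors below are toy values):

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def target_reference_sentences(reference_vecs, modify_vec, threshold=0.8):
    """Indices of reference sentences whose similarity to the sentence
    to be modified exceeds the preset threshold."""
    return [i for i, v in enumerate(reference_vecs)
            if cosine_similarity(v, modify_vec) > threshold]

vecs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
hits = target_reference_sentences(vecs, [1.0, 0.0], threshold=0.8)
```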
And 104, converting the sentence to be modified according to the target reference sentence according to a target sentence conversion model to obtain a target recommended sentence corresponding to the sentence to be modified.
In the embodiment of the present disclosure, the target sentence conversion model may be obtained by iteratively training an initial sentence conversion model on training sample pairs of sample sentences and sample translation sentences. Through continuous iterative training, the sentence conversion model can learn to convert sentences into a target manner of expression while keeping the expression content unchanged. Therefore, in the embodiment of the disclosure, the trained sentence conversion model can convert the sentence to be modified according to the target reference sentence, obtaining the target recommended sentence corresponding to the sentence to be modified.
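The pair-construction step described here can be sketched as follows; the `translate` callable is a stand-in for the preset translation method (a trivial upper-casing function below, purely for illustration):

```python
def build_training_pairs(sample_sentences, translate):
    """Pair each sample sentence with its translation produced by a preset
    translation method; the pairs train the sentence conversion model."""
    return [(s, translate(s)) for s in sample_sentences]

# Stand-in "translation" for illustration only.
pairs = build_training_pairs(["a b", "c d"], lambda s: s.upper())
```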
To sum up, the text processing method provided by the embodiment of the present disclosure may obtain an input text to be processed, select from a preset text library a reference text that matches the text to be processed and meets a preset requirement, determine in the reference text a target reference sentence similar to a sentence to be modified in the text to be processed, and finally convert the sentence to be modified according to the target reference sentence by a target sentence conversion model to obtain a target recommended sentence corresponding to the sentence to be modified. Because the sentence to be modified is converted according to the target reference sentence by the sentence conversion model, a target recommended sentence with more accurate wording and an expression that better conforms to reviewers' habits can be obtained; thus both the content and the manner of expression of the text can be adjusted, a high-quality written text can be obtained without the user manually modifying it, and text processing efficiency is improved.
Optionally, in the embodiment of the present disclosure, a plurality of sample texts are stored in the preset text library, where the sample texts may be articles published in a specified journal, for example, all articles published in an academic journal, or articles published in a foreign journal, and the operation of selecting, in the preset text library, a reference text that matches the to-be-processed text and meets a preset requirement may specifically include:
step 1021, determining a target field to which the text to be processed belongs, and determining a value label of each sample text.
In the embodiment of the present disclosure, the target field to which the text to be processed belongs may be determined from the content of the text to be processed; alternatively, a field already marked in the text to be processed may be taken as the target field, or a specified field input by the user may be obtained and taken as the field to which the text to be processed belongs.
In the embodiment of the disclosure, the value label of a sample text may be determined according to the grade of the journal in which the sample text was published. Specifically, the journal in which the sample text was published is determined first, for example, the sample was published in journal X; the value label of the sample text is then determined according to the evaluation parameter corresponding to that journal. For example, if the evaluation parameter of journal X is 7, the value label of the sample text is 7.
And 1022, classifying the value label, and taking the sample text of which the processing result meets the preset requirement as a first type text.
In the embodiment of the present disclosure, the value tags may be classified using a preset classification algorithm, and the preset classification algorithm may be a knowledge-graph-based recommendation model (the KGCN model), a variant of the Graph Convolutional Network (GCN). Specifically, articles from multiple academic journals can be randomly selected, and a corresponding knowledge graph constructed for each article, that is, a knowledge graph obtained by decomposing the article according to each content attribute. Value tags are then set according to the evaluation parameter corresponding to each partition of the academic journals; for example, the value tag corresponding to the first partition is 10, that of the second partition 7, that of the third partition 4, and that of the fourth partition 2. The articles and the value tag corresponding to each article are input into the KGCN model for training. The KGCN model extracts the parts that have a large numerical influence on the final value tag by propagating over the knowledge graph: it repeatedly aggregates a certain range of neighboring nodes and uses the aggregated vector to replace the vector representation of the current node. The aggregation formula can be as follows:
agg_sum = σ(W · (v + v_S(v)) + b)
where agg_sum may be the aggregation formula for calculating the numerical influence of each text segment on the final value label; σ may be a nonlinear function; u may denote the embedding vector of the domain to which the article belongs; v may denote the embedding vector of a training-set article; v_S(v) may denote the embedding vector of the neighborhood of the article to be trained; and S(v) can control the number of neighboring nodes of the article to be trained, for example, S(v) can be defined as a hyperparameter K, e.g., K = 3. W may be the weight of the fully connected layer and b may be the bias.
Further, the inner product between the embedding vector u of the article's domain and the embedding vector v of the article can be computed to obtain a probability f(u, v), and the probability calculation formula can be:
y_uv = f(u, v)
By training this probability against the loss over the value labels, the probability f(u, v) can be maximized. In order to improve the accuracy of text processing, the loss function in the KGCN model may be changed to a variance in the embodiment of the present disclosure.
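Under the reading of the aggregation formula above, a sketch of the sum aggregator and the inner-product score in NumPy (dimensions, weights, and random inputs are illustrative assumptions, not values from the disclosure):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def agg_sum(v, neighbor_vecs, W, b):
    """Sum aggregator: combine the node vector v with the mean embedding of
    its K sampled neighbors, then apply a dense layer and a nonlinearity."""
    v_neigh = neighbor_vecs.mean(axis=0)   # v_S(v)
    return sigmoid(W @ (v + v_neigh) + b)  # sigma(W (v + v_S(v)) + b)

def score(u, v_agg):
    """y_uv = f(u, v): inner product of domain and article embeddings."""
    return float(u @ v_agg)

rng = np.random.default_rng(0)
d = 4
v = rng.normal(size=d)
neighbors = rng.normal(size=(3, d))        # K = 3 sampled neighbors
W, b = rng.normal(size=(d, d)), rng.normal(size=d)
out = agg_sum(v, neighbors, W, b)
y = score(rng.normal(size=d), out)
```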
In the embodiment of the present disclosure, the preset classification algorithm may also be a random forest model from conventional machine learning. Specifically, articles from a plurality of academic journals can be selected, a corresponding knowledge graph constructed for each article, and the value label of each article determined. Feature information can then be selected according to the attributes of the articles; the feature information can be any one or more of the number of times an article is cited, the impact factor of its journal, the article's keywords, and the like. Each article, together with its corresponding value label and feature information, is input into the random forest model, and the result obtained by fitting is a value ordering of the articles.
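A minimal sketch of this random-forest variant, assuming scikit-learn is available; the feature values (citation count, journal impact factor, keyword count) and value labels below are invented for illustration:

```python
from sklearn.ensemble import RandomForestRegressor

# Hypothetical per-article features: [citations, impact_factor, n_keywords].
X = [
    [120, 7.2, 5],
    [300, 9.1, 6],
    [45,  2.3, 4],
    [10,  1.1, 3],
]
y = [7, 10, 4, 2]  # value labels

model = RandomForestRegressor(n_estimators=50, random_state=0)
model.fit(X, y)

# Rank articles by predicted value to obtain the value ordering.
predicted = model.predict(X)
ranking = sorted(range(len(X)), key=lambda i: predicted[i], reverse=True)
```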
In the embodiment of the disclosure, taking the sample texts whose processing result meets the preset requirement as the first type texts may mean: after classifying the value tags, a value ordering of the plurality of sample texts is obtained, and, according to that ordering, the ten sample texts with the highest value are taken as the first type texts.
And 1023, screening the sample texts belonging to the target field to serve as second type texts.
In the embodiment of the present disclosure, the field to which each sample text belongs may be determined first, and then the sample text in which the field belongs is the target field may be used as the second type text. The determining of the field to which each sample text belongs may be extracting a keyword from the sample text, determining a corresponding field according to the keyword, and using the field as the field to which the sample text belongs.
Step 1024, using sample texts belonging to the first type of text and the second type of text at the same time as the reference text.
In the embodiment of the present disclosure, the first type texts may be determined first and the second type texts then determined among them, so as to obtain the sample texts that belong to both types and use them as the reference texts; alternatively, the second type texts may be determined first and the first type texts then determined among them, likewise obtaining the sample texts that belong to both types as the reference texts. The disclosure is not limited in this respect.
In the embodiment of the disclosure, the target field to which the text to be processed belongs is determined, the value labels of the sample texts are determined, the value labels are classified, the sample text with the processing result meeting the preset requirement is used as the first type text, the sample text belonging to the target field is screened and used as the second type text, and finally, the sample text belonging to both the first type text and the second type text is used as the reference text, so that the sample text belonging to the same field as the text to be processed and having a higher text value can be screened out, and the accuracy of model training can be improved by using the sample text as the training sample.
Optionally, the operation of determining the target field to which the text to be processed belongs in the embodiment of the present disclosure may specifically include:
and (1) acquiring keywords in the text to be processed.
In the embodiment of the disclosure, when keywords are already recorded in the text to be processed, those keywords can be used directly as the keywords of the text to be processed. When no keywords are recorded, the text to be processed may be analyzed with a preset recognition algorithm and the recognized keywords used as its keywords; the preset recognition algorithm may be Named Entity Recognition (NER), which extracts keywords from the text to be processed using basic tools such as information extraction, syntactic analysis, and machine translation.
And (2) taking the field matched with the keyword as a target field to which the text to be processed belongs.
In the embodiment of the disclosure, different fields can be matched with different keywords in advance; the field matched with the keywords in the text to be processed is then determined and taken as the target field of the text to be processed. It should be noted that when many keywords are obtained from the text to be processed, for example 11 keywords, the first three keywords may be selected to determine the matching field. Because keywords are usually ordered by their importance in the text, selecting the first three keywords to determine the matching field can, to a certain degree, improve the accuracy of determining the field to which the text to be processed belongs; the matched field may also be required to contain all three keywords simultaneously.
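The keyword-to-field matching can be sketched as below; the per-domain keyword sets are hypothetical:

```python
def match_target_domain(keywords, domain_keywords, top_k=3):
    """Match the top_k most important keywords against per-domain keyword
    sets and return the domain with the most hits."""
    picked = set(keywords[:top_k])  # keywords assumed ordered by importance
    best_domain, best_hits = None, 0
    for domain, kws in domain_keywords.items():
        hits = len(picked & kws)
        if hits > best_hits:
            best_domain, best_hits = domain, hits
    return best_domain

domain_keywords = {
    "nlp":    {"text", "sentence", "translation"},
    "vision": {"image", "pixel", "camera"},
}
domain = match_target_domain(["sentence", "text", "camera", "pixel"],
                             domain_keywords)
```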
Optionally, in an implementation manner, the following steps may also be performed in the embodiment of the present disclosure:
and S11, splitting the sample text to obtain text segments corresponding to different content attributes.
In the embodiment of the present disclosure, the content attribute may be related information such as an author, a title, an abstract, a keyword, a field, a publication location, and a publication journal of the sample text. The sample text is split to obtain text segments corresponding to different content attributes, or the sample text is decomposed according to each content attribute to obtain a text segment corresponding to each content attribute.
And a substep S12 of respectively storing the text segments corresponding to the content attributes in the preset text library according to the content attributes.
In the embodiment of the disclosure, after a plurality of sample texts stored in a preset text library are decomposed, classification is performed according to each content attribute, so that text segments belonging to the same content attribute in each sample text can be obtained, and the text segments are respectively stored in the preset text library according to the content attribute.
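Sub-steps S11 and S12 can be sketched as below. The attribute names and the dict-based "library" are illustrative assumptions, not the patent's actual storage scheme; the point is only that each sample is decomposed per attribute and filed under that attribute.

```python
from collections import defaultdict

# Illustrative content attributes; the disclosure also mentions
# publication location and publication journal.
ATTRIBUTES = ["author", "title", "abstract", "keywords", "field"]

def split_sample(sample: dict) -> dict:
    """Sub-step S11: one text segment per content attribute."""
    return {attr: sample.get(attr, "") for attr in ATTRIBUTES}

def store_segments(library: dict, sample: dict) -> None:
    """Sub-step S12: store each segment under its content attribute."""
    for attr, segment in split_sample(sample).items():
        library[attr].append(segment)

# Build a toy text library from one sample text.
library = defaultdict(list)
store_segments(library, {"author": "A. Smith", "title": "On Texts", "abstract": "..."})
```

Grouping segments by attribute lets later steps retrieve, say, all abstracts or all keyword lists in one lookup.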
Optionally, in the embodiment of the present disclosure, the operation of determining, in the reference text, a target reference sentence similar to the sentence to be modified in the text to be processed may specifically include:
and step 1031, determining the sentence to be modified in the text to be processed.
In the embodiment of the present disclosure, the sentence to be modified may be a sentence whose expression mode needs to be adjusted or whose expression content needs to be replaced. To determine the sentence to be modified in the text to be processed, a sentence extracted from the text to be processed at an arbitrary sentence position may be used as the sentence to be modified, or a sentence designated by the user in the text to be processed may be used as the sentence to be modified.
And 1032, screening the reference text according to the sentence to be modified and a preset text screening algorithm to determine a target reference sentence similar to the sentence to be modified.
In the embodiment of the present disclosure, the preset text screening algorithm may be a text screening algorithm in a Natural Language Processing (NLP) algorithm, for example, the text screening algorithm may be a TextRank algorithm or a BERT algorithm, which is not limited in the present disclosure. For example, the similarity between each sentence in the reference text and the sentence to be modified may be calculated by using a BERT algorithm, and the sentence with the highest similarity may be used as the target reference sentence similar to the sentence to be modified.
Optionally, in the embodiment of the present disclosure, the operation of screening the reference text according to the sentence to be modified and a preset text screening algorithm to determine a target reference sentence similar to the sentence to be modified specifically includes:
and (3) screening sentences contained in the reference text by using a first screening algorithm, and determining a first reference sentence similar to the sentence to be modified.
In the embodiment of the disclosure, the first filtering algorithm may be a TextRank algorithm, each sentence in the reference text is filtered, a sentence similar to the sentence to be modified is selected, and the sentence is used as the first reference sentence similar to the sentence to be modified. For example, the similarity between each sentence in the reference text and the sentence to be modified may be determined, and the sentence with the similarity greater than 50% may be used as the first reference sentence.
And (4) screening the first reference sentence by using a second screening algorithm, and taking the first reference sentence with the similarity meeting a preset threshold value as the target reference sentence.
In the embodiment of the present disclosure, the second filtering algorithm may be a BERT algorithm, and the first reference sentences whose similarities satisfy a preset threshold are used as the target reference sentences by respectively calculating the similarities between the first reference sentences and the sentences to be modified, for example, the first reference sentences whose similarities are greater than 80% may be used as the target reference sentences.
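The two-stage screening of steps above can be sketched as follows. This is a toy: a cheap word-overlap similarity stands in for both TextRank and BERT (a real implementation would use sentence embeddings for the second stage), while the 50% and 80% thresholds follow the examples in the text.

```python
def word_overlap(a: str, b: str) -> float:
    """Jaccard word overlap as a toy stand-in for a real similarity model."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb) if wa | wb else 0.0

def screen(reference_sentences, to_modify, coarse=0.5, fine=0.8):
    # First screening: keep sentences above the coarse threshold
    # (TextRank in the disclosure) as first reference sentences.
    first = [s for s in reference_sentences if word_overlap(s, to_modify) > coarse]
    # Second screening: keep only candidates clearing the stricter
    # threshold (BERT in the disclosure) as target reference sentences.
    return [s for s in first if word_overlap(s, to_modify) > fine]
```

The cheap first pass prunes most of the reference text so the expensive fine-grained model only scores a few candidates.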
In the embodiment of the disclosure, the sentence to be modified in the text to be processed is determined, and the reference text is screened according to the preset text screening algorithm and the sentence to be modified, so as to determine the target reference sentence similar to the sentence to be modified.
Optionally, in an implementation manner, the following steps may also be performed in the embodiment of the present disclosure:
and a substep S21 of obtaining a plurality of sample sentences.
In the embodiment of the present disclosure, the sample sentence may be a sentence extracted from a preset sample library, may also be a sentence downloaded from the internet, and may also be a sentence designated by a user, which is not limited by the present disclosure. For example, a sentence can be randomly cut from the academic journal as a sample sentence.
And a substep S22, translating the sample sentence according to a preset translation method to obtain a sample translation sentence.
In the embodiment of the disclosure, the preset translation method may be a forward translation and a reverse translation between different languages for the sample sentence. For example, if the sample sentence is French, the sample sentence may be translated from French to English to obtain an English sentence, the English sentence may then be translated from English back to French to obtain a French sentence, and that French sentence is used as the sample translation sentence.
For example, fig. 2 is a schematic diagram of text processing according to an embodiment of the present disclosure. As shown in fig. 2, sentence 1 is a sample sentence in English; sentence 1 is translated from English to Chinese to obtain Chinese sentence 2, and Chinese sentence 2 is then translated from Chinese back to English to obtain English sentence 3, so that sentence 3 can be used as the sample translation sentence.
And a substep S23, using the sample sentence and the sample translation sentence as a training sample pair.
For example, the sample sentence and the sample translation sentence may be used as a training sample pair: a Japanese sentence and the Japanese sentence obtained from it by Japanese-to-English followed by English-to-Japanese translation may be used as a training sample pair, or a German sentence and the German sentence obtained from it by German-to-French followed by French-to-German translation may be used as a training sample pair.
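Sub-steps S21–S23 amount to back-translation data augmentation and can be sketched as below. The `translate` function here is a stub that merely tags the text so the round trip is visible; a real system would call a machine-translation model or API for each direction.

```python
def translate(sentence: str, src: str, dst: str) -> str:
    # Stub translator for illustration only: a real implementation
    # would return actual translated text for the src -> dst direction.
    return f"[{src}->{dst}] {sentence}"

def back_translate_pair(sentence: str, pivot: str = "zh", lang: str = "en"):
    """Round-trip lang -> pivot -> lang; return (sample, sample-translation) pair."""
    pivot_sentence = translate(sentence, lang, pivot)   # forward translation
    paraphrase = translate(pivot_sentence, pivot, lang) # reverse translation
    return (sentence, paraphrase)
```

Because the round trip preserves meaning but usually changes wording, each pair gives the conversion model an example of "same content, different expression".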
And a substep S24 of training the initial sentence conversion model by using the training sample pair to obtain the target sentence conversion model.
In this embodiment of the present disclosure, the initial sentence conversion model may be a Transformer model for text processing. In a specific implementation, the processing steps of the Transformer model may be as follows. The first step: acquire a representation vector X for each word of the input sentence, where X may be obtained by adding the word's embedding (Embedding) vector to the vector of the word's position. The second step: the obtained word representation vector matrix may be passed to an encoder, and after N (e.g., 6) encoding blocks (Encoder blocks), an encoding information matrix C for all words in the sentence may be obtained. The word vector matrix may be denoted X (n × d), where n may be the number of words in the sentence and d may be the dimension of the representation vector, for example d = 512; the matrix dimension output by each encoding block may be exactly consistent with its input. The third step: the encoding information matrix C output by the encoder may be passed to a decoder, which may translate the next word i + 1 according to the currently translated words 1 to i in sequence; in use, when translating word i + 1, the words after i + 1 may be covered by a mask operation, and the predicted word is output.
It should be noted that the Transformer model may include two parts, namely an encoder (Encoder) and a decoder (Decoder), where both the encoder and the decoder may include 6 blocks. An encoding block (Encoder block) may include one multi-head attention mechanism (Multi-Head Attention), and a decoding block (Decoder block) may include two multi-head attention mechanisms, one of which may be used for the masking (Masked) operation. Each multi-head attention mechanism in a decoding block may be followed by a residual-and-normalization (Add & Norm) layer, where Add may denote a residual connection (Residual Connection) for preventing network degradation and Norm may denote layer normalization (Layer Normalization) for normalizing the activation values of each layer. By inputting corpus 1 as the sample sentence and corpus 2 as the sample translation sentence, the Transformer model can be iteratively trained to learn the conversion between corpus 1 and corpus 2, so that a sentence can be converted to the target expression mode on the premise that the expression content is not changed.
Further, a multi-head attention mechanism may be composed of a plurality of self-attention mechanisms (Self-Attention). Unlike in the encoder, the self-attention layer (Self-Attention layer) in the decoder only attends to the preceding information, which is achieved by the mask operation: the inference of each following word (token) in the decoder may be based only on the preceding words, and at time t the decoder cannot know the word at time t + 1. Therefore, to keep training consistent with inference, masking is also applied during training, so that attention is never allocated to the words after the current position. Finally, the decoder stack (Decoder stack) may be followed by a linear (Linear) layer and a normalization (Softmax) layer, which map the vectors to the stylized output words. The linear layer may be a fully connected layer followed by the Softmax layer, so that the probability of converting each input word in one character style (style 1) into each word of another character style (style 2) can be obtained.
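The masked (causal) self-attention described above can be sketched in a few lines. This is a minimal single-head, pure-Python illustration in which position t attends only to positions 0..t; real implementations vectorize this over matrices and use multiple heads.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def masked_attention(q, k, v):
    """q, k, v: lists of equal-length vectors, one per token position.

    Scaled dot-product attention with a causal mask: the scores for
    position t are computed only against positions 0..t, so later
    tokens are effectively masked out.
    """
    d = len(q[0])
    out = []
    for t, qt in enumerate(q):
        scores = [sum(a * b for a, b in zip(qt, k[j])) / math.sqrt(d)
                  for j in range(t + 1)]  # masked: no j > t
        weights = softmax(scores)
        out.append([sum(w * v[j][i] for j, w in enumerate(weights))
                    for i in range(d)])
    return out
```

Restricting the score computation to positions up to t is exactly what keeps training consistent with left-to-right inference.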
It should be noted that the decoding block in the Transformer model of the embodiment of the present disclosure may further include a second multi-head attention mechanism (Multi-Head Attention), in which the encoder output may provide the key (K) and value (V) matrices and the output of the preceding decoder sub-layer may provide the query (Q) matrix. The multi-head attention mechanism differs from a single-head attention mechanism in that it may generate a plurality of Q, K, and V matrices, and the different Q, K, and V matrices expand the characterization capability of the model; the Transformer model in the embodiment of the present disclosure may include 8 attention heads.
In the embodiment of the present disclosure, after the initial sentence conversion model is trained with the training sample pairs, the target sentence conversion model may be obtained; that is, the target sentence conversion model may be a pre-trained sentence conversion model. The target sentence conversion model may also be a sentence conversion model that is continuously updated, on the basis of the pre-trained model, according to the user's use of the model. Specifically, in the later process of converting the sentence to be modified according to the target reference sentence with the pre-trained sentence conversion model to obtain the target recommended sentence corresponding to the sentence to be modified, the pre-trained model may be further trained on the corpora input by the user, so that the sentence conversion model is continuously updated. By learning from the corpora input in actual use, the training samples of the model can be enlarged, the results output by the model can better meet the user's requirements, and the error rate of the output results can be reduced.
Optionally, in an implementation manner, the following steps may also be performed in the embodiment of the present disclosure:
and a substep S25 of setting the position embedding parameter in the initial sentence conversion model to a trainable value so as to carry out sample training on the position embedding parameter.
In the embodiment of the disclosure, converting the sentence to be modified according to the target reference sentence to obtain the target recommended sentence may involve not only synonymous replacement of words but also adjustment of the word order of the expressed content. However, the position embedding parameter in the initial sentence conversion model is often set to a fixed value and does not participate in training; setting it to a trainable value therefore allows the model to learn these word-order adjustments through sample training.
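For contrast, the standard fixed (non-trainable) position encoding that sub-step S25 replaces is the sinusoidal scheme sketched below. This is an illustrative implementation of the usual formula, not code from the disclosure; the trainable variant would instead initialize the table randomly and update it by gradient descent along with the other model parameters.

```python
import math

def sinusoidal_position_encoding(max_len: int, d_model: int):
    """Fixed position encoding:
    PE[pos][2i]   = sin(pos / 10000**(2i / d_model))
    PE[pos][2i+1] = cos(pos / 10000**(2i / d_model))
    """
    pe = []
    for pos in range(max_len):
        row = []
        for i in range(d_model):
            angle = pos / (10000 ** (2 * (i // 2) / d_model))
            row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        pe.append(row)
    return pe
```

Because these values are computed once from a formula, they receive no gradient updates; marking the table trainable is what lets the conversion model adapt position information to word-order changes.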
Fig. 3 is a block diagram of a text processing apparatus provided in an embodiment of the present disclosure, and as shown in fig. 3, the apparatus 30 may include:
a first obtaining module 301, configured to obtain an input text to be processed;
a selecting module 302, configured to select, from a preset text library, a reference text that matches the text to be processed and meets preset requirements;
a first determining module 303, configured to determine, in the reference text, a target reference sentence similar to a sentence to be modified in the text to be processed;
and the conversion module 304 is configured to convert the sentence to be modified according to the target reference sentence according to a target sentence conversion model, so as to obtain a target recommended sentence corresponding to the sentence to be modified.
To sum up, the text processing apparatus provided in the embodiment of the present disclosure may obtain an input text to be processed, select a reference text that matches the text to be processed and meets a preset requirement from a preset text library, determine a target reference sentence similar to a sentence to be modified in the text to be processed in the reference text, and finally convert the sentence to be modified according to the target reference sentence according to a target sentence conversion model to obtain a target recommended sentence corresponding to the sentence to be modified. Therefore, the sentence to be modified is converted according to the target reference sentence through the sentence conversion model, and the target recommended sentence with more accurate words and more conforming to the habit of the examiner in the expression mode can be obtained, so that the text expression content and the expression mode can be adjusted, a high-quality written text can be obtained without manually modifying the text by a user, and the text processing efficiency is improved.
Optionally, a plurality of sample texts are stored in the preset text library;
the selecting module 302 is further configured to:
determining a target field to which the text to be processed belongs, and determining a value label of each sample text;
classifying the value labels, and taking the sample text of which the processing result meets the preset requirement as a first type text;
screening sample texts belonging to the target field to serve as second type texts;
and taking sample texts which simultaneously belong to the first type of text and the second type of text as the reference text.
Optionally, the selecting module 302 is further configured to:
acquiring keywords in the text to be processed;
and taking the field matched with the keyword as a target field to which the text to be processed belongs.
Optionally, the apparatus 30 further comprises:
the splitting module is used for splitting the sample text to obtain text fragments corresponding to different content attributes;
and the storage module is used for respectively storing the text segments corresponding to the content attributes in the preset text library according to the content attributes.
Optionally, the first determining module 303 is further configured to:
determining a sentence to be modified in the text to be processed;
and screening the reference text according to the sentence to be modified and a preset text screening algorithm to determine a target reference sentence similar to the sentence to be modified.
Optionally, the first determining module 303 is further configured to:
screening sentences contained in the reference text by using a first screening algorithm, and determining a first reference sentence similar to the sentence to be modified;
and screening the first reference sentence by using a second screening algorithm, and taking the first reference sentence with the similarity meeting a preset threshold value as the target reference sentence.
Optionally, the apparatus 30 further includes:
the second acquisition module is used for acquiring a plurality of sample sentences;
the translation module is used for translating the sample sentence according to a preset translation method to obtain a sample translation sentence;
a second determining module, configured to use the sample sentence and the sample translation sentence as a training sample pair;
and the training module is used for training the initial sentence conversion model by utilizing the training sample pair to obtain the target sentence conversion model.
Optionally, the apparatus 30 further includes:
and the setting module is used for setting the position embedding parameter in the initial sentence conversion model as a trainable value so as to carry out sample training on the position embedding parameter.
Optionally, the initial sentence conversion model is a Transformer model.
With regard to the apparatus in the above-described embodiment, the specific manner in which each module performs the operation has been described in detail in the embodiment related to the method, and will not be elaborated here.
According to an embodiment of the present disclosure, there is provided an electronic apparatus including: a processor, a memory for storing processor-executable instructions, wherein the processor is configured to implement the steps in the text processing method as in any of the embodiments described above when executed.
According to an embodiment of the present disclosure, there is also provided a non-transitory computer readable storage medium, wherein instructions, when executed by a processor of a mobile terminal, enable the mobile terminal to perform the steps of the text processing method as in any one of the above embodiments.
According to an embodiment of the present disclosure, there is also provided a computer program product comprising readable program code which, when executed by a processor of a mobile terminal, enables the mobile terminal to perform the steps of the text processing method as in any one of the embodiments described above.
FIG. 4 is a block diagram illustrating an apparatus for text processing in accordance with an exemplary embodiment. For example, the apparatus 400 may be a mobile phone, a computer, a digital broadcast terminal, a messaging device, a game console, a tablet device, a medical device, an exercise device, a personal digital assistant, and the like.
Referring to fig. 4, the apparatus 400 may include one or more of the following components: processing components 402, memory 404, power components 406, multimedia components 408, audio components 410, input/output (I/O) interfaces 412, sensor components 414, and communication components 416.
The processing component 402 generally controls overall operation of the apparatus 400, such as operations associated with display, telephone calls, data communications, camera operations, and recording operations. Processing component 402 may include one or more processors 420 to execute instructions to perform all or a portion of the steps of the text processing method described above. Further, the processing component 402 can include one or more modules that facilitate interaction between the processing component 402 and other components. For example, the processing component 402 can include a multimedia module to facilitate interaction between the multimedia component 408 and the processing component 402.
The memory 404 is configured to store various types of data to support operations at the device 400. Examples of such data include instructions for any application or method operating on the device 400, contact data, phonebook data, messages, pictures, videos, and so forth. The memory 404 may be implemented by any type or combination of volatile or non-volatile memory devices, such as Static Random Access Memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic or optical disks.
Power supply components 406 provide power to the various components of device 400. The power components 406 may include a power management system, one or more power supplies, and other components associated with generating, managing, and distributing power for the device 400.
The multimedia component 408 includes a screen that provides an output interface between the device 400 and the user. In some embodiments, the screen may include a Liquid Crystal Display (LCD) and a Touch Panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive an input signal from a user. The touch panel includes one or more touch sensors to sense touch, slide, and gestures on the touch panel. The touch sensor may not only sense the boundary of a touch or slide action, but also detect the duration and pressure associated with the touch or slide operation. In some embodiments, the multimedia component 408 includes a front facing camera and/or a rear facing camera. The front-facing camera and/or the rear-facing camera may receive external multimedia data when the device 400 is in an operational mode, such as a shooting mode or a video mode. Each front camera and rear camera may be a fixed optical lens system or have a focal length and optical zoom capability.
The audio component 410 is configured to output and/or input audio signals. For example, audio component 410 includes a Microphone (MIC) configured to receive external audio signals when apparatus 400 is in an operational mode, such as a call mode, a recording mode, and a voice recognition mode. The received audio signals may further be stored in the memory 404 or transmitted via the communication component 416. In some embodiments, audio component 410 also includes a speaker for outputting audio signals.
The I/O interface 412 provides an interface between the processing component 402 and peripheral interface modules, which may be keyboards, click wheels, buttons, etc. These buttons may include, but are not limited to: a home button, a volume button, a start button, and a lock button.
The sensor component 414 includes one or more sensors for providing various aspects of status assessment for the apparatus 400. For example, the sensor component 414 may detect the open/closed state of the apparatus 400 and the relative positioning of components, such as the display and keypad of the apparatus 400; it may also detect a change in position of the apparatus 400 or a component of the apparatus 400, the presence or absence of user contact with the apparatus 400, the orientation or acceleration/deceleration of the apparatus 400, and a change in the temperature of the apparatus 400. The sensor component 414 may include a proximity sensor configured to detect the presence of a nearby object in the absence of any physical contact. The sensor component 414 may also include a light sensor, such as a CMOS or CCD image sensor, for use in imaging applications. In some embodiments, the sensor component 414 may also include an acceleration sensor, a gyroscope sensor, a magnetic sensor, a pressure sensor, or a temperature sensor.
The communication component 416 is configured to facilitate wired or wireless communication between the apparatus 400 and other devices. The apparatus 400 may access a wireless network based on a communication standard, such as WiFi, an operator network (such as 2G, 3G, 4G, or 5G), or a combination thereof. In an exemplary embodiment, the communication component 416 receives broadcast signals or broadcast related information from an external broadcast management system via a broadcast channel. In an exemplary embodiment, the communication component 416 further includes a Near Field Communication (NFC) module to facilitate short-range communications. For example, the NFC module may be implemented based on Radio Frequency Identification (RFID) technology, infrared data association (IrDA) technology, ultra Wideband (UWB) technology, bluetooth (BT) technology, and other technologies.
In an exemplary embodiment, the apparatus 400 may be implemented by one or more Application Specific Integrated Circuits (ASICs), digital Signal Processors (DSPs), digital Signal Processing Devices (DSPDs), programmable Logic Devices (PLDs), field Programmable Gate Arrays (FPGAs), controllers, micro-controllers, microprocessors or other electronic components for performing the above-described text processing methods.
In an exemplary embodiment, a non-transitory computer-readable storage medium comprising instructions, such as memory 404 comprising instructions, executable by processor 420 of device 400 to perform the text processing method described above is also provided. For example, the non-transitory computer readable storage medium may be a ROM, a Random Access Memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, and the like.
FIG. 5 is a block diagram illustrating another apparatus for text processing in accordance with an example embodiment. For example, the apparatus 500 may be provided as a server. Referring to fig. 5, the apparatus 500 includes a processing component 522 that further includes one or more processors and memory resources, represented by memory 532, for storing instructions, such as applications, that are executable by the processing component 522. The application programs stored in memory 532 may include one or more modules that each correspond to a set of instructions. Further, the processing component 522 is configured to execute instructions to perform the text processing methods described above.
The apparatus 500 may also include a power component 526 configured to perform power management of the apparatus 500, a wired or wireless network interface 550 configured to connect the apparatus 500 to a network, and an input/output (I/O) interface 558. The apparatus 500 may operate based on an operating system stored in the memory 532, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, or the like.
Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the invention disclosed herein. This application is intended to cover any variations, uses, or adaptations of the disclosure following, in general, the principles of the disclosure and including such departures from the present disclosure as come within known or customary practice within the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims.
It will be understood that the present disclosure is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the present disclosure is limited only by the appended claims.

Claims (20)

1. A method of text processing, the method comprising:
acquiring an input text to be processed;
selecting a reference text which is matched with the text to be processed and meets the preset requirement from a preset text library;
in the reference text, determining a target reference sentence similar to the sentence to be modified in the text to be processed;
and converting the sentence to be modified according to the target reference sentence according to a target sentence conversion model to obtain a target recommended sentence corresponding to the sentence to be modified.
2. The method according to claim 1, wherein a plurality of sample texts are stored in the preset text library;
selecting a reference text which is matched with the text to be processed and meets preset requirements from a preset text library, wherein the reference text comprises the following steps:
determining a target field to which the text to be processed belongs, and determining a value label of each sample text;
classifying the value labels, and taking the sample text of which the processing result meets the preset requirement as a first type text;
screening sample texts belonging to the target field to serve as second type texts;
and taking sample texts which belong to the first type text and the second type text at the same time as the reference text.
3. The method according to claim 2, wherein the determining the target domain to which the text to be processed belongs comprises:
acquiring keywords in the text to be processed;
and taking the field matched with the keyword as a target field to which the text to be processed belongs.
4. The method of claim 2, further comprising:
splitting the sample text to obtain text fragments corresponding to different content attributes;
and respectively storing the text segments corresponding to the content attributes in the preset text library according to the content attributes.
5. The method according to claim 1, wherein the determining, in the reference text, a target reference sentence similar to a sentence to be modified in the text to be processed comprises:
determining a sentence to be modified in the text to be processed;
and screening the reference text according to the sentence to be modified and a preset text screening algorithm to determine a target reference sentence similar to the sentence to be modified.
6. The method according to claim 5, wherein the screening the reference text according to the sentence to be modified and a preset text screening algorithm to determine a target reference sentence similar to the sentence to be modified comprises:
screening sentences contained in the reference text by using a first screening algorithm, and determining a first reference sentence similar to the sentence to be modified;
and screening the first reference sentence by using a second screening algorithm, and taking the first reference sentence with the similarity meeting a preset threshold value as the target reference sentence.
7. The method according to any one of claims 1 to 6, further comprising:
obtaining a plurality of sample sentences;
translating the sample sentence according to a preset translation method to obtain a sample translation sentence;
taking the sample sentence and the sample translation sentence as a training sample pair;
and training an initial sentence conversion model by utilizing the training sample pair to obtain the target sentence conversion model.
8. The method of claim 7, further comprising:
setting a position embedding parameter in the initial sentence conversion model to a trainable value so as to carry out sample training on the position embedding parameter.
9. The method of claim 7, wherein the initial sentence conversion model is a Transformer model.
10. A text processing apparatus, characterized in that the apparatus comprises:
the first acquisition module is used for acquiring an input text to be processed;
the selection module is used for selecting a reference text which is matched with the text to be processed and meets the preset requirement from a preset text library;
the first determining module is used for determining a target reference sentence similar to the sentence to be modified in the text to be processed in the reference text;
and the conversion module is used for converting the sentence to be modified according to the target reference sentence according to a target sentence conversion model to obtain a target recommended sentence corresponding to the sentence to be modified.
11. The apparatus of claim 10, wherein the preset text library has a plurality of sample texts stored therein;
the selecting module is further configured to:
determining a target field to which the text to be processed belongs, and determining a value label of each sample text;
classifying the value labels, and taking the sample texts whose classification results meet the preset requirement as first type texts;
screening sample texts belonging to the target field to serve as second type texts;
and taking sample texts which belong to the first type text and the second type text at the same time as the reference text.
12. The apparatus of claim 11, wherein the selecting module is further configured to:
acquiring keywords in the text to be processed;
and taking the field matched with the keyword as a target field to which the text to be processed belongs.
13. The apparatus of claim 11, further comprising:
the splitting module is used for splitting the sample text to obtain text fragments corresponding to different content attributes;
and the storage module is used for respectively storing the text segments corresponding to the content attributes in the preset text library according to the content attributes.
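The splitting and storage modules of claim 13 can be sketched as follows; the paragraph-based splitting and the `classify_segment` attribute classifier are hypothetical stand-ins for whatever the implementation actually uses:

```python
def split_and_store(sample_text, classify_segment, library):
    """Split a sample text into segments and file each segment under
    its content attribute in the preset text library (a dict here)."""
    for segment in (p.strip() for p in sample_text.split("\n\n") if p.strip()):
        attribute = classify_segment(segment)
        # Segments sharing a content attribute are stored together,
        # so later retrieval can target one attribute at a time.
        library.setdefault(attribute, []).append(segment)
    return library
```

Storing segments per content attribute means the later reference-text selection can compare like with like, e.g. background passages against background passages.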
14. The apparatus of claim 10, wherein the first determining module is further configured to:
determining a sentence to be modified in the text to be processed;
and screening the reference text according to the sentence to be modified and a preset text screening algorithm to determine a target reference sentence similar to the sentence to be modified.
15. The apparatus of claim 14, wherein the first determining module is further configured to:
screening sentences contained in the reference text by using a first screening algorithm, and determining a first reference sentence similar to the sentence to be modified;
and screening the first reference sentence by using a second screening algorithm, and taking the first reference sentence with the similarity meeting a preset threshold value as the target reference sentence.
16. The apparatus of any one of claims 10 to 15, further comprising:
the second acquisition module is used for acquiring a plurality of sample sentences;
the translation module is used for translating the sample sentence according to a preset translation method to obtain a sample translation sentence;
a second determining module, configured to use the sample sentence and the sample translation sentence as a training sample pair;
and the training module is used for training the initial sentence conversion model by utilizing the training sample pair to obtain the target sentence conversion model.
17. The apparatus of claim 16, further comprising:
and the setting module is used for setting the position embedding parameter in the initial sentence conversion model to a trainable value, so that the position embedding parameter is updated during sample training.
18. The apparatus of claim 16, wherein the initial sentence conversion model is a Transformer model.
19. An electronic device, comprising:
a processor;
a memory for storing processor-executable instructions;
wherein the processor is configured to execute the instructions to implement the text processing method of any of claims 1 to 9.
20. A non-transitory computer readable storage medium, instructions in which, when executed by a processor of a mobile terminal, enable the mobile terminal to perform the text processing method of any one of claims 1 to 9.
CN202110865873.9A 2021-07-29 2021-07-29 Text processing method and device, electronic equipment and storage medium Pending CN115688685A (en)

Priority Applications (1)

Application Number: CN202110865873.9A; Priority Date: 2021-07-29; Filing Date: 2021-07-29; Title: Text processing method and device, electronic equipment and storage medium

Publications (1)

Publication Number: CN115688685A; Publication Date: 2023-02-03

Family ID: 85058701

Country Status (1): CN

Cited By (1)

* Cited by examiner, † Cited by third party

CN117408651A *; priority date 2023-12-15; published 2024-01-16; assignee 辽宁省网联数字科技产业有限公司; title: On-line compiling method and system for bidding scheme based on artificial intelligence



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination